ONNX Runtime

AI-ASSISTANTS
Velocity 3.8

Cross-platform inference and training engine for ONNX-format machine-learning models.

ONNX Runtime is unbundling its execution providers into independently shippable plugins.

execution-providers, plugin-architecture, llm-inference, quantization, security-hardening, webgpu
Current state
ONNX Runtime is mid-transition to a plugin-based execution-provider architecture: EPs that were once compiled into the core binary now ship as separately versioned libraries that register at runtime. Recent releases pair heavy LLM-oriented kernel work (attention, quantized MatMul/MoE, KV-cache) with deep security hardening across operators.
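To make "register at runtime" concrete, here is a minimal, purely conceptual sketch of a host-side registry that accepts separately versioned providers after the host has started. It does not use or mirror ONNX Runtime's actual API: `PluginInfo`, `ProviderRegistry`, and `FakeWebGpuProvider` are hypothetical names invented for illustration, and a real plugin would be a native shared library rather than a Python class.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PluginInfo:
    """Metadata a plugin reports about itself (hypothetical shape)."""
    name: str
    version: str  # versioned independently of the host runtime


class ProviderRegistry:
    """Toy host-side registry; NOT ONNX Runtime's actual API."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], object]] = {}
        self._info: Dict[str, PluginInfo] = {}

    def register(self, info: PluginInfo, factory: Callable[[], object]) -> None:
        # Registration happens at runtime: the host was built without the plugin.
        if info.name in self._factories:
            raise ValueError(f"provider {info.name!r} is already registered")
        self._factories[info.name] = factory
        self._info[info.name] = info

    def available(self) -> List[PluginInfo]:
        # Sorted by name so the listing is deterministic.
        return sorted(self._info.values(), key=lambda item: item.name)

    def create(self, name: str) -> object:
        # Raises KeyError if no plugin with this name was registered.
        return self._factories[name]()


class FakeWebGpuProvider:
    """Stand-in for a backend shipped in its own library (assumption)."""

    def run(self, x: float) -> float:
        return x * 2.0


registry = ProviderRegistry()
registry.register(PluginInfo("webgpu-plugin", "0.1.0"), FakeWebGpuProvider)
provider = registry.create("webgpu-plugin")
print([p.name for p in registry.available()], provider.run(21.0))
```

The point of the design is that the host only knows the registry interface; which backends exist, and at what version, is decided by whatever gets registered after startup.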
Where it's heading
The direction is decoupling: the CUDA Plugin EP landed in 1.25, and the WebGPU EP has since shipped as a standalone plugin that registers against any compatible ORT install. EPs can now iterate on their own release cadence, and third parties can deliver hardware backends without rebuilding ORT, while the core concentrates on LLM inference primitives and takes breaking platform-baseline raises (C++20, CUDA 12->13).
Prediction
Expect more first-party EPs (TensorRT, QNN, CoreML) to migrate to the plugin model, and expect a published, stable plugin-EP API surface to become the default integration path.

Recent moves

  1. 15d ago

    ONNX Runtime v1.27.0

    v1.27.0 deepens the plugin EP API (zero-copy I/O, CUDA plugin options), adds LLM kernel work across CUDA/WebGPU/CPU, new datatypes (FLOAT8E8M0), and an extensive batch of security and correctness fixes.

  2. 1mo ago

    ONNX Runtime WebGPU Plugin EP v0.1.0

    ⚡ SPARK

    First standalone release of the WebGPU EP as a plugin, distributed and versioned independently of the core onnxruntime binary and registered at runtime against ORT 1.24.4+.

  3. 2mo ago

    ONNX Runtime v1.25.1

    Patch release adding newer opset versions and LinearAttention/CausalConvState/RotaryEmbedding/RMSNorm operators for Qwen3.5 support, plus WebGPU decode-path optimizations.

  4. 2mo ago

    ONNX Runtime v1.25.0

    ⚡ SPARK

    v1.25.0 introduces the CUDA Plugin EP, the first core implementation letting third-party CUDA-backed EPs load as dynamic plugins without rebuilding ORT, alongside breaking baseline raises (C++20, CUDA 12 minimum) and broad LLM op coverage.

  5. 3mo ago

    ONNX Runtime v1.24.4

    Patch release for the 1.24 line with null-pointer and MetaDef ID fixes in core and plugin-EP code, QNN EP tweaks, and a version bump tied to dropping Python 3.10.

  6. 4mo ago

    ONNX Runtime v1.24.3

    Patch release for 1.24 with several security/OOB fixes, a 4x CPU speedup for 4-bit QMoE, and assorted EP and build fixes.

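The WebGPU plugin entry above says the plugin registers against ORT 1.24.4+. A host application might gate plugin loading on that floor before attempting registration. The sketch below is one way to do the comparison under stated assumptions: `parse_version` and `host_is_compatible` are hypothetical helpers, and the version strings are passed in as plain text rather than read from an installed runtime.

```python
import re
from typing import Tuple

MIN_HOST_VERSION = "1.24.4"  # floor quoted in the WebGPU plugin entry above


def parse_version(text: str) -> Tuple[int, int, int]:
    """Extract the leading major.minor.patch triple.

    Anything after the patch number (for example an 'rc1' suffix) is
    ignored, which is a simplifying assumption of this sketch.
    """
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def host_is_compatible(host_version: str, minimum: str = MIN_HOST_VERSION) -> bool:
    # Tuples compare numerically per component, so 1.24.10 ranks above 1.24.4,
    # which a plain string comparison would get wrong.
    return parse_version(host_version) >= parse_version(minimum)


for candidate in ("1.24.3", "1.24.4", "1.25.1", "v1.27.0"):
    print(candidate, host_is_compatible(candidate))
```

In a real application the host string would typically come from the installed package, for example `onnxruntime.__version__`; comparing parsed integers rather than raw strings avoids misordering two-digit patch numbers.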