Transformers

AI-ASSISTANTS
Velocity 2.5

Hugging Face library providing thousands of pretrained models for NLP, vision, and audio.

Steady cadence of MoE model adds and tokenizer patches — the library is doing its job.

mixture-of-experts, model-integration, patch-cadence, tokenizers, attention-variants
Current state
Transformers is in a routine release rhythm: a minor release roughly every one to two weeks adding new model families (Cohere2Moe, DeepSeek-V4, Laguna from Poolside, Parakeet, HRM-Text, OpenAI Privacy Filter), interleaved with patch releases that fix tokenizers, attention paths, and vendor-specific integration bugs (Qwen 3.5/3.6 FP8, Kimi-K2.5 tokenizer, Gemma4 device-map). Mixture-of-experts is the dominant architecture in this window: most newly added models are MoE variants.
Where it's heading
The library is consolidating its position as the reference implementation for new model architectures: as soon as a vendor ships a frontier model, the corresponding transformers integration lands within days or weeks. MoE with novel routing (sigmoid routers, expert-id hashing), often paired with hybrid attention, is becoming the default architectural assumption, and transformers is absorbing the variations without major API churn. The patch-release pattern (flash-attention paths, FP8 quantization fixes, tokenizer regressions) shows the maintenance load is concentrated at the integration edges, not the core.
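For readers unfamiliar with sigmoid routers, here is a minimal sketch of the idea in plain Python. It is not taken from transformers or any vendor's code. It assumes one common auxiliary-loss-free balancing scheme: a per-expert bias that affects only which experts are selected and is nudged after each batch according to observed load. The function names (route, update_bias) and the step size are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route(logits, bias, top_k):
    """Choose top_k experts for one token.

    Each expert gets an independent sigmoid score (no softmax across experts).
    The bias only influences *which* experts are selected; the gate weights
    are computed from the unbiased scores and normalized to sum to 1.
    """
    scores = [sigmoid(z) for z in logits]
    ranked = sorted(range(len(scores)),
                    key=lambda e: scores[e] + bias[e],
                    reverse=True)
    chosen = ranked[:top_k]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


def update_bias(bias, load, step=0.01):
    """Assumed balancing rule: lower the bias of overloaded experts and
    raise it for underloaded ones, instead of adding an auxiliary loss."""
    mean = sum(load) / len(load)
    new_bias = []
    for b, tokens in zip(bias, load):
        if tokens > mean:
            new_bias.append(b - step)
        elif tokens < mean:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0, -1.0]
    print(route(logits, bias=[0.0, 0.0, 0.0, 0.0], top_k=2))
    # A large bias on expert 2 pulls it into the selection.
    print(route(logits, bias=[0.0, 0.0, 1.0, 0.0], top_k=2))
```

Keeping the bias out of the gate weights means balancing changes which experts see a token without distorting how their outputs are mixed. That is the usual motivation for this style of router, though individual models may differ in the details.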
Prediction
The next minor release will almost certainly add another two-to-four MoE models on the current cadence, and the next patch release will land within a week to fix whatever quantization or tokenizer regression slipped through. Watch for a deeper refactor of the MoE routing abstractions if vendor architectures keep diverging — the current per-model branches are accumulating.

Recent moves

  1. 1d ago

    Release v5.9.0

    v5.9.0 adds Cohere's Command A+ (Cohere2Moe — hybrid sliding-window/full attention, shared and routed experts, long context), NVIDIA's Parakeet TDT, and HRM-Text. Another routine minor release on the every-two-weeks model-add cadence.

  2. 8d ago

    Patch release v5.8.1

    Patch primarily to fix DeepSeek V4: ContinuousBatchingManager fatal-error handling, a WeightConverter regex that was misclassifying shared experts as routed experts, and a CSA mask collapse bug. The kind of follow-up that always lands a week after a major vendor model addition.

  3. 15d ago

    Release v5.8.0

    v5.8.0 adds DeepSeek-V4 (Flash, Pro, and Base variants) with substantial architectural detail — hybrid local+long-range attention replacing MLA, Manifold-Constrained Hyper-Connections in place of residuals, and hash-table bootstrapped MoE layers. Notable for the depth of architectural novelty being absorbed in a routine release.

  4. 22d ago

    Release v5.7.0

    v5.7.0 adds Laguna, Poolside's MoE family with per-layer head counts and a sigmoid router using auxiliary-loss-free load balancing. Fits the broader pattern of vendors shipping increasingly idiosyncratic MoE variants and transformers absorbing each one.

  5. 27d ago

    Fix Qwen 3.5/3.6 MoE FP8 inference

    Patch fixing Qwen 3.5/3.6 MoE FP8 inference, broken by an earlier kernels configuration change. Single-vendor quantization regression of the kind these patch releases routinely catch.

  6. 28d ago

    Fix broken flash-attention forward path

    Patch fixing a broken flash-attention forward path caused by an unhandled s_aux=None case. Hot-fix of a regression the prior minor release introduced.

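The v5.8.1 item above mentions a weight-conversion regex that misclassified shared experts as routed experts. The source does not give the actual pattern or the checkpoint key names, so the sketch below uses hypothetical keys and patterns purely to illustrate how that class of bug can arise and how a tighter pattern avoids it.

```python
import re

# Hypothetical checkpoint keys; the real DeepSeek-V4 names may differ.
KEYS = [
    "model.layers.3.mlp.experts.7.gate_proj.weight",
    "model.layers.3.mlp.shared_experts.gate_proj.weight",
    "model.layers.3.self_attn.q_proj.weight",
]

# Too loose: "experts" also occurs inside "shared_experts".
LOOSE_ROUTED = re.compile(r"experts")

# Tighter: require a path-segment boundary and a numeric expert index.
STRICT_ROUTED = re.compile(r"(?:^|\.)experts\.(\d+)\.")
SHARED = re.compile(r"(?:^|\.)shared_experts\.")


def classify_loose(key: str) -> str:
    """Buggy variant: anything containing 'experts' counts as routed."""
    return "routed" if LOOSE_ROUTED.search(key) else "other"


def classify(key: str) -> str:
    """Check shared experts first, then routed experts by indexed segment."""
    if SHARED.search(key):
        return "shared"
    if STRICT_ROUTED.search(key):
        return "routed"
    return "other"


if __name__ == "__main__":
    for key in KEYS:
        print(f"{key}: loose={classify_loose(key)} strict={classify(key)}")
```

Anchoring on the path separator and the expert index is one way to keep the two groups apart. The actual fix in the release may have taken a different approach.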