Transformers
Hugging Face library providing thousands of pretrained models for NLP, vision, and audio.
Steady cadence of MoE model adds and tokenizer patches — the library is doing its job.
◆Recent moves
- 1d ago
Release v5.9.0
v5.9.0 adds Cohere's Command A+ (Cohere2Moe — hybrid sliding-window/full attention, shared and routed experts, long context), NVIDIA's Parakeet TDT, and HRM-Text. Another routine minor release on the every-two-weeks model-add cadence.
View source ↗ - 8d ago
Patch release v5.8.1
Patch primarily to fix DeepSeek V4: ContinuousBatchingManager fatal-error handling, a WeightConverter regex that was misclassifying shared experts as routed experts, and a CSA mask collapse bug. The kind of follow-up that always lands a week after a major vendor model addition.
View source ↗ - 15d ago
Release 5.8.0
v5.8.0 adds DeepSeek-V4 (Flash, Pro, and Base variants) with substantial architectural detail — hybrid local+long-range attention replacing MLA, Manifold-Constrained Hyper-Connections in place of residuals, and hash-table bootstrapped MoE layers. Notable for the depth of architectural novelty being absorbed in a routine release.
View source ↗ - 22d ago
Release v5.7.0
v5.7.0 adds Laguna, Poolside's MoE family with per-layer head counts and a sigmoid router using auxiliary-loss-free load balancing. Fits the broader pattern of vendors shipping increasingly idiosyncratic MoE variants and transformers absorbing each one.
View source ↗ - 27d ago
Fix Qwen 3.5/3.6 MoE FP8 inference
Patch fixing Qwen 3.5/3.6 MoE FP8 inference, broken by an earlier kernels configuration change. Single-vendor quantization regression of the kind these patch releases routinely catch.
View source ↗ - 28d ago
Fix broken flash-attention forward path
Patch fixing a broken flash-attention forward path caused by an unhandled s_aux=None case. Hot-fix of a regression the prior minor release introduced.
View source ↗