Arize AI
AI observability and LLM evaluation platform for monitoring model performance in production.
Arize stakes a claim in coding-agent observability while repositioning Phoenix for agent contexts
◆ Recent moves
- 2d ago
How to build LLM-as-a-Judge evaluators that hold up in production
Playbook for building LLM-as-a-judge evaluators that hold up in production, anchored to Phoenix Evals with fixed labels, human-agreement checks, and trace context. Productizes a pattern most teams currently re-invent and reinforces Arize's positioning as the authority on evaluation done right.
- 2d ago
What we learned testing 7 models under the same agent harness
Research piece arguing model swaps behave like product migrations rather than configuration changes, with measured behavior across seven models under the same agent harness. Reinforces the 'agents are systems, not models' narrative and pre-empts customer churn by signaling expertise on swap risk.
- 3d ago
Building a self-improving agent on a context graph of human disagreement
Technique post showing how to build measurably better agents from human-correction data without retraining, by capturing disagreement as a context graph. A strong research signal even though framed as how-to rather than a productized feature.
- 5d ago
Coding agent tracing and evaluation: An open source tool to improve AI coding workflows
⚡ SPARK: Arize ships an open-source harness for tracing and evaluating coding-agent workflows across Claude Code, Cursor, Codex, GitHub Copilot, and Gemini CLI. Plants Arize directly in the highest-traffic agent surface and extends Phoenix into a new category.
- 9d ago
How we use Alyx to build Alyx: How to build an AI agent feedback loop
Dogfooding post on using Arize's own Alyx agent to debug and improve Alyx — searching dense traces, aggregating failures, triaging dogfooding issues. Proof point that the platform survives sustained internal use, not just demos.
- 11d ago
Models got an order of magnitude better at following instructions in one year
Benchmark research showing frontier models' instruction-following capacity grew from roughly 200-300 to roughly 2,000 simultaneous constraints in one year. Useful baseline for the broader pitch that agent reliability bottlenecks have shifted from the model to the surrounding system.