Tag: benchmarks
All the articles with the tag "benchmarks".
-
NVIDIA AgentPerf Makes Agent Serving A Datacenter Metric
NVIDIA's AA-AgentPerf results show why agent workloads need new infrastructure benchmarks built around concurrent sessions, tool calls, and latency.
-
ADK Arena Is A Reality Check For Agent Frameworks
ADK Arena tests agent frameworks across real benchmark tasks and finds what builders already feel: no framework owns the agent stack yet.
-
Constraint Decay Is Why Coding Agents Break in Real Repos
A new arXiv paper found coding agents lose about 30 points as structural backend constraints accumulate. The lesson is simple: demos reward output; production rewards constraint discipline.
-
Every Frontier AI Model Just Scored Below 1% on a Reasoning Test. Humans Score 100%.
ARC-AGI-3 is the first interactive reasoning benchmark for AI agents. Gemini scored 0.37%. GPT-5.4 scored 0.26%. Claude scored 0.25%. Humans score 100%. The gap is not closing.