AI Is All You Need

LLM Benchmarking Shows Capabilities Doubling Every 7 Months

IEEE Spectrum - AI

Jul 2, 2025 15:37

Glenn Zorpette


ai, research, ieee, technology

Summary

A new benchmarking approach from the Berkeley think tank METR measures LLM progress by comparing model performance with human completion times across tasks of varying complexity. The findings show LLM capabilities doubling roughly every seven months, an exponential rate of improvement that underscores the need for updated evaluation methods and has significant implications for both the development and oversight of AI systems.

The main purpose of many large language models (LLMs) is to provide compelling text that’s as close as possible to being indistinguishable from human writing. And therein lies a major reason why it’s so hard to gauge the relative performance of LLMs using traditional benchmarks: quality of writing doesn’t necessarily correlate with metrics traditionally used to measure processor performance, such as instruction execution rate.

But researchers at the Berkeley, Calif., think tank METR (for Model Evaluation & Threat Research) have come up with an ingenious idea. First, identify a series of tasks of varying complexity and record the average time it takes a group of humans to complete each task. Then have various versions of LLMs complete the same tasks, noting cases in which a version of an LLM successfully completes the task with some level of reliability, say 50 percent of the time. Plots of the resulting data confirm that as
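The measurement procedure described above can be illustrated with a rough Python sketch. It is not METR's actual code: the function names, the task data, and the 14-month gap between the two model versions are all hypothetical, and the horizon is found with a simple scan over tasks rather than whatever curve fitting the researchers may use. It only shows how a 50 percent success threshold turns per-task results into a single "task length" figure, and how two such figures imply a doubling time.

```python
import math


def horizon_at_threshold(task_results, threshold=0.5):
    """Return the longest human completion time (in minutes) that the model
    still handles with a success rate of at least `threshold`.

    task_results: list of (human_minutes, success_rate) pairs, one per task.
    Simplifying assumption: tasks are scanned from shortest to longest and
    the scan stops at the first task that falls below the threshold.
    """
    horizon = 0.0
    for human_minutes, success_rate in sorted(task_results):
        if success_rate < threshold:
            break
        horizon = human_minutes
    return horizon


def doubling_time_months(horizon_old, horizon_new, months_between):
    """Doubling time implied by two horizons, assuming exponential growth."""
    return months_between * math.log(2) / math.log(horizon_new / horizon_old)


def projected_horizon(horizon_now, months_ahead, doubling_months=7.0):
    """Extrapolate a horizon forward, assuming the doubling rate holds."""
    return horizon_now * 2 ** (months_ahead / doubling_months)


# Hypothetical results: (average human minutes, model success rate) per task.
older_model = [(1, 0.90), (4, 0.70), (16, 0.30), (60, 0.05)]
newer_model = [(1, 1.00), (4, 0.90), (16, 0.55), (60, 0.20)]

old_horizon = horizon_at_threshold(older_model)
new_horizon = horizon_at_threshold(newer_model)

# Assumed for illustration: the two model versions were released 14 months apart.
implied_doubling = doubling_time_months(old_horizon, new_horizon, 14)
one_year_out = projected_horizon(new_horizon, 12)

print(old_horizon, new_horizon, implied_doubling, one_year_out)
```

With these made-up numbers the horizon grows fourfold over 14 months, which corresponds to a seven-month doubling time. Real task data would be noisier, so a fitted curve would be more appropriate than the first-failure scan used here.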

