MMLU Timeline MicroSim
Run the MMLU Timeline MicroSim
MMLU Timeline with Chart.js Version 3
MMLU Timeline with Chart.js Version 2
Notes
Based on my research, I've updated the MMLU benchmark data with the most recent results from various AI models through April 2025. Here's what I found:
Latest MMLU Benchmark Scores (as of May 2025)
The interactive chart now includes the most recent benchmark scores from the leading AI models:
- GPT-4.1: Achieves 90.2% on the MMLU benchmark, as reported by OpenAI in April 2025; OpenAI highlights GPT-4.1's "strong 90.2% score on the Massive Multitask Language Understanding (MMLU) benchmark".
- OpenAI o1 (o1-high): Scored around 91.5% as of February 2025, representing OpenAI's reasoning model.
- Claude 3.7 Sonnet: Scores around 87-88% based on various reports, though some sources show results in the 80-85% range; one gives Claude 3.7 Sonnet "a MMLU score of 0.803" while others indicate higher scores.
- Claude 3.5 Sonnet: Achieved 88.7% in June 2024; one review notes that Claude 3.5 Sonnet "proves that it can complete natural language tasks with better quality than its predecessors and competitors such as GPT-4, with a score of 88.7% in the MMLU benchmark".
- Gemini 2.5 Pro: Reaches approximately 89.8% as of April 2025, making it highly competitive with the best models from OpenAI.
- Llama 4 (Maverick): Achieves 85.5%; one comparison notes that on MMLU, GPT-4o's rumored 87-88% range exceeds Maverick's 85.5%.
- Grok 3: Reportedly tops several benchmarks with an impressive 92.7% on MMLU; one report says Grok 3 "tops several academic benchmarks (92.7% MMLU, ~89% GSM8K for math) with a massive 2.7-trillion-parameter design".
The chart now includes data from 2020 through April 2025, showing the dramatic improvement in model performance over this period, from the early GPT-3 scores in the 40% range to today's frontier models exceeding 90%.
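For reference, here is a minimal sketch of how these data points could be fed to a Chart.js (v3 or later) line chart, in the spirit of this MicroSim. It assumes a page with a `<canvas id="mmluChart">` element and the Chart.js library already loaded; the canvas id, styling choices, and the dates marked "approx." are illustrative assumptions, while the scores come from the list above.

```js
// Scores taken from the notes above. Dates marked "approx." are a
// reading of the surrounding text, not confirmed release dates.
const mmluScores = [
  { model: "GPT-3",             date: "2020",         score: 43.9 }, // the "40% range" era
  { model: "Claude 3.5 Sonnet", date: "2024-06",      score: 88.7 },
  { model: "o1 (high)",         date: "2025-02",      score: 91.5 },
  { model: "Llama 4 Maverick",  date: "2025 approx.", score: 85.5 },
  { model: "Claude 3.7 Sonnet", date: "2025 approx.", score: 87.5 }, // midpoint of 87-88%
  { model: "GPT-4.1",           date: "2025-04",      score: 90.2 },
  { model: "Gemini 2.5 Pro",    date: "2025-04",      score: 89.8 },
  { model: "Grok 3",            date: "2025 approx.", score: 92.7 },
];

new Chart(document.getElementById("mmluChart"), {
  type: "line",
  data: {
    // A category axis avoids needing a date-adapter plugin.
    labels: mmluScores.map((d) => `${d.model} (${d.date})`),
    datasets: [{
      label: "MMLU score (%)",
      data: mmluScores.map((d) => d.score),
      borderColor: "steelblue",
      fill: false,
      tension: 0.2,
    }],
  },
  options: {
    scales: {
      y: { min: 40, max: 100, title: { display: true, text: "MMLU (%)" } },
    },
  },
});
```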
Notable Trends
- The past year has seen intense competition, with scores improving from the mid-80s to over 90%.
- Different measurement methodologies can affect scores, making direct comparisons challenging; model creators have reported MMLU scores using non-standard prompting techniques (see the sketch after this list).
- The gap between leading models has narrowed significantly, with several top models now performing within a few percentage points of each other.
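To make the methodology point concrete: prompting differences are hard to show in a few lines, but aggregation is another methodology choice that moves the headline number. The toy sketch below (with made-up per-subject counts, not real MMLU results) shows how pooling all questions versus averaging per-subject accuracies yields two different "scores" for the exact same answers.

```js
// Toy illustration: identical answers, two headline numbers.
// Per-subject counts are invented for the example.
const subjects = [
  { name: "abstract_algebra", correct: 60,  total: 100 },
  { name: "us_history",       correct: 180, total: 200 },
];

// Micro average: pool every question across subjects, then divide.
const micro =
  subjects.reduce((s, x) => s + x.correct, 0) /
  subjects.reduce((s, x) => s + x.total, 0);

// Macro average: compute each subject's accuracy, then average them,
// weighting every subject equally regardless of its size.
const macro =
  subjects.reduce((s, x) => s + x.correct / x.total, 0) / subjects.length;

console.log(micro.toFixed(3)); // 0.800
console.log(macro.toFixed(3)); // 0.750
```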
This updated visualization should help track the rapid advancement in AI capabilities as measured by the MMLU benchmark, which tests knowledge across 57 subjects ranging from STEM fields to the humanities.