Chapter 2 Quiz: AI Capability Curve

Test your understanding of how AI capabilities are measured, how fast they are growing, and what that growth means for educational planning. Questions cover Remember, Understand, Apply, and Analyze levels of learning.

Questions

1. What is an AI benchmark, and why are benchmarks used to track AI progress?

Answer: An AI benchmark is a standardized test — a set of problems or tasks with known correct answers — used to measure how capable an AI model is in a given area. Benchmarks allow researchers, vendors, and policymakers to compare models objectively and track improvement over time. For educational leaders, benchmarks provide evidence-based signals about whether AI tools are genuinely improving or simply being marketed as improved.
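
As a minimal sketch of the idea, a benchmark score can be treated as the share of questions a model answers correctly. The function name `benchmark_accuracy`, the five-question answer key, and both sets of model answers are hypothetical and are not taken from any real benchmark.

```python
def benchmark_accuracy(model_answers, answer_key):
    """Return the fraction of answers that match the key (0.0 to 1.0)."""
    if len(model_answers) != len(answer_key):
        raise ValueError("answer lists must be the same length")
    if not answer_key:
        raise ValueError("answer key must not be empty")
    correct = sum(1 for got, want in zip(model_answers, answer_key) if got == want)
    return correct / len(answer_key)


# Hypothetical five-question multiple-choice benchmark (illustrative data only).
answer_key = ["B", "D", "A", "C", "B"]
model_a = ["B", "D", "A", "A", "B"]  # 4 of 5 correct
model_b = ["B", "C", "A", "A", "D"]  # 2 of 5 correct

score_a = benchmark_accuracy(model_a, answer_key)
score_b = benchmark_accuracy(model_b, answer_key)
print(f"Model A: {score_a:.0%}, Model B: {score_b:.0%}")
```

Because both models face the same questions and the same key, the two scores are directly comparable, which is the property that makes benchmarks useful for tracking progress.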

2. What does the MMLU benchmark measure, and what does a model's MMLU score tell us?

Answer: MMLU stands for Massive Multitask Language Understanding; it tests a model's knowledge with multiple-choice questions across 57 academic subjects ranging from elementary mathematics to professional law and medicine. A high MMLU score indicates that a model can answer expert-level questions across many disciplines. For education, this benchmark matters because it offers one signal of whether the model behind an AI tutoring or content tool has broad enough subject-matter knowledge to be considered for use with students, although a high score alone does not guarantee accuracy in the classroom.

3. What is Task Horizon, and why is it a useful concept for understanding AI capability growth?

Answer: Task Horizon refers to the length and complexity of tasks that an AI agent can complete reliably without human intervention — measured in minutes or hours of equivalent human work time. As task horizon grows, AI moves from answering simple questions to completing multi-step projects autonomously. For schools, tracking task horizon helps administrators understand which administrative and instructional workflows are becoming feasible to automate.

4. What did the METR Study measure, and what was its significance for AI capability forecasting?

Answer: The METR Study measured how AI agents performed on real-world software engineering tasks of varying complexity, establishing a quantitative relationship between task length (measured by how long the task takes a human) and AI success rates. Its significance is that it provided empirical data showing that the length of tasks AI agents can complete has been doubling roughly every seven months, moving this from speculation to a tracked metric. Educational leaders can use this type of evidence to ground their planning in data rather than vendor claims.

5. What is 'Fifty Percent Reliability,' and why does this threshold matter for deploying AI in schools?

Answer: Fifty Percent Reliability refers to the point at which an AI agent succeeds at a given category of tasks approximately half the time without human help. While this sounds low, it marks a critical threshold: tasks at this level become candidates for human-AI collaboration where the AI handles the first draft and a human reviews it. For schools, identifying which tasks have crossed this threshold helps prioritize where to introduce AI assistance first.
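
One rough way to picture this threshold is to record success rates for tasks of different lengths and find the longest task length where the AI still succeeds at least half the time. The sketch below is a deliberate simplification (a fuller analysis would fit a curve to the data), and the function name `fifty_percent_horizon` and all success rates are hypothetical.

```python
def fifty_percent_horizon(success_by_minutes):
    """Return the longest task length (in minutes) with success rate >= 0.5, or None."""
    passing = [minutes for minutes, rate in success_by_minutes.items() if rate >= 0.5]
    return max(passing) if passing else None


# Hypothetical success rates keyed by how long the task takes a human (minutes).
observed = {5: 0.95, 15: 0.80, 60: 0.55, 240: 0.20}

horizon = fifty_percent_horizon(observed)
print(f"Tasks up to about {horizon} minutes succeed at least half the time")
```

In this made-up data, tasks of about an hour would be candidates for the draft-then-review pattern described above, while four-hour tasks would not yet qualify.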

6. What does 'Capability Doubling' mean in the context of AI, and roughly how often has it been observed?

Answer: Capability Doubling means that the range of tasks AI can reliably complete has been growing exponentially, with the task horizon approximately doubling every several months based on recent empirical measurements (roughly every seven months in METR's data). This rapid pace means that capabilities available only in research labs today may be standard in commercial tools within one to two years. School leaders should plan AI strategies with the expectation that the tools available at implementation will be significantly more powerful than those available during planning.
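
The arithmetic behind a doubling trend can be sketched as follows, assuming the doubling continues at a steady rate, which is itself uncertain. The function name `projected_horizon`, the 60-minute starting horizon, and the 6-month doubling period are illustrative assumptions rather than measured values.

```python
def projected_horizon(current_minutes, months_ahead, doubling_months):
    """Project the task horizon assuming steady exponential doubling."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return current_minutes * 2 ** (months_ahead / doubling_months)


# Hypothetical inputs: a 60-minute horizon today, doubling every 6 months.
for months in (0, 6, 12, 24):
    minutes = projected_horizon(60, months, 6)
    print(f"In {months:>2} months: about {minutes:.0f} minutes")
```

Under these assumptions the horizon grows sixteenfold over a two-year planning window, which illustrates why tools at implementation time may look very different from tools at planning time.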

7. How is AI capability growth similar to Moore's Law, and what are the limits of that analogy?

Answer: Like Moore's Law — which described the doubling of transistors on a chip roughly every two years — AI capability has shown an exponential growth pattern on benchmark performance and task horizon. The analogy is useful for communicating to boards and parents that AI improvement is not random but follows a measurable curve. However, the analogy has limits: Moore's Law eventually slowed due to physical constraints, and it is unknown whether AI capability growth will similarly plateau or continue at its current pace.

8. What is the difference between a Reasoning Model and a standard language model, and why does it matter for education?

Answer: A Reasoning Model is an LLM specifically trained or prompted to work through problems step by step: breaking down complex questions, checking intermediate conclusions, and arriving at more reliable answers. Standard language models typically produce an answer directly, without an explicit multi-step reasoning phase. For education, reasoning models are better suited for mathematics, logic, and scientific problem-solving, making them more trustworthy as tutors in STEM subjects.

9. What is benchmark saturation, and why does it complicate comparing AI models?

Answer: Benchmark saturation occurs when AI models score so highly on a benchmark — often near 90–100% — that the benchmark can no longer distinguish between models or measure further improvement. When a benchmark is saturated, vendors may still cite high scores even though the test no longer captures meaningful differences in capability. Educational buyers should look for recently released, harder benchmarks rather than relying on scores from tests that top models have already mastered.
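
A simple check for saturation can be sketched by asking two questions: is the best score very high, and are the top models bunched too closely to tell apart? The function name `is_saturated`, the 0.90 and 0.02 thresholds, and both score lists are illustrative assumptions, not established standards.

```python
def is_saturated(scores, high_score=0.90, min_spread=0.02):
    """Flag a benchmark as saturated when the best score is high and models bunch together."""
    if not scores:
        raise ValueError("scores must not be empty")
    best = max(scores)
    spread = best - min(scores)
    return best >= high_score and spread < min_spread


# Hypothetical scores for three leading models on two benchmarks.
older_benchmark = [0.97, 0.96, 0.965]  # high scores, tiny gaps
newer_benchmark = [0.62, 0.48, 0.35]   # lower scores, clear gaps

print(is_saturated(older_benchmark))
print(is_saturated(newer_benchmark))
```

On this made-up data the older benchmark is flagged and the newer one is not, matching the advice to prefer harder, more recent tests when comparing vendors.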

10. What is the relationship between Intelligence and Price in AI, and what does this trend mean for schools with tight budgets?

Answer: Over time, the cost of a given level of AI capability has fallen dramatically — tasks that required expensive frontier models a year ago can often be handled by smaller, cheaper models today. This Intelligence-versus-Price trend means that even schools with limited technology budgets can access increasingly powerful AI tools as prices continue to decline. Budget planning should account for this trend by avoiding long-term contracts that lock in today's high prices for tomorrow's commodity capabilities.

11. What is an Agentic Task, and how does it differ from a One-Shot Task?

Answer: A One-Shot Task is a request that an AI completes in a single response — such as answering a question or summarizing a paragraph. An Agentic Task requires the AI to plan, take multiple sequential actions, use tools, and adapt based on intermediate results — such as researching a topic, drafting a report, and formatting it for a specific audience. As AI agents become capable of longer agentic tasks, they can take over more substantial portions of administrative and instructional workflows.

12. What is Model Release Cadence, and why should school technology directors track it?

Answer: Model Release Cadence refers to the frequency with which AI providers release new, more capable versions of their models. For leading providers, this has been roughly every six to twelve months. Technology directors should track this because each new release may make current workflows obsolete, open new possibilities, or require policy updates. Planning cycles for AI adoption should be short enough to incorporate new model capabilities without being so reactive that they create constant disruption.

13. What does 'Adoption Versus Capability' mean as a concept in AI strategy?

Answer: Adoption Versus Capability describes the gap between what AI can do technically and how widely it is actually used in practice. Historically, AI capability has advanced faster than institutional adoption, meaning schools often lag behind available technology. Recognizing this gap helps leaders understand that the challenge is not only acquiring capable tools but also building the culture, training, and processes to use them effectively.

14. What is Cost Per Task in AI, and how does it affect the business case for AI adoption in education?

Answer: Cost Per Task is the price of having an AI complete one unit of work — such as generating one quiz, grading one essay, or producing one lesson plan. As AI capabilities grow and competition among providers increases, cost per task has been falling rapidly. For education, a lower cost per task makes it financially viable to automate previously manual work at scale, strengthening the return-on-investment argument for AI adoption even in budget-constrained districts.
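
As a rough sketch of the arithmetic, cost per task can be estimated from how much text goes into and comes out of the model. This assumes token-based pricing quoted per million tokens; the function name `cost_per_task`, the token counts, and the prices below are hypothetical and do not reflect any vendor's actual rates.

```python
def cost_per_task(input_tokens, output_tokens, input_price_per_million, output_price_per_million):
    """Estimate the dollar cost of one task under per-million-token pricing."""
    total = input_tokens * input_price_per_million + output_tokens * output_price_per_million
    return total / 1_000_000


# Hypothetical quiz-generation task: 2,000 tokens in, 1,000 tokens out,
# priced at $1.00 (input) and $4.00 (output) per million tokens.
per_quiz = cost_per_task(2_000, 1_000, 1.00, 4.00)
per_500_quizzes = per_quiz * 500

print(f"One quiz: ${per_quiz:.4f}")
print(f"500 quizzes: ${per_500_quizzes:.2f}")
```

Multiplying the per-task figure by expected volume gives a number that can be compared directly against the staff time the same work would otherwise require.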

15. Why is understanding Capability Forecasting important when making multi-year AI purchasing decisions?

Answer: Capability Forecasting is the practice of projecting what AI systems will be able to do over a future time horizon, based on observed growth curves and benchmark trends. Multi-year purchasing decisions — such as a three-year curriculum platform contract — may lock a district into technology that will be far less capable than what becomes available during the contract period. Leaders who understand capability forecasting can negotiate more flexible contracts and avoid overpaying for today's capabilities when tomorrow's will be substantially better.