Text2KGBench-LettrIA F1 Scores Timeline
An interactive visualization showing how language models perform on the Text2KGBench-LettrIA Generalization benchmark over time, combining data from Text2KGBench (2023) and Text2KGBench-LettrIA (2025).
Overview
This chart displays F1 scores from two related benchmarks:
- Text2KGBench (2023) - The original benchmark by Mihindukulasooriya et al. for evaluating ontology-guided KG construction from text
- Text2KGBench-LettrIA (2025) - A refined version by Plu et al. with improved ontologies and annotations
Metrics
- Overall F1: Combined score for knowledge graph extraction (available for all models)
- Entities (E): F1 score for correctly identifying entity classes
- Attributes (A): F1 score for correctly extracting literal values
- Properties (P): F1 score for identifying datatype properties
- Relations (R): F1 score for identifying object properties
Note: The detailed E/A/P/R breakdown is only available for Text2KGBench-LettrIA (2025) models.
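To make these scores concrete, the sketch below shows one way a triple-level F1 can be computed. It assumes set-based exact matching of normalized subject|predicate|object strings; the benchmark's actual matching and normalization rules may differ.

```js
// Minimal sketch of triple-level F1, assuming predictions and gold annotations
// are normalized "subject|predicate|object" strings (exact-match comparison).
function f1Score(predicted, gold) {
  const goldSet = new Set(gold);
  const truePositives = predicted.filter(t => goldSet.has(t)).length;
  const precision = predicted.length ? truePositives / predicted.length : 0;
  const recall = gold.length ? truePositives / gold.length : 0;
  return precision + recall ? (2 * precision * recall) / (precision + recall) : 0;
}

// Example: 2 of 3 predictions match the 4 gold triples.
const predicted = ["Paris|capitalOf|France", "Paris|locatedIn|Europe", "Paris|population|2.1M"];
const gold = [
  "Paris|capitalOf|France",
  "Paris|locatedIn|Europe",
  "France|locatedIn|Europe",
  "Paris|population|2102650",
];
console.log(f1Score(predicted, gold).toFixed(3)); // precision 0.667, recall 0.500 -> 0.571
```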
Key Findings
- Dramatic improvement from 2023 to 2025: Early open-weights models (Vicuna, Alpaca) achieved only 0.25-0.30 Overall F1, while modern models reach 0.85+
- Fine-tuned open models outperform proprietary ones: Models like Gemma 3 27B-IT and Qwen3 8B achieve higher scores than zero-shot proprietary models
- Top performers (2025): Claude Sonnet 4 (0.87), Gemini 2.5 Pro (0.86), and fine-tuned Qwen3 32B (0.85)
Data Summary
Text2KGBench (2023) - Original Benchmark
| Model | Publisher | Release | Overall F1 |
|---|---|---|---|
| Vicuna-13B | Open-weights | 2023-03 | 0.30 |
| Alpaca-LoRA-13B | Open-weights | 2023-03 | 0.25 |
Text2KGBench-LettrIA (2025) - Top Performers
| Model | Publisher | Release | Overall | Entities | Attributes | Properties | Relations |
|---|---|---|---|---|---|---|---|
| Claude Sonnet 4 | Anthropic | 2025-05 | 0.870 | 0.783 | 0.951 | 0.928 | 0.718 |
| Gemini 2.5 Pro | Google | 2025-03 | 0.860 | 0.775 | 0.958 | 0.937 | 0.724 |
| Qwen3 32B (FT) | Alibaba | 2025-05 | 0.851 | 0.775 | 0.920 | 0.898 | 0.709 |
| Gemma 3 12B-IT (FT) | Google | 2025-03 | 0.845 | 0.838 | 0.928 | 0.892 | 0.722 |
| Mistral Small 3.2 (FT) | Mistral | 2025-06 | 0.843 | 0.801 | 0.937 | 0.911 | 0.722 |
(FT) = Fine-tuned on Text2KGBench-LettrIA
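For reference, one record in data.json might look like the sketch below. The field names are assumptions rather than the actual schema; the values are taken from the table above.

```js
// Hypothetical shape of one data.json record (field names are assumptions,
// not the actual schema); scores copied from the Claude Sonnet 4 row above.
const exampleRecord = {
  model: "Claude Sonnet 4",
  publisher: "Anthropic",
  release: "2025-05",
  openWeights: false,
  fineTuned: false,
  scores: {
    overall: 0.870,
    entities: 0.783,
    attributes: 0.951,
    properties: 0.928,
    relations: 0.718,
  },
};
```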
Features
Interactive Controls
- Metric selector: Switch between Overall F1 and detailed E/A/P/R metrics
- Model type filter: View all models, closed/proprietary only, or open-weights only
- Benchmark filter: Compare across benchmarks or focus on one version
- Hover tooltips: See all available scores for any data point
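One way the model type filter could be wired up is sketched below. It assumes one Chart.js dataset per model group, a hypothetical datasetType tag on each dataset, and a `<select id="typeFilter">` control; none of these names come from the actual script.js. The setDatasetVisibility call itself is a built-in Chart.js v3+ API.

```js
// Sketch of the model-type filter: toggle dataset visibility to match the
// selected option ("all" | "closed" | "open"). `chart` is the Chart instance.
const filter = document.getElementById('typeFilter'); // hypothetical element id
filter.addEventListener('change', () => {
  chart.data.datasets.forEach((dataset, index) => {
    const visible = filter.value === 'all' || dataset.datasetType === filter.value;
    chart.setDatasetVisibility(index, visible); // built-in Chart.js API
  });
  chart.update(); // redraw with the new visibility state
});
```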
Visual Encoding
- Circle = Closed/Proprietary models (zero-shot)
- Triangle = Open-weights models (fine-tuned)
- Colors by publisher: Anthropic (orange), Google (blue), OpenAI (green), Mistral (coral), Alibaba (purple), Microsoft (cyan)
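Illustrative styling that matches this encoding is sketched below. The hex values approximate the palette rather than reproduce it, and the openWeights and publisher fields follow the hypothetical data.json shape sketched earlier.

```js
// Approximate publisher palette (hex values are illustrative).
const publisherColors = {
  Anthropic: '#e8833a', // orange
  Google: '#4285f4',    // blue
  OpenAI: '#34a853',    // green
  Mistral: '#ff7f6e',   // coral
  Alibaba: '#8e44ad',   // purple
  Microsoft: '#00bcd4', // cyan
};

function styleDataset(dataset) {
  dataset.pointStyle = dataset.openWeights ? 'triangle' : 'circle'; // shape encodes model type
  dataset.backgroundColor = publisherColors[dataset.publisher];     // color encodes publisher
  return dataset;
}
```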
Classroom Discussion
Have students examine the visualization and discuss the following questions:
Trend Analysis
- Performance Trajectory: What overall trend do you observe in F1 scores from 2023 to 2025? What factors might explain this rapid improvement?
- Open vs. Closed Models: Compare the performance of fine-tuned open-weights models to zero-shot proprietary models. What implications does this have for organizations building knowledge graph applications?
- Metric Breakdown: Switch between the different metrics (E, A, P, R). Which extraction task appears most challenging for models? Why might Relations (R) scores be consistently lower than Attributes (A)?
Implications for Intelligent Textbooks
- Knowledge Graph Quality: If an intelligent textbook uses LLMs to automatically extract concepts, relationships, and attributes from educational content, what minimum F1 score would you consider acceptable? How would extraction errors impact learning?
- Cost-Performance Tradeoffs: Consider that smaller fine-tuned models (e.g., Qwen3 4B at 0.81) approach the performance of large proprietary models. What are the implications for deploying knowledge extraction in resource-constrained educational settings?
- Benchmark Evolution: The refinement from Text2KGBench (2023) to Text2KGBench-LettrIA (2025) improved annotation quality. How might benchmark improvements influence the development of educational AI systems?
Design Exercise
- Building a Learning Graph Pipeline: Design a pipeline that uses text-to-knowledge-graph models to automatically generate learning graphs from textbook chapters (a starter sketch follows this list). Consider:
    - Which model(s) would you choose and why?
    - How would you handle extraction errors?
    - What human-in-the-loop validation would be needed?
    - How would you ensure concept dependencies are correctly identified?
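As a starting point, one possible skeleton combines ontology validation with a confidence threshold that routes uncertain triples to human reviewers. Every name here (extractTriples, ontology.isValid, the 0.9 threshold) is hypothetical, not a real API.

```js
// Hypothetical pipeline skeleton: extract, validate against the ontology,
// then split by confidence into accepted triples and a human review queue.
async function buildLearningGraph(chapters, ontology) {
  const accepted = [];
  const reviewQueue = [];
  for (const chapter of chapters) {
    const triples = await extractTriples(chapter.text, ontology); // placeholder LLM call
    for (const triple of triples) {
      if (!ontology.isValid(triple)) continue;       // reject triples outside the ontology
      if (triple.confidence >= 0.9) accepted.push(triple);
      else reviewQueue.push(triple);                 // low confidence -> human-in-the-loop
    }
  }
  return { accepted, reviewQueue };
}
```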
Technical Details
- Library: Chart.js 4.4.0 with date-fns adapter
- Chart Type: Scatter plot with time scale (quarterly)
- Time Range: January 2023 to January 2026
- Source Files:
    - main.html - HTML structure and styling
    - script.js - Chart.js visualization logic
    - data.json - Model benchmark data (34 models with scores)
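A minimal version of this setup might look like the sketch below. It assumes Chart.js 4.4.0 and chartjs-adapter-date-fns are already loaded (for example via CDN script tags); the canvas id and the single dataset shown are illustrative, not the full data.json contents.

```js
// Minimal Chart.js time-scale scatter setup matching the details above.
const ctx = document.getElementById('timeline'); // hypothetical canvas id
new Chart(ctx, {
  type: 'scatter',
  data: {
    datasets: [{
      label: 'Claude Sonnet 4',
      data: [{ x: '2025-05-01', y: 0.87 }], // release date vs. Overall F1
      pointStyle: 'circle',
      backgroundColor: '#e8833a',
    }],
  },
  options: {
    scales: {
      x: { type: 'time', time: { unit: 'quarter' }, min: '2023-01-01', max: '2026-01-01' },
      y: { min: 0, max: 1, title: { display: true, text: 'F1 score' } },
    },
  },
});
```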
References
- Text2KGBench (2023) - Mihindukulasooriya et al.
- Text2KGBench-LettrIA (2025) - Plu et al.
- Chart.js Documentation