Text-to-Knowledge-Graph Benchmark Comparison
An interactive visualization comparing AI model performance on text-to-knowledge-graph extraction tasks across multiple published benchmarks.
Embed This MicroSim
Copy this iframe to embed in your own website:
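A typical embed looks like the snippet below. The `src` URL is a placeholder; point it at wherever this MicroSim's `main.html` is actually hosted, and adjust the height to fit your page.

```html
<!-- Placeholder URL: replace with the actual location of this MicroSim's main.html -->
<iframe src="https://example.com/sims/text2kg-benchmark/main.html"
        width="100%" height="500" style="border: none;" scrolling="no"></iframe>
```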
Overview
This chart aggregates published benchmark results from multiple text-to-knowledge-graph evaluation frameworks, allowing comparison of how different language models extract structured knowledge graph triples from natural language text.
Misleading Metric Claim
Lettria claims a 99.8% score for its fine-tuned Gemma 3 27B model. However, this is a reliability score (the percentage of outputs the system can parse), not an F1 score. A model can have high reliability (well-formatted outputs) while still having lower F1 accuracy (incorrect extracted triples). The actual F1 scores for the fine-tuned model are not reported in the cited material. Source: Lettria Perseus Reference
Included Benchmarks
| Benchmark | Datasets | Description | Source |
|---|---|---|---|
| Text2KGBench | Wikidata-TekGen, DBpedia-WebNLG | Original ontology-driven KG extraction benchmark | Mihindukulasooriya et al., 2023 |
| Text2KGBench-LettrIA | DBpedia-WebNLG (refined) | Refined benchmark with improved data quality | Plu et al., 2025 |
| KG-Generation | General | Comparative study of KG generation | Trajanov et al., 2024 |
| Sepsis-KG | Medical/Sepsis domain | Domain-specific KG construction | Wang et al., 2025 |
Current Results Summary
| Model | F1 Score | Benchmark | Dataset |
|---|---|---|---|
| GPT-4 | 0.82 | KG-Generation | General |
| LLaMA 2 | 0.77 | KG-Generation | General |
| GPT-4 | 0.77 | Sepsis-KG | Sepsis |
| BERT | 0.72 | KG-Generation | General |
| Claude Opus 4.5 (Est.) | ~0.70 | Text2KGBench-LettrIA | DBpedia-WebNLG (refined) |
| Gemini 2.5 Pro | 0.6595 | Text2KGBench-LettrIA | DBpedia-WebNLG (refined) |
| Claude Sonnet 4 | 0.6487 | Text2KGBench-LettrIA | DBpedia-WebNLG (refined) |
| GPT-4.1 | 0.6472 | Text2KGBench-LettrIA | DBpedia-WebNLG (refined) |
| Llama 3 | 0.48 | Sepsis-KG | Sepsis |
| Qwen2 | 0.44 | Sepsis-KG | Sepsis |
| Vicuna-13B | 0.35 | Text2KGBench | Wikidata-TekGen |
| Vicuna-13B | 0.30 | Text2KGBench | DBpedia-WebNLG |
| Alpaca-LoRA-13B | 0.27 | Text2KGBench | Wikidata-TekGen |
| Alpaca-LoRA-13B | 0.25 | Text2KGBench | DBpedia-WebNLG |
Estimated Value
The Claude Opus 4.5 result (~0.70 F1) is an estimate, not a published benchmark result. It is based on the improvement Opus-class models typically show over Claude Sonnet 4 (0.6487) on reasoning tasks, roughly 5-15%; applying that uplift to 0.6487 gives about 0.68-0.75, so ~0.70 sits toward the lower end of that range. The actual performance may differ.
Note: Results from different benchmarks are not directly comparable due to differences in evaluation methodology, datasets, and task definitions.
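For readers who want a refresher on the metric being compared, F1 is the harmonic mean of precision and recall (the textbook definition, not anything specific to these benchmarks):

$$
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
$$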
Features
Interactive Elements
- Dataset Filter: Select specific datasets or benchmarks to compare
- Metric Filter: Filter by evaluation metric (F1_overall, Precision, Recall)
- Hover Tooltips: View exact scores by hovering over bars
- Clickable Legend: Click model names to show/hide their results
Visual Design
- Color-coded bars by model (GPT-4 green, Gemini blue, Claude brown, etc.)
- Clear axis labels showing dataset names and score values
- Responsive layout adapting to container width
Adding Your Own Data
Edit the data.csv file to add benchmark results. Key columns for the chart:
| Column | Description | Example |
|---|---|---|
| `chart_series` | Model name (determines color/legend) | GPT-4.1, Claude Sonnet 4 |
| `chart_label` | Dataset name (x-axis label) | DBpedia-WebNLG (refined) |
| `chart_metric` | Metric being measured | F1_overall |
| `chart_value` | The score (0-1 scale) | 0.6472 |
| `source_citation` | Paper citation | Plu et al., 2025 |
| `source_url` | Link to source paper | https://ceur-ws.org/... |
Example CSV Row
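An illustrative row, built from the GPT-4.1 Text2KGBench-LettrIA result in the table above and using only the key columns listed (the real data.csv may contain additional columns or a different column order):

```csv
chart_series,chart_label,chart_metric,chart_value,source_citation,source_url
GPT-4.1,DBpedia-WebNLG (refined),F1_overall,0.6472,"Plu et al., 2025",https://ceur-ws.org/...
```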
Customization Guide
Changing Colors
Edit the providerColors object in main.html:
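A sketch of what that object plausibly looks like, assuming it maps series (model) names to bar colors; the exact keys and hex values in main.html may differ:

```javascript
// Maps chart_series names from data.csv to bar colors.
// Keys and colors here are illustrative, not the actual values in main.html.
const providerColors = {
  "GPT-4":           "#2e7d32",  // green
  "GPT-4.1":         "#66bb6a",
  "Gemini 2.5 Pro":  "#1565c0",  // blue
  "Claude Sonnet 4": "#8d6e63",  // brown
  "LLaMA 2":         "#ef6c00",
  "Vicuna-13B":      "#6a1b9a"
};
```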
Adjusting Chart Height
Modify the .chart-container height in style.css:
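Assuming a rule along these lines (the current height value in style.css may differ):

```css
.chart-container {
  position: relative;  /* Chart.js responsive charts expect a positioned container */
  height: 500px;       /* increase or decrease to resize the chart */
}
```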
Technical Details
- Library: Chart.js 4.4.0
- Data Source: CSV file loaded dynamically (see the loading sketch after this list)
- Browser Compatibility: All modern browsers (Chrome, Firefox, Safari, Edge)
- Responsive: Yes, adapts to container width
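A minimal sketch of that load-and-render flow, assuming Chart.js is already loaded on the page, a canvas with id `chart` exists, and data.csv uses the column names described under "Adding Your Own Data"; the actual main.html likely structures this differently and parses CSV more robustly:

```javascript
// Fetch data.csv, keep the F1_overall rows, and draw a grouped bar chart.
async function drawChart() {
  const text = await (await fetch('data.csv')).text();
  const [header, ...lines] = text.trim().split('\n');
  const cols = header.split(',');
  const rows = lines
    .map(line => {
      const values = line.split(',');  // naive split: assumes no quoted commas
      return Object.fromEntries(cols.map((c, i) => [c, values[i]]));
    })
    .filter(r => r.chart_metric === 'F1_overall');

  // One x-axis label per dataset, one colored series per model.
  const labels = [...new Set(rows.map(r => r.chart_label))];
  const models = [...new Set(rows.map(r => r.chart_series))];
  const datasets = models.map(model => ({
    label: model,
    data: labels.map(label => {
      const row = rows.find(r => r.chart_series === model && r.chart_label === label);
      return row ? Number(row.chart_value) : null;  // gaps where a model has no score
    })
  }));

  new Chart(document.getElementById('chart'), {
    type: 'bar',
    data: { labels, datasets },
    options: { responsive: true, scales: { y: { min: 0, max: 1 } } }
  });
}
drawChart();
```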
Lesson Plan
Learning Objectives
By the end of this lesson, students will be able to:
- Explain what text-to-knowledge-graph benchmarks measure and why they matter
- Interpret F1 scores and compare model performance across different benchmarks
- Understand the importance of ontology compliance in knowledge graph extraction
- Analyze why frontier models (GPT-4, Claude, Gemini) outperform baseline models
- Recognize that benchmark results are not directly comparable across different evaluation frameworks
Target Audience
- College sophomores studying NLP or knowledge graphs
- Data scientists evaluating LLMs for information extraction
- Researchers working on knowledge graph construction
Prerequisites
- Basic understanding of knowledge graphs (nodes, edges, triples)
- Familiarity with precision, recall, and F1 score metrics
- Introduction to large language models
Activities
- Exploration (10 min): Use the dataset filter to compare models within the same benchmark. How do frontier models compare to baseline models on Text2KGBench-LettrIA?
- Analysis (15 min): Compare the F1 scores across benchmarks. Why might GPT-4 score 0.82 on KG-Generation but GPT-4.1 score 0.65 on Text2KGBench-LettrIA?
- Discussion (10 min): What factors affect benchmark comparability? Consider dataset size, ontology complexity, evaluation methodology, and prompting strategies.
Assessment
- Quiz: Why can't we directly compare F1 scores across different benchmarks?
- Practical: Find a published paper with KG extraction results and add them to data.csv
References
- Text2KGBench Paper - 2023 - ISWC - Mihindukulasooriya et al. Original benchmark introducing ontology-driven text-to-KG evaluation
- Text2KGBench-LettrIA Paper - 2025 - CEUR-WS - Plu et al. Refined benchmark with improved data quality and frontier model results
- KG Generation Comparative Study - 2024 - arXiv - Trajanov et al. Comparison of GPT-4, LLaMA 2, and BERT
- Sepsis KG Construction - 2025 - PMC - Wang et al. Domain-specific KG construction evaluation
- Text2KGBench GitHub - Repository with benchmark code, datasets, and baseline implementations
- Chart.js Documentation - Library documentation for the visualization framework