Text-to-Knowledge-Graph Benchmark Comparison
An interactive visualization comparing AI model performance on text-to-knowledge-graph extraction tasks across multiple published benchmarks.
Embed This MicroSim
Copy this iframe to embed in your own website:
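A typical embed looks like the snippet below. The `src` URL is a placeholder; point it at wherever this MicroSim's `main.html` is actually hosted, and adjust the height to fit your page.

```html
<!-- Placeholder URL: replace with the actual location of this MicroSim's main.html -->
<iframe src="https://example.com/sims/text2kg-benchmark/main.html"
        width="100%" height="500" style="border: none;" scrolling="no"></iframe>
```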
Overview
This chart aggregates published benchmark results from multiple text-to-knowledge-graph evaluation frameworks, allowing comparison of how different language models extract structured knowledge graph triples from natural language text.
Misleading Metric Claim
Lettria claims a 99.8% score for its fine-tuned Gemma 3 27B model. However, this is a reliability score (the percentage of outputs the system can parse), not an F1 score. A model can have high reliability (well-formatted outputs) while still having lower F1 accuracy (incorrect extracted triples). The actual F1 scores for the fine-tuned model are not reported in the cited material. Source: Lettria Perseus Reference
Included Benchmarks
| Benchmark | Datasets | Description | Source |
|---|---|---|---|
| Text2KGBench | Wikidata-TekGen, DBpedia-WebNLG | Original ontology-driven KG extraction benchmark | Mihindukulasooriya et al., 2023 |
| Text2KGBench-LettrIA | DBpedia-WebNLG (refined) | Refined benchmark with improved data quality | Plu et al., 2025 |
| KG-Generation | General | Comparative study of KG generation | Trajanov et al., 2024 |
| Sepsis-KG | Medical/Sepsis domain | Domain-specific KG construction | Wang et al., 2025 |
Current Results Summary
| Model | F1 Score | Benchmark | Dataset |
|---|---|---|---|
| GPT-4 | 0.82 | KG-Generation | General |
| LLaMA 2 | 0.77 | KG-Generation | General |
| GPT-4 | 0.77 | Sepsis-KG | Sepsis |
| BERT | 0.72 | KG-Generation | General |
| Claude Opus 4.5 (Est.) | ~0.70 | Text2KGBench-LettrIA | DBpedia-WebNLG (refined) |
| Gemini 2.5 Pro | 0.6595 | Text2KGBench-LettrIA | DBpedia-WebNLG (refined) |
| Claude Sonnet 4 | 0.6487 | Text2KGBench-LettrIA | DBpedia-WebNLG (refined) |
| GPT-4.1 | 0.6472 | Text2KGBench-LettrIA | DBpedia-WebNLG (refined) |
| Llama 3 | 0.48 | Sepsis-KG | Sepsis |
| Qwen2 | 0.44 | Sepsis-KG | Sepsis |
| Vicuna-13B | 0.35 | Text2KGBench | Wikidata-TekGen |
| Vicuna-13B | 0.30 | Text2KGBench | DBpedia-WebNLG |
| Alpaca-LoRA-13B | 0.27 | Text2KGBench | Wikidata-TekGen |
| Alpaca-LoRA-13B | 0.25 | Text2KGBench | DBpedia-WebNLG |
Estimated Value
The Claude Opus 4.5 result (~0.70 F1) is an estimate, not a published benchmark result. It is based on the improvement Opus-class models typically show over Claude Sonnet 4 (0.6487) on reasoning tasks, roughly 5-15%; applying that uplift to 0.6487 gives about 0.68-0.75, so ~0.70 sits toward the lower end of that range. The actual performance may differ.
Note: Results from different benchmarks are not directly comparable due to differences in evaluation methodology, datasets, and task definitions.
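For readers who want a refresher on the metric being compared, F1 is the harmonic mean of precision and recall (the textbook definition, not anything specific to these benchmarks):

$$
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
$$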
Features
Interactive Elements
- Dataset Filter: Select specific datasets or benchmarks to compare
- Metric Filter: Filter by evaluation metric (F1_overall, Precision, Recall)
- Hover Tooltips: View exact scores by hovering over bars
- Clickable Legend: Click model names to show/hide their results
Visual Design
- Color-coded bars by model (GPT-4 green, Gemini blue, Claude brown, etc.)
- Clear axis labels showing dataset names and score values
- Responsive layout adapting to container width
Adding Your Own Data
Edit the data.csv file to add benchmark results. Key columns for the chart:
| Column | Description | Example |
|---|---|---|
| `chart_series` | Model name (determines color/legend) | GPT-4.1, Claude Sonnet 4 |
| `chart_label` | Dataset name (x-axis label) | DBpedia-WebNLG (refined) |
| `chart_metric` | Metric being measured | F1_overall |
| `chart_value` | The score (0-1 scale) | 0.6472 |
| `source_citation` | Paper citation | Plu et al., 2025 |
| `source_url` | Link to source paper | https://ceur-ws.org/... |
Example CSV Row
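An illustrative row, built from the GPT-4.1 Text2KGBench-LettrIA result in the table above and using only the key columns listed (the real data.csv may contain additional columns or a different column order):

```csv
chart_series,chart_label,chart_metric,chart_value,source_citation,source_url
GPT-4.1,DBpedia-WebNLG (refined),F1_overall,0.6472,"Plu et al., 2025",https://ceur-ws.org/...
```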
Customization Guide
Changing Colors
Edit the providerColors object in main.html:
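A sketch of what that object plausibly looks like, assuming it maps series (model) names to bar colors; the exact keys and hex values in main.html may differ:

```javascript
// Maps chart_series names from data.csv to bar colors.
// Keys and colors here are illustrative, not the actual values in main.html.
const providerColors = {
  "GPT-4":           "#2e7d32",  // green
  "GPT-4.1":         "#66bb6a",
  "Gemini 2.5 Pro":  "#1565c0",  // blue
  "Claude Sonnet 4": "#8d6e63",  // brown
  "LLaMA 2":         "#ef6c00",
  "Vicuna-13B":      "#6a1b9a"
};
```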
Adjusting Chart Height
Modify the .chart-container height in style.css:
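Assuming a rule along these lines (the current height value in style.css may differ):

```css
.chart-container {
  position: relative;  /* Chart.js responsive charts expect a positioned container */
  height: 500px;       /* increase or decrease to resize the chart */
}
```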
Technical Details
- Library: Chart.js 4.4.0
- Data Source: CSV file loaded dynamically (see the loading sketch after this list)
- Browser Compatibility: All modern browsers (Chrome, Firefox, Safari, Edge)
- Responsive: Yes, adapts to container width
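A minimal sketch of that load-and-render flow, assuming Chart.js is already loaded on the page, a canvas with id `chart` exists, and data.csv uses the column names described under "Adding Your Own Data"; the actual main.html likely structures this differently and parses CSV more robustly:

```javascript
// Fetch data.csv, keep the F1_overall rows, and draw a grouped bar chart.
async function drawChart() {
  const text = await (await fetch('data.csv')).text();
  const [header, ...lines] = text.trim().split('\n');
  const cols = header.split(',');
  const rows = lines
    .map(line => {
      const values = line.split(',');  // naive split: assumes no quoted commas
      return Object.fromEntries(cols.map((c, i) => [c, values[i]]));
    })
    .filter(r => r.chart_metric === 'F1_overall');

  // One x-axis label per dataset, one colored series per model.
  const labels = [...new Set(rows.map(r => r.chart_label))];
  const models = [...new Set(rows.map(r => r.chart_series))];
  const datasets = models.map(model => ({
    label: model,
    data: labels.map(label => {
      const row = rows.find(r => r.chart_series === model && r.chart_label === label);
      return row ? Number(row.chart_value) : null;  // gaps where a model has no score
    })
  }));

  new Chart(document.getElementById('chart'), {
    type: 'bar',
    data: { labels, datasets },
    options: { responsive: true, scales: { y: { min: 0, max: 1 } } }
  });
}
drawChart();
```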
Lesson Plan
Learning Objectives
By the end of this lesson, students will be able to:
- Explain what text-to-knowledge-graph benchmarks measure and why they matter
- Interpret F1 scores and compare model performance across different benchmarks
- Understand the importance of ontology compliance in knowledge graph extraction
- Analyze why frontier models (GPT-4, Claude, Gemini) outperform baseline models
- Recognize that benchmark results are not directly comparable across different evaluation frameworks
Target Audience
- College sophomores studying NLP or knowledge graphs
- Data scientists evaluating LLMs for information extraction
- Researchers working on knowledge graph construction
Prerequisites
- Basic understanding of knowledge graphs (nodes, edges, triples)
- Familiarity with precision, recall, and F1 score metrics
- Introduction to large language models
Activities
- Exploration (10 min): Use the dataset filter to compare models within the same benchmark. How do frontier models compare to baseline models on Text2KGBench-LettrIA?
- Analysis (15 min): Compare the F1 scores across benchmarks. Why might GPT-4 score 0.82 on KG-Generation but GPT-4.1 score 0.65 on Text2KGBench-LettrIA?
- Discussion (10 min): What factors affect benchmark comparability? Consider dataset size, ontology complexity, evaluation methodology, and prompting strategies.
Assessment
- Quiz: Why can't we directly compare F1 scores across different benchmarks?
- Practical: Find a published paper with KG extraction results and add them to data.csv
References
- Text2KGBench Paper - 2023 - ISWC - Mihindukulasooriya et al. Original benchmark introducing ontology-driven text-to-KG evaluation
- Text2KGBench-LettrIA Paper - 2025 - CEUR-WS - Plu et al. Refined benchmark with improved data quality and frontier model results
- KG Generation Comparative Study - 2024 - arXiv - Trajanov et al. Comparison of GPT-4, LLaMA 2, and BERT
- Sepsis KG Construction - 2025 - PMC - Wang et al. Domain-specific KG construction evaluation
- Text2KGBench GitHub - Repository with benchmark code, datasets, and baseline implementations
- Chart.js Documentation - Library documentation for the visualization framework