Knowledge Graph Construction Pipeline

Run the Knowledge Graph Construction Pipeline MicroSim Fullscreen

About This MicroSim

This MicroSim presents the knowledge graph (KG) construction pipeline as an interactive flowchart. Students step through the stages of building a biomedical KG, from heterogeneous source databases to a unified, queryable graph.

Pipeline Stages

Source Databases — Multiple biological databases (UniProt, PDB, KEGG, STRING, OMIM) provide raw data
Extraction — Parse and extract entities and relationships from each source format (XML, JSON, flat files)
Schema Mapping — Map extracted entities and relationships to a common ontology or schema (e.g., map UniProt protein IDs to gene names, or map source-specific interaction labels to one shared relationship type)
Entity Resolution — Identify and merge duplicate entities across sources (the same protein may have different IDs in different databases)
Integration — Combine all resolved entities and relationships into a single graph structure
Unified Knowledge Graph — The final KG with consistent entity types, relationship labels, and properties
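
The stages above can be sketched as a chain of small functions. This is a minimal illustration, not the MicroSim's implementation: the two toy sources, their field names, the `RELATION_MAP` and `CANONICAL_ID` tables, and the shared `INTERACTS_WITH` label are all assumptions made for the example. In particular, treating "binds" as equivalent to "interacts_with" is a simplification.

```python
# Toy records standing in for already-parsed source data (hypothetical shapes).
SOURCE_A = [  # a source keyed by UniProt accession
    {"subject": "P04637", "predicate": "interacts_with", "object": "Q00987"},
]
SOURCE_B = [  # a source keyed by gene symbol, with different field names
    {"from": "TP53", "type": "binds", "to": "MDM2"},
]

# Schema mapping table (assumed): source-specific labels -> one shared label.
RELATION_MAP = {"interacts_with": "INTERACTS_WITH", "binds": "INTERACTS_WITH"}

# Entity resolution table (assumed): every known identifier -> canonical node ID.
CANONICAL_ID = {"P04637": "TP53", "TP53": "TP53", "Q00987": "MDM2", "MDM2": "MDM2"}


def extract(records, subj_key, pred_key, obj_key):
    """Extraction: pull (subject, predicate, object) triples from source fields."""
    return [(r[subj_key], r[pred_key], r[obj_key]) for r in records]


def map_schema(triples):
    """Schema mapping: rewrite each relationship label to the shared vocabulary."""
    return [(s, RELATION_MAP[p], o) for s, p, o in triples]


def resolve(triples):
    """Entity resolution: replace source identifiers with canonical node IDs."""
    return [(CANONICAL_ID[s], p, CANONICAL_ID[o]) for s, p, o in triples]


def integrate(*triple_lists):
    """Integration: union all triples; a set drops exact duplicates."""
    graph = set()
    for triples in triple_lists:
        graph.update(triples)
    return graph


def build_graph():
    a = resolve(map_schema(extract(SOURCE_A, "subject", "predicate", "object")))
    b = resolve(map_schema(extract(SOURCE_B, "from", "type", "to")))
    return integrate(a, b)


print(build_graph())  # {('TP53', 'INTERACTS_WITH', 'MDM2')}
```

Both sources describe the same interaction, so after schema mapping and entity resolution the two records collapse into a single edge. Removing either step would leave two edges that look unrelated.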

How to Use

Click each stage to see its description, challenges, and example tools
Step through — Follow the data flow from source databases to unified KG
Consider the challenges — Entity resolution and schema mapping are typically the hardest stages, because identifiers and relationship vocabularies differ from one source to the next

Iframe Embed Code

<iframe src="https://dmccreary.github.io/bioinformatics/sims/kg-construction-pipeline/main.html"
        height="520"
        width="100%"
        scrolling="no"></iframe>

Lesson Plan

Grade Level

College introductory bioinformatics

Duration

15-20 minutes

Prerequisites

Understanding of biological databases and their contents
Concept of knowledge graphs (nodes, edges, types)
Basic awareness of data integration challenges

Activities

Exploration (5 min): Click each pipeline stage and note the key challenge at each step.
Entity Resolution Exercise (5 min): The protein p53 is identified by the gene symbol "TP53" (the HGNC-approved symbol), by the accession "P04637" in UniProt, and by the ID "7157" in Entrez Gene. Discuss how entity resolution would merge these into a single node.
Discussion (5 min): Why is schema mapping necessary? What happens if two databases use different relationship types for the same biological interaction?
Assessment (3 min): Answer the reflection questions below.
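
For the entity resolution exercise, one possible approach is to treat cross-references between identifiers as links and group everything that is connected. The sketch below uses a small union-find structure for that. The `XREFS` pairs and the namespace labels (`uniprot`, `entrez_gene`, `gene_symbol`) are hypothetical inputs chosen for illustration; a real pipeline would load cross-references from the source databases and would need rules for conflicting or ambiguous mappings.

```python
def find(parent, x):
    """Return the representative of x's group, shortening paths along the way."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the groups containing a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def resolve_entities(xrefs):
    """Group identifiers that are linked, directly or indirectly, by xref pairs."""
    parent = {}
    for a, b in xrefs:
        union(parent, a, b)
    groups = {}
    for ident in list(parent):
        groups.setdefault(find(parent, ident), set()).add(ident)
    return list(groups.values())


# Hypothetical cross-reference pairs: (namespace, identifier) tuples.
XREFS = [
    (("uniprot", "P04637"), ("entrez_gene", "7157")),
    (("entrez_gene", "7157"), ("gene_symbol", "TP53")),
    (("uniprot", "Q00987"), ("gene_symbol", "MDM2")),
]

for cluster in resolve_entities(XREFS):
    print(sorted(cluster))
```

Each resulting cluster becomes one node in the graph, with the grouped identifiers stored as properties. Note that "P04637" and "TP53" are never linked directly here; they end up together because both are linked to "7157".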

Assessment

What are the six stages of the KG construction pipeline?
Why is entity resolution one of the most challenging stages?
How does a common ontology (such as the Gene Ontology) help with schema mapping?
What would happen if you skipped the entity resolution stage?