Text Mining Pipeline for Knowledge Graph Population
Run the Text Mining Pipeline for Knowledge Graph Population MicroSim in fullscreen
About This MicroSim
This MicroSim presents the text mining pipeline for extracting biomedical knowledge from scientific literature and populating a knowledge graph. The flowchart shows how unstructured text (PubMed abstracts) is transformed into structured knowledge (triples) through NLP processing.
Pipeline Stages
- PubMed Abstracts (blue) — Raw scientific text from the biomedical literature
- Named Entity Recognition (NER) (green) — Identify biomedical entities (genes, diseases, drugs, proteins) in the text
- Relation Extraction (green) — Determine relationships between identified entities (e.g., "drug X inhibits protein Y")
- Triple Generation (green) — Convert extracted relationships into structured triples (subject, predicate, object)
- KG Update (purple) — Add new triples to the knowledge graph, resolving entities and deduplicating
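The stages above can be sketched end to end in a few lines of Python. This is a minimal illustration, not the MicroSim's implementation: the lexicon, the trigger phrases, the second example sentence, and the function names (`recognize_entities`, `extract_triples`, `update_graph`) are all assumptions made for the example. Real pipelines use trained NER and relation-extraction models rather than dictionary lookup.

```python
import re

# Toy lexicon (an assumption): lowercase surface form -> (canonical ID, entity type).
LEXICON = {
    "brca1": ("BRCA1", "gene"),
    "breast cancer": ("breast_cancer", "disease"),
    "breast carcinoma": ("breast_cancer", "disease"),  # synonym, same canonical ID
}

# Toy trigger phrases (an assumption): connecting text -> predicate.
TRIGGERS = {"is associated with": "associated_with", "inhibits": "inhibits"}


def recognize_entities(text):
    """NER: return (start, end, canonical_id, type) for each lexicon match, in text order."""
    found = []
    for surface, (canonical, etype) in LEXICON.items():
        pattern = r"\b" + re.escape(surface) + r"\b"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            found.append((m.start(), m.end(), canonical, etype))
    return sorted(found)


def extract_triples(text):
    """Relation extraction and triple generation for adjacent entity pairs."""
    entities = recognize_entities(text)
    triples = []
    for left, right in zip(entities, entities[1:]):
        # The text between two neighboring entities is looked up as a trigger phrase.
        between = text[left[1]:right[0]].strip().lower()
        predicate = TRIGGERS.get(between)
        if predicate:
            triples.append((left[2], predicate, right[2]))
    return triples


def update_graph(graph, triples):
    """KG update: a set drops triples that resolve to the same canonical IDs."""
    graph.update(triples)
    return graph


abstracts = [
    "BRCA1 is associated with breast cancer.",
    "Mutated BRCA1 is associated with breast carcinoma.",  # made-up variant
]
graph = set()
for abstract in abstracts:
    update_graph(graph, extract_triples(abstract))
print(graph)  # {('BRCA1', 'associated_with', 'breast_cancer')}
```

Both sentences yield the same triple because "breast carcinoma" resolves to the same canonical ID as "breast cancer", so the graph ends up with a single entry. That is the entity resolution and deduplication step in miniature.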
Color Coding
- Blue — Input data (raw text)
- Green — NLP processing stages
- Purple — Knowledge graph output
How to Use
- Click each pipeline stage to see its description, example tools, and sample outputs
- Follow the transformation — Watch how "BRCA1 is associated with breast cancer" becomes the triple (BRCA1, associated_with, breast_cancer)
- Consider challenges — Ambiguity, negation, and abbreviations make each NLP stage difficult
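To make the transformation and the negation challenge concrete, the sketch below converts the example sentence into a triple and flags a negated variant. It handles only one sentence shape. The regular expression, the helper names (`normalize`, `sentence_to_triple`), and the negated sentence are assumptions constructed for illustration, not part of the MicroSim.

```python
import re

# Hypothetical pattern for one sentence shape:
# "<subject> is [not] associated with <object>."
PATTERN = re.compile(
    r"^(?P<subject>.+?) is (?P<neg>not )?associated with (?P<object>.+?)\.?$",
    re.IGNORECASE,
)


def normalize(mention):
    """Turn a mention into an identifier-style token: 'breast cancer' -> 'breast_cancer'."""
    return re.sub(r"\s+", "_", mention.strip())


def sentence_to_triple(sentence):
    """Return (triple, negated), or None if the sentence does not fit the pattern."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    triple = (
        normalize(match["subject"]),
        "associated_with",
        normalize(match["object"]),
    )
    # The optional "not " group is None when the sentence is affirmative.
    return triple, match["neg"] is not None


print(sentence_to_triple("BRCA1 is associated with breast cancer"))
# (('BRCA1', 'associated_with', 'breast_cancer'), False)

# Constructed negated variant, used only to show the flag:
print(sentence_to_triple("BRCA1 is not associated with breast cancer."))
# (('BRCA1', 'associated_with', 'breast_cancer'), True)
```

A pipeline that ignored the negation flag would add the same triple for both sentences, which is one way text-mined graphs pick up false assertions.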
Lesson Plan
Grade Level
College (introductory bioinformatics)
Duration
15-20 minutes
Prerequisites
- Basic understanding of natural language processing
- Concept of knowledge graphs and triples
- Familiarity with biomedical literature (PubMed)
Activities
- Exploration (5 min): Click each stage and note the key challenges and tools used.
- Manual NER (5 min): Read this sentence: "Imatinib inhibits BCR-ABL and is used to treat chronic myeloid leukemia." Manually identify all entities and classify them (drug, protein, disease). Extract the relationships.
- Discussion (5 min): Why is automated text mining necessary despite its imperfections? Consider that PubMed adds more than 1 million new articles per year, far more than anyone could read and curate by hand.
- Assessment (3 min): Answer the reflection questions below.
Assessment
- What is Named Entity Recognition, and why is it the first NLP step?
- What is a triple, and how does it represent a relationship from text?
- Why is entity resolution necessary when adding text-mined triples to an existing knowledge graph?
- What are the main sources of error in biomedical text mining?