Text Mining Pipeline for Knowledge Graph Population

Run the Text Mining Pipeline for Knowledge Graph Population MicroSim Fullscreen

About This MicroSim

This MicroSim presents a text mining pipeline that extracts biomedical knowledge from scientific literature and uses it to populate a knowledge graph. The flowchart shows how unstructured text (PubMed abstracts) is transformed into structured knowledge (subject, predicate, object triples) through a sequence of natural language processing (NLP) stages.

Pipeline Stages

PubMed Abstracts (blue) — Raw scientific text from the biomedical literature
Named Entity Recognition (NER) (green) — Identify biomedical entities (genes, diseases, drugs, proteins) in the text
Relation Extraction (green) — Determine relationships between identified entities (e.g., "drug X inhibits protein Y")
Triple Generation (green) — Convert extracted relationships into structured triples (subject, predicate, object)
KG Update (purple) — Add new triples to the knowledge graph, resolving entities and deduplicating
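
The five stages above can be sketched as a toy Python pipeline. This is a minimal illustration, not the code behind the MicroSim: the lexicon, the trigger phrases, and every function name are assumptions made for this example. Real systems use trained NER and relation extraction models rather than dictionary lookups.

```python
import re

# Hypothetical mini-lexicon: surface form -> (canonical ID, entity type).
# Two surface forms sharing one canonical ID is a crude stand-in for entity resolution.
LEXICON = {
    "brca1": ("BRCA1", "gene"),
    "breast cancer": ("breast_cancer", "disease"),
    "breast carcinoma": ("breast_cancer", "disease"),
    "imatinib": ("imatinib", "drug"),
    "bcr-abl": ("BCR-ABL", "protein"),
}

# Hypothetical trigger phrases mapped to predicate names.
PREDICATES = {
    "is associated with": "associated_with",
    "inhibits": "inhibits",
}


def recognize_entities(text):
    """NER stage: return sorted (start, end, canonical_id, type) tuples for lexicon hits."""
    lowered = text.lower()
    hits = []
    for surface, (canonical, etype) in LEXICON.items():
        for match in re.finditer(re.escape(surface), lowered):
            hits.append((match.start(), match.end(), canonical, etype))
    return sorted(hits)


def extract_relations(text, entities):
    """Relation extraction stage: look for a trigger phrase between adjacent entities."""
    lowered = text.lower()
    relations = []
    for left, right in zip(entities, entities[1:]):
        between = lowered[left[1]:right[0]].strip()
        if between in PREDICATES:
            relations.append((left, PREDICATES[between], right))
    return relations


def to_triples(relations):
    """Triple generation stage: keep only (subject, predicate, object) identifiers."""
    return [(subj[2], pred, obj[2]) for subj, pred, obj in relations]


def update_graph(graph, triples):
    """KG update stage: a set deduplicates; returns how many triples were new."""
    before = len(graph)
    graph.update(triples)
    return len(graph) - before


def process_abstract(graph, text):
    """Run one piece of text through all stages and add the result to the graph."""
    entities = recognize_entities(text)
    relations = extract_relations(text, entities)
    return update_graph(graph, to_triples(relations))


if __name__ == "__main__":
    graph = set()
    for sentence in [
        "BRCA1 is associated with breast cancer.",
        "BRCA1 is associated with breast carcinoma.",  # resolves to the same triple
        "Imatinib inhibits BCR-ABL.",
    ]:
        added = process_abstract(graph, sentence)
        print(f"{sentence!r}: {added} new triple(s)")
    print(sorted(graph))
```

Representing the graph as a set of tuples keeps deduplication trivial. A production knowledge graph would instead use a graph database or triple store and would track provenance for each triple.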

Color Coding

Blue — Input data (raw text)
Green — NLP processing stages
Purple — Knowledge graph output

How to Use

Click each pipeline stage to see its description, example tools, and sample outputs
Follow the transformation — Watch how "BRCA1 is associated with breast cancer" becomes the triple (BRCA1, associated_with, breast_cancer)
Consider challenges — Ambiguity, negation, and abbreviations make each NLP stage difficult
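
To make two of these challenges concrete, the sketch below flags negated sentences with a crude cue-word check and lists several possible expansions of one ambiguous abbreviation. The cue list, the abbreviation table, and the function names are assumptions chosen for illustration. Published negation detectors also consider the scope of each cue, which this example ignores.

```python
import re

# Assumed, deliberately short list of negation cues.
NEGATION_CUES = ("not", "no", "never", "without", "lack of")

# Illustrative abbreviation table: one short form, several possible meanings.
ABBREVIATIONS = {
    "ER": ["estrogen receptor", "endoplasmic reticulum", "emergency room"],
}


def is_negated(sentence):
    """Return True if any negation cue appears as a whole word or phrase."""
    lowered = sentence.lower()
    return any(
        re.search(r"\b" + re.escape(cue) + r"\b", lowered)
        for cue in NEGATION_CUES
    )


def candidate_expansions(token):
    """Return every known expansion of an abbreviation (empty list if unknown)."""
    return ABBREVIATIONS.get(token, [])


if __name__ == "__main__":
    for sentence in [
        "BRCA1 is associated with breast cancer",
        "BRCA1 is not associated with breast cancer",
    ]:
        print(sentence, "->", "negated" if is_negated(sentence) else "asserted")
    print("ER could mean:", candidate_expansions("ER"))
```

A pipeline that skips the negation check would emit the same triple (BRCA1, associated_with, breast_cancer) for both sentences, even though the second one states the opposite.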

Iframe Embed Code

<iframe src="https://dmccreary.github.io/bioinformatics/sims/text-mining-pipeline/main.html"
        height="520"
        width="100%"
        scrolling="no"></iframe>

Lesson Plan

Grade Level

College introductory bioinformatics

Duration

15-20 minutes

Prerequisites

Basic understanding of natural language processing
Concept of knowledge graphs and triples
Familiarity with biomedical literature (PubMed)

Activities

Exploration (5 min): Click each stage and note the key challenges and tools used.
Manual NER (5 min): Read this sentence: "Imatinib inhibits BCR-ABL and is used to treat chronic myeloid leukemia." Manually identify all entities and classify them (drug, protein, disease). Extract the relationships.
Discussion (5 min): Why is automated text mining necessary despite its imperfections? Consider that PubMed adds more than 1 million new articles per year.
Assessment (3 min): Answer the reflection questions below.
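
For the Manual NER activity, an instructor may want to compare student annotations against a reference. The sketch below scores a set of annotations with precision and recall. The reference annotation is one reasonable reading of the example sentence, and the student answer is invented for illustration. Both are assumptions, as are all names in the code.

```python
def precision_recall(predicted, reference):
    """Score predicted annotations against a reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Assumed reference annotation for:
# "Imatinib inhibits BCR-ABL and is used to treat chronic myeloid leukemia."
REFERENCE_ENTITIES = {
    ("Imatinib", "drug"),
    ("BCR-ABL", "protein"),
    ("chronic myeloid leukemia", "disease"),
}
REFERENCE_RELATIONS = {
    ("Imatinib", "inhibits", "BCR-ABL"),
    ("Imatinib", "treats", "chronic myeloid leukemia"),
}

if __name__ == "__main__":
    # Hypothetical student answer: one correct entity, one wrong type, one missed.
    student_entities = {("Imatinib", "drug"), ("BCR-ABL", "gene")}
    precision, recall = precision_recall(student_entities, REFERENCE_ENTITIES)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

The same function can score extracted relations by passing `REFERENCE_RELATIONS` as the reference, which links the manual exercise back to the discussion of how imperfect automated extraction is measured.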

Assessment

What is Named Entity Recognition, and why is it the first NLP step?
What is a triple, and how does it represent a relationship from text?
Why is entity resolution necessary when adding text-mined triples to an existing knowledge graph?
What are the main sources of error in biomedical text mining?