Quiz: Process Mining, Data Lineage, and Provenance

Test your understanding of event logs, process discovery and conformance checking, column-level lineage, OpenLineage, event sourcing, CDC, and the difference between lineage and provenance.

1. Which three fields must appear in every event record for an event log to be useful for process mining?

A. Case ID, Activity, and Timestamp
B. Username, Password, and Session token
C. Source IP, Destination IP, and Port
D. Schema version, Encoding, and Compression

Show Answer

The correct answer is A. The chapter is explicit: case ID identifies the process instance, activity identifies what happened, and timestamp records when. Without all three, the log cannot be used to reconstruct process flow. The other options describe security or transport fields that are not process-mining requirements.
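
As a minimal sketch of why all three fields matter, the snippet below groups events by case ID and orders each group by timestamp to recover one activity sequence per process instance. The field names (`case_id`, `activity`, `timestamp`) and the sample records are illustrative assumptions, not taken from the chapter.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical event records; names and values are illustrative only.
events = [
    {"case_id": "order-1", "activity": "Create Order", "timestamp": "2024-01-05T09:00:00"},
    {"case_id": "order-2", "activity": "Create Order", "timestamp": "2024-01-05T09:05:00"},
    {"case_id": "order-1", "activity": "Ship Order", "timestamp": "2024-01-06T14:30:00"},
    {"case_id": "order-1", "activity": "Approve Order", "timestamp": "2024-01-05T11:15:00"},
]


def build_traces(events):
    """Group events by case ID, then order each group by timestamp."""
    by_case = defaultdict(list)
    for event in events:
        by_case[event["case_id"]].append(event)
    return {
        case: [
            e["activity"]
            for e in sorted(evs, key=lambda e: datetime.fromisoformat(e["timestamp"]))
        ]
        for case, evs in by_case.items()
    }


traces = build_traces(events)
```

Dropping any one field breaks the reconstruction: without the case ID the events cannot be grouped, without the timestamp they cannot be ordered, and without the activity there is nothing to sequence.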

Concept Tested: Event Log

2. What does process discovery produce as output?

A. A list of users sorted by activity frequency
B. A process model (for example, a directly-follows graph) derived from event log data, showing the activities that actually occurred and the transitions between them
C. A SQL query plan for the underlying database
D. An anonymized version of the event log

Show Answer

The correct answer is B. Process discovery infers a process model (often a directly-follows graph or Petri net) from event logs, showing the empirical sequence of activities. User-frequency reports (A), query plans (C), and anonymization (D) are unrelated outputs.
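
A directly-follows graph can be sketched in a few lines: count how often one activity is immediately followed by another within the same trace. The traces below are made-up examples, and this counting step is only the simplest form of discovery, not a full Petri net miner.

```python
from collections import Counter

# Hypothetical traces: one ordered activity list per case.
traces = [
    ["Create Order", "Approve Order", "Ship Order"],
    ["Create Order", "Approve Order", "Ship Order"],
    ["Create Order", "Ship Order"],
]


def directly_follows(traces):
    """Count each (activity, next_activity) pair across all traces."""
    dfg = Counter()
    for trace in traces:
        for current, following in zip(trace, trace[1:]):
            dfg[(current, following)] += 1
    return dfg


dfg = directly_follows(traces)
```

Each key in `dfg` is an edge of the graph and each count is how often that transition was observed, so the edge from `Create Order` straight to `Ship Order` exposes the one case that skipped approval.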

Concept Tested: Process Discovery

3. What does conformance checking compare?

A. Two different database schemas to find matching columns
B. The graph schema against a SKOS thesaurus
C. An actual event log against a reference process model, identifying deviations such as skipped steps, reversed steps, or unauthorized activities
D. The current data lake size against last quarter's size

Show Answer

The correct answer is C. Conformance checking compares the empirical event log against the intended reference process model and surfaces the deviations. Option A describes schema matching. Option B pairs a graph schema with a vocabulary standard, which says nothing about process behavior. Option D is capacity monitoring.
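
The three deviation types named in the answer can be illustrated with a deliberately simplified check. It assumes the reference model is a strict sequence in which each activity occurs once; production conformance checking typically uses token replay or alignments against a richer model. The activity names and `REFERENCE_ORDER` are hypothetical.

```python
# Hypothetical reference model: a strictly sequential process.
REFERENCE_ORDER = ["Create Order", "Approve Order", "Ship Order"]


def check_trace(trace, reference=REFERENCE_ORDER):
    """Return a list of (deviation_type, detail) tuples for one trace."""
    position = {activity: i for i, activity in enumerate(reference)}
    deviations = []

    # Activities that the reference model does not contain at all.
    for activity in trace:
        if activity not in position:
            deviations.append(("unauthorized", activity))

    known = [a for a in trace if a in position]

    # Reference activities that never occurred in the trace.
    for activity in reference:
        if activity not in known:
            deviations.append(("skipped", activity))

    # Adjacent known activities that occurred in the wrong order.
    for first, second in zip(known, known[1:]):
        if position[first] > position[second]:
            deviations.append(("reversed", (first, second)))

    return deviations
```

For example, `check_trace(["Create Order", "Ship Order"])` reports `Approve Order` as skipped, while a fully conforming trace returns an empty list.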

Concept Tested: Conformance Checking

4. Why is column-level lineage more difficult to capture than table-level lineage?

A. Column-level lineage requires parsing the SQL or transformation logic at each pipeline step to identify which input columns feed which output columns, often through complex dataflow analysis
B. Column-level lineage requires a quantum database
C. Column-level lineage is forbidden under most data classification policies
D. Column-level lineage cannot be visualized

Show Answer

The correct answer is A. The chapter explains that column-level lineage demands SQL/dataflow parsing at each step to map input columns to output columns. The other options are not real limitations.
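
To give a feel for the parsing involved, the sketch below extracts input columns from transformation expressions. As a stand-in for a real SQL parser, it assumes each output column is defined by a Python-syntax expression and uses the standard-library `ast` module; the `step` mapping and column names are hypothetical.

```python
import ast


def input_columns(expression):
    """Return the column names an expression reads, ignoring function names."""
    tree = ast.parse(expression, mode="eval")
    # Name nodes used as the callee of a call (e.g. `upper`) are not columns.
    callee_ids = {
        id(node.func) for node in ast.walk(tree) if isinstance(node, ast.Call)
    }
    return sorted(
        {
            node.id
            for node in ast.walk(tree)
            if isinstance(node, ast.Name) and id(node) not in callee_ids
        }
    )


# Hypothetical pipeline step: output column -> defining expression.
step = {
    "revenue": "quantity * unit_price",
    "customer": "upper(customer_name)",
    "net_amount": "gross_amount - discount",
}

lineage = {output: input_columns(expr) for output, expr in step.items()}
```

Table-level lineage only needs to know which tables a step reads and writes. Column-level lineage needs this kind of expression analysis for every output column at every step, and real SQL adds joins, subqueries, and `SELECT *` on top.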

Concept Tested: Column-Level Lineage

5. A revenue figure in a quarterly report is suspected to be wrong. The investigator needs to trace it back through aggregations and ETL steps to the original source records. Which technique does this directly?

A. Conformance checking
B. Downstream lineage
C. Upstream lineage
D. Change data capture

Show Answer

The correct answer is C. Upstream lineage traces a data value backward from its current location to its original source through every transformation. Downstream lineage (B) runs the opposite direction. Conformance checking (A) is about process flow, not data flow. CDC (D) is a capture mechanism, not a tracing technique.
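
Upstream tracing amounts to walking a lineage graph backward from the suspect value. The sketch below does this with a breadth-first traversal over a hypothetical `UPSTREAM` mapping; the dataset names are invented for illustration.

```python
from collections import deque

# Hypothetical lineage graph: each dataset maps to its direct inputs.
UPSTREAM = {
    "report.quarterly_revenue": ["warehouse.revenue_by_quarter"],
    "warehouse.revenue_by_quarter": ["staging.orders_clean"],
    "staging.orders_clean": ["source.orders", "source.refunds"],
}


def trace_upstream(node, upstream=UPSTREAM):
    """Return every ancestor of `node`, nearest first (breadth-first)."""
    seen = {node}
    ancestors = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, []):
            if parent not in seen:
                seen.add(parent)
                ancestors.append(parent)
                queue.append(parent)
    return ancestors
```

Downstream lineage is the same traversal over the inverted mapping, which is why the two are usually served from a single graph.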

Concept Tested: Upstream Lineage

6. How does the chapter distinguish data lineage from data provenance?

A. Lineage and provenance are synonyms
B. Lineage applies only to structured data; provenance applies only to unstructured data
C. Lineage answers where data came from (structural pipeline query); provenance answers whether it can be trusted and who is accountable (custody chain and transformation history)
D. Lineage is a graph; provenance is a relational table

Show Answer

The correct answer is C. Lineage is about structural origin (the pipeline path); provenance is about trust, custody chain, and accountability. They are complementary but distinct. The other options misstate the relationship.

Concept Tested: Lineage vs Provenance

7. A team wants their pipeline tool's lineage output to be readable by their lineage catalog from a different vendor without writing custom integration code. Which standard supports this interoperability?

A. ISO 11179
B. IEEE XES
C. SKOS
D. The OpenLineage Standard

Show Answer

The correct answer is D. OpenLineage is the open specification for portable lineage events: it defines a JSON event model so that tools implementing the spec can interoperate. ISO 11179 (A) is a metadata-registry standard, IEEE XES (B) is an event-log format, and SKOS (C) is a vocabulary standard. Only OpenLineage targets lineage interoperability.
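
The snippet below is a rough sketch of the shape of an OpenLineage run event, built as a plain dictionary and serialized to JSON. The job, dataset, namespace, and producer values are hypothetical, and the event is not guaranteed to validate as-is: the specification also expects a `schemaURL` whose value depends on the spec version, so check the current specification before emitting real events.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(job_name, inputs, outputs, namespace="example-namespace"):
    """Build a run-event-shaped dict; all values here are illustrative."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        # Hypothetical producer URI identifying the emitting tool.
        "producer": "https://example.com/hypothetical-pipeline",
    }


event = make_run_event(
    "build_quarterly_revenue",
    inputs=["staging.orders_clean"],
    outputs=["warehouse.revenue_by_quarter"],
)
payload = json.dumps(event)
```

Because every producer emits the same job, run, input, and output structure, a catalog from another vendor can consume `payload` without tool-specific integration code.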

Concept Tested: OpenLineage Standard

8. A retail team is designing a new order-processing system and wants a built-in, tamper-evident audit trail. They are evaluating event sourcing. Which property does event sourcing give them by design?

A. Every change is stored as an immutable event in an append-only log, so the current state can be reconstructed by replay and no past event can be silently altered
B. Automatic compression of all event payloads
C. Forward chaining inference over an ontology
D. Bypass of any compliance review for state changes

Show Answer

The correct answer is A. Event sourcing stores every state change as an immutable append-only event, giving replayable state reconstruction and tamper-evident history — exactly the audit trail described. Compression (B), inference (C), and compliance bypass (D) are not properties of event sourcing.
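
A minimal sketch of the idea follows: an append-only store, a `replay` function that rebuilds state from events, and a hash chain that makes later edits detectable. The hash chain is one common way to obtain tamper evidence and is an assumption here, not something the chapter prescribes; the event types and the `EventStore` class are hypothetical.

```python
import hashlib
import json


def _digest(body):
    """Hash an event body deterministically."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class EventStore:
    """Append-only log; each event records the hash of its predecessor."""

    def __init__(self):
        self._events = []

    def append(self, event_type, data):
        prev_hash = self._events[-1]["hash"] if self._events else ""
        body = {"type": event_type, "data": data, "prev_hash": prev_hash}
        self._events.append({**body, "hash": _digest(body)})

    def events(self):
        return tuple(self._events)

    def verify(self):
        """Recompute the chain; any edited event breaks it."""
        prev_hash = ""
        for event in self._events:
            body = {"type": event["type"], "data": event["data"], "prev_hash": prev_hash}
            if event["prev_hash"] != prev_hash or event["hash"] != _digest(body):
                return False
            prev_hash = event["hash"]
        return True


def replay(events):
    """Rebuild the current order state by applying events in order."""
    state = {"status": None, "items": []}
    for event in events:
        if event["type"] == "OrderCreated":
            state["status"] = "created"
        elif event["type"] == "ItemAdded":
            state["items"].append(event["data"]["sku"])
        elif event["type"] == "OrderShipped":
            state["status"] = "shipped"
    return state


store = EventStore()
store.append("OrderCreated", {"order_id": 7})
store.append("ItemAdded", {"sku": "SKU-1"})
store.append("OrderShipped", {})
current_state = replay(store.events())
```

Python cannot stop code from mutating the stored dictionaries, so immutability here is a convention; `verify` is what turns a silent alteration into a detectable one.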

Concept Tested: Event Sourcing

9. A legacy relational database was not designed with event sourcing in mind, but the team needs real-time lineage events flowing into the context graph. Which technique observes the database transaction log to publish row-level changes as a downstream event stream?

A. Process enhancement
B. Change data capture
C. Schema matching
D. CQRS

Show Answer

The correct answer is B. Change data capture (CDC) monitors the database transaction log and publishes inserts, updates, and deletes as events, which makes it well suited to retrofitting real-time event flow onto a legacy database. Process enhancement (A) is a process-mining technique. Schema matching (C) is metadata alignment. CQRS (D) is an architectural pattern, not the transaction-log observation technique itself.
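
The flow can be sketched with a simulated log. Real CDC tools read the database's own write-ahead or binary log; here an in-memory list named `TRANSACTION_LOG` stands in for it, and the entry layout (`lsn`, `op`, `before`, `after`) is an illustrative assumption rather than any specific database's format.

```python
# Simulated transaction log; a stand-in for a real database log.
TRANSACTION_LOG = [
    {"lsn": 1, "op": "INSERT", "table": "orders",
     "before": None, "after": {"id": 7, "status": "new"}},
    {"lsn": 2, "op": "UPDATE", "table": "orders",
     "before": {"id": 7, "status": "new"}, "after": {"id": 7, "status": "paid"}},
    {"lsn": 3, "op": "DELETE", "table": "orders",
     "before": {"id": 7, "status": "paid"}, "after": None},
]


def capture_changes(log, after_lsn=0):
    """Yield one change event per log entry past the given log position."""
    for entry in log:
        if entry["lsn"] > after_lsn:
            yield {
                "position": entry["lsn"],
                "operation": entry["op"].lower(),
                "source_table": entry["table"],
                "before": entry["before"],
                "after": entry["after"],
            }


change_events = list(capture_changes(TRANSACTION_LOG))
```

Tracking the last processed position, as `after_lsn` does, lets a consumer resume after a restart without re-reading the whole log, and the application that writes to the database needs no changes.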

Concept Tested: Change Data Capture

10. A compliance officer asks "Which version of the pricing policy was in effect when this order was placed last year?" Which context graph capability is most directly required to answer this temporal question?

A. Temporal versioning: valid_from and valid_to timestamps on graph nodes and edges that let queries reconstruct state at any past point in time
B. Differential privacy
C. Knowledge graph embedding training
D. Graph sharding by department

Show Answer

The correct answer is A. Temporal versioning is exactly the capability the chapter describes for point-in-time queries: validity ranges on nodes and edges let queries return the state of the graph as it was on any past date. The other options solve unrelated problems.
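
A point-in-time lookup over validity ranges can be sketched as follows. The policy versions and dates are invented, and the sketch assumes half-open intervals (`valid_from` inclusive, `valid_to` exclusive, with `None` meaning still in effect), which is a common convention rather than something the chapter specifies.

```python
from datetime import date

# Hypothetical versions of a pricing-policy node.
POLICY_VERSIONS = [
    {"version": "v1", "valid_from": date(2023, 1, 1), "valid_to": date(2023, 7, 1)},
    {"version": "v2", "valid_from": date(2023, 7, 1), "valid_to": date(2024, 3, 1)},
    {"version": "v3", "valid_from": date(2024, 3, 1), "valid_to": None},
]


def version_in_effect(versions, as_of):
    """Return the version whose validity range contains `as_of`, else None."""
    for v in versions:
        starts_before = v["valid_from"] <= as_of
        still_valid = v["valid_to"] is None or as_of < v["valid_to"]
        if starts_before and still_valid:
            return v["version"]
    return None
```

With these sample ranges, an order placed on 15 September 2023 resolves to `v2`. The half-open convention keeps boundary dates unambiguous, because a date equal to one version's `valid_to` belongs only to the next version.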

Concept Tested: Temporal Versioning