# A Beginner’s Guide to Provenance Graphs
# What Is a Provenance Graph
# Basic Definition
A provenance graph, also known as a data lineage graph or causal graph, is a directed graph (in many formulations a directed acyclic graph, or DAG) that records information flows and causal relationships among system entities. It is built from system-level events (e.g., process creation, file reads and writes, network communication) and provides a global view of “who did what to whom.”
In cybersecurity, provenance graphs are widely used for Advanced Persistent Threat (APT) detection, attack investigation, and forensic analysis. The core idea is to reconstruct attack chains and identify malicious behaviors by tracing causal dependencies among system entities.
# Basic Elements of the Graph
A provenance graph contains two core components:
# Nodes
Nodes represent system entities and typically fall into three categories:
- Process: A running program instance in the operating system, identified by attributes such as PID and process name.
- File: A file object on disk, identified by its file path.
- Network Connection (Socket / NetFlow): A network communication endpoint, identified by IP address and port.
# Edges
Edges represent system calls or operations between entities. Common edge types include:
| Edge Type | Source | Target | Meaning |
|---|---|---|---|
| read | Process | File | Process reads a file |
| write | Process | File | Process writes to a file |
| execute | Process | File | Process executes a file |
| fork / clone | Process | Process | Process creates a child process |
| connect | Process | Socket | Process initiates connection |
| recv / send | Socket (recv) / Process (send) | Process (recv) / Socket (send) | Network data transmission |
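The node categories and edge types above can be written down as a small data model. The following is a minimal sketch, not a standard schema: the names `Node`, `Edge`, `EDGE_SCHEMA`, and `is_valid`, the identifier strings, and the integer timestamp are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    kind: str   # "process", "file", or "socket"
    ident: str  # e.g. "pid=412:bash", "/etc/passwd", "10.0.0.5:443"


@dataclass(frozen=True)
class Edge:
    src: Node
    dst: Node
    op: str         # edge type from the table, e.g. "read" or "fork"
    timestamp: int  # event time; the unit is an assumption


# Allowed (source kind, target kind) per edge type, following the table above.
EDGE_SCHEMA = {
    "read": ("process", "file"),
    "write": ("process", "file"),
    "execute": ("process", "file"),
    "fork": ("process", "process"),
    "clone": ("process", "process"),
    "connect": ("process", "socket"),
    "recv": ("socket", "process"),
    "send": ("process", "socket"),
}


def is_valid(edge: Edge) -> bool:
    """Return True if the endpoint kinds match the schema for this edge type."""
    return EDGE_SCHEMA.get(edge.op) == (edge.src.kind, edge.dst.kind)


bash = Node("process", "pid=412:bash")
passwd = Node("file", "/etc/passwd")
print(is_valid(Edge(bash, passwd, "read", 1)))
print(is_valid(Edge(passwd, bash, "fork", 2)))
```

Checking endpoint kinds at ingestion time is one cheap way to catch parsing mistakes before they silently distort the graph.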
# A Simple Example
From a known malicious node, analysts can perform:
- Forward tracking to identify the impact scope
- Backward tracking to locate the attack entry point
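Both kinds of tracking reduce to graph reachability: forward tracking follows edges from the suspicious node, backward tracking follows them in reverse. The sketch below uses breadth-first search over plain dictionaries. The event list, node names, and helper names (`build`, `reachable`) are invented for illustration, and the example deliberately uses only edge types whose direction matches the direction of data flow.

```python
from collections import defaultdict, deque

# (source, operation, target) triples; all names are made up for illustration.
EVENTS = [
    ("socket:203.0.113.7:4444", "recv", "proc:nginx"),
    ("proc:nginx", "write", "file:/tmp/payload.sh"),
    ("proc:nginx", "fork", "proc:sh"),
    ("proc:sh", "write", "file:/home/user/.ssh/authorized_keys"),
    ("proc:sh", "connect", "socket:198.51.100.9:53"),
    ("proc:sshd", "fork", "proc:bash"),
    ("proc:bash", "write", "file:/tmp/notes.txt"),
]


def build(events):
    """Return forward and reverse adjacency maps."""
    fwd, rev = defaultdict(set), defaultdict(set)
    for src, _op, dst in events:
        fwd[src].add(dst)
        rev[dst].add(src)
    return fwd, rev


def reachable(adj, start):
    """Breadth-first search: every node reachable from start, excluding start."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


fwd, rev = build(EVENTS)
impact = reachable(fwd, "proc:sh")  # forward tracking: what did it touch?
origin = reachable(rev, "proc:sh")  # backward tracking: where did it come from?
print(sorted(impact))
print(sorted(origin))
```

Two simplifications are worth keeping in mind. For `read` and `execute`, data flows from the file to the process, so a real tracker would reverse those edges before traversal. The sketch also ignores timestamps, whereas practical backward tracking usually only follows events that happened before the point of interest.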
# Why Provenance Graphs Are Needed
# Limitations of Traditional Detection Methods
Traditional intrusion detection systems (IDS) rely on single-event pattern matching, which suffers from:
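- Lack of context: each event is judged in isolation, without its causes or consequences
- High false-positive rates
- Difficulty detecting multi-stage APT attacks, whose individual steps can each look benign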
# Advantages of Provenance Graphs
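- Causal correlation: isolated events are linked through their dependencies
- Attack reconstruction: an attack chain can be traced from a single suspicious node
- Anomaly detection: deviations from normal behavior can be identified in the graph structure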
# Standard Datasets
# DARPA Transparent Computing (E3 / E5)
- Multi-day audit logs with labeled APT scenarios
- Released in the Common Data Model (CDM) format
- Widely used benchmark
# OpTC
- Enterprise-scale dataset (~1000 hosts)
- Billions of events
- Suitable for scalability evaluation
# NodLink (2023)
- Fine-grained node-level and edge-level annotations
- Improved labeling quality
# Construction Pipeline
Common tools:
- NetworkX
- Neo4j
- DGL
- PyTorch Geometric
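One plausible shape for a construction pipeline is: parse raw audit records, map each record to a typed edge, then assemble the edges into a graph. The sketch below shows those three steps with the standard library only, so it runs without any of the tools listed above. The `key=value` log format, the field names (`peer`, `child_pid`, `child_comm`), and the node identifier strings are hypothetical and do not correspond to any real audit framework or dataset.

```python
from collections import defaultdict

# Hypothetical audit log; real sources (auditd, ETW, CDM records) look different.
RAW_LOG = """\
ts=100 pid=412 comm=nginx op=recv peer=203.0.113.7:4444
ts=101 pid=412 comm=nginx op=write path=/tmp/payload.sh
ts=102 pid=412 comm=nginx op=fork child_pid=413 child_comm=sh
ts=103 pid=413 comm=sh op=connect peer=198.51.100.9:53
"""


def parse_line(line):
    """Split 'key=value' tokens into a dict."""
    return dict(token.split("=", 1) for token in line.split())


def to_edge(rec):
    """Map one parsed record to (source, op, target, timestamp)."""
    proc = f"proc:{rec['pid']}:{rec['comm']}"
    op, ts = rec["op"], int(rec["ts"])
    if op in ("read", "write", "execute"):
        return proc, op, f"file:{rec['path']}", ts
    if op in ("fork", "clone"):
        child = f"proc:{rec['child_pid']}:{rec['child_comm']}"
        return proc, op, child, ts
    if op in ("connect", "send"):
        return proc, op, f"socket:{rec['peer']}", ts
    if op == "recv":
        return f"socket:{rec['peer']}", op, proc, ts
    raise ValueError(f"unsupported operation: {op}")


def build_graph(raw):
    """Assemble an adjacency list: source -> [(op, target, timestamp), ...]."""
    graph = defaultdict(list)
    for line in raw.splitlines():
        if line.strip():
            src, op, dst, ts = to_edge(parse_line(line))
            graph[src].append((op, dst, ts))
    return graph


graph = build_graph(RAW_LOG)
for src, edges in graph.items():
    for op, dst, ts in edges:
        print(f"{ts}: {src} -[{op}]-> {dst}")
```

The adjacency list maps directly onto a graph library when one is needed: each `(source, op, target, timestamp)` tuple becomes one attributed edge.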
# Detection Methods
# Rule-Based
SLEUTH, Morse, RapSheet
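To give a feel for the rule-based style, here is a toy rule: raise an alert when a process forks a shell after it has received network data. This is an illustrative sketch only and is not the rule set or tag-propagation logic of any of the systems named above. The event tuples, the `SHELLS` set, and the function name are assumptions.

```python
SHELLS = {"sh", "bash"}

# (timestamp, source, operation, target); hypothetical events.
EVENTS = [
    (100, "socket:203.0.113.7:4444", "recv", "proc:nginx"),
    (102, "proc:nginx", "fork", "proc:sh"),
    (200, "proc:sshd", "fork", "proc:bash"),
]


def network_then_shell(events):
    """Flag processes that fork a shell after receiving network data."""
    received_at = {}  # process -> time of its first network receive
    alerts = []
    for ts, src, op, dst in sorted(events):  # process events in time order
        if op == "recv":
            received_at.setdefault(dst, ts)
        elif op == "fork" and dst.split(":", 1)[1] in SHELLS:
            if src in received_at:
                alerts.append((src, dst, ts))
    return alerts


print(network_then_shell(EVENTS))
```

Here `nginx` is flagged because it received data before forking `sh`, while `sshd` forking `bash` is not, since no prior network receive was recorded for it. A rule this coarse would be noisy on real hosts, which is part of why practical systems add richer context.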
# Statistical / Anomaly-Based
StreamSpot, ProvDetector, Unicorn
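A common intuition behind anomaly-based approaches is that interactions rarely or never seen in benign activity deserve more attention. The sketch below scores a single interaction by the negative log of its smoothed frequency in a benign baseline. It is a deliberately simple stand-in and does not reproduce the algorithm of any system named above. The baseline data, the smoothing choice, and the function names are assumptions.

```python
import math
from collections import Counter

# Hypothetical benign baseline of (source, operation, target) triples.
BASELINE = (
    [("nginx", "write", "/var/log/nginx/access.log")] * 50
    + [("nginx", "read", "/etc/nginx/nginx.conf")] * 5
)


def fit(baseline):
    """Count how often each triple occurs in the benign baseline."""
    return Counter(baseline), len(baseline)


def rarity(triple, counts, total):
    """Negative log of a smoothed relative frequency.

    Adding 1 to both count and total keeps the score finite for unseen triples.
    """
    return -math.log((counts[triple] + 1) / (total + 1))


counts, total = fit(BASELINE)
common = rarity(("nginx", "write", "/var/log/nginx/access.log"), counts, total)
unseen = rarity(("nginx", "write", "/tmp/payload.sh"), counts, total)
print(f"common={common:.2f} unseen={unseen:.2f}")
```

The unseen write scores much higher than the routine log write. Scoring individual edges ignores graph structure, so it is only a starting point compared with methods that score paths or subgraphs.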
# GNN-Based
ThreaTrace, MAGIC, Kairos, Flash
# Embedding-Based
ShadeWatcher, ProvNinja
# LLM-Assisted (Emerging)
Graph-to-text reasoning and attack narrative generation
# Evaluation
Metrics:
- Precision
- Recall
- F1-score
- False Positive Rate
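All four metrics follow from the confusion-matrix counts: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). A minimal sketch follows. The function name and the example counts are hypothetical, and returning 0.0 for an undefined ratio is a convention chosen here, not a standard.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute precision, recall, F1, and false positive rate from counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Hypothetical node-level result: 10 malicious nodes among 1000.
metrics = detection_metrics(tp=8, fp=2, tn=988, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Provenance data is heavily imbalanced, with benign entities vastly outnumbering malicious ones. In this example the false positive rate is about 0.2%, yet one in five alerts is still a false alarm, which is why precision is worth reporting alongside it.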
Evaluation levels:
- Node-level
- Edge-level
- Graph-level
- Scenario-level
The Orthrus framework evaluates:
- Effectiveness
- Timeliness
- Scalability
- Robustness
- Generalizability
# Research Directions
- Real-time detection
- Cross-host attack analysis
- Adversarial robustness
- Concept drift adaptation
- Interpretability
- Efficient storage and compression
# Recommended Practice Path
- Start with DARPA E3
- Build graphs using Python + NetworkX
- Visualize with Graphviz or Gephi
- Reproduce a baseline method
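For the visualization step, one low-friction route is to emit Graphviz DOT text directly from an edge list. The sketch below does this with the standard library only. The event triples, the `kind:name` node naming, and the shape mapping are assumptions, and node names are assumed not to contain double quotes.

```python
# (source, operation, target) triples; hypothetical names.
EVENTS = [
    ("socket:203.0.113.7:4444", "recv", "proc:nginx"),
    ("proc:nginx", "write", "file:/tmp/payload.sh"),
    ("proc:nginx", "fork", "proc:sh"),
]

# One shape per node kind so entity types are distinguishable at a glance.
SHAPES = {"proc": "box", "file": "ellipse", "socket": "diamond"}


def to_dot(events):
    """Render (source, op, target) triples as Graphviz DOT text."""
    nodes = sorted({n for src, _op, dst in events for n in (src, dst)})
    lines = ["digraph provenance {"]
    for node in nodes:
        kind = node.split(":", 1)[0]
        lines.append(f'  "{node}" [shape={SHAPES.get(kind, "ellipse")}];')
    for src, op, dst in events:
        lines.append(f'  "{src}" -> "{dst}" [label="{op}"];')
    lines.append("}")
    return "\n".join(lines)


dot_text = to_dot(EVENTS)
print(dot_text)
```

Saving the output to a file such as `graph.dot` and running `dot -Tpng graph.dot -o graph.png` renders it as an image. Full provenance graphs are far too large to draw whole, so in practice it helps to export only the subgraph returned by forward or backward tracking.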