# A Beginner’s Guide to Provenance Graphs

# What Is a Provenance Graph

# Basic Definition

A provenance graph, also known as a data lineage graph or causal graph, is a directed graph (in some formulations a directed acyclic graph, or DAG) that records information flows and causal relationships among system entities. It is built from system-level events (e.g., process creation, file reads and writes, network communication) and gives a global view of “who did what to whom.”

In cybersecurity, provenance graphs are widely used for Advanced Persistent Threat (APT) detection, attack investigation, and forensic analysis. The core idea is to reconstruct attack chains and identify malicious behaviors by tracing causal dependencies among system entities.

# Basic Elements of the Graph

A provenance graph contains two core components:

# Nodes

Nodes represent system entities and typically fall into three categories:

- Process: A running program instance in the operating system, identified by attributes such as PID and process name.
- File: A file object on disk, identified by its file path.
- Network Connection (Socket / NetFlow): A network communication endpoint, identified by IP address and port.

# Edges

Edges represent system calls or operations between entities. Common edge types include:

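| Edge Type | Source | Target | Meaning |
| --- | --- | --- | --- |
| `read` | Process | File | Process reads a file |
| `write` | Process | File | Process writes to a file |
| `execute` | Process | File | Process executes a file (some systems instead draw this edge from the file to the resulting process, following the information flow) |
| `fork` / `clone` | Process | Process | Process creates a child process |
| `connect` | Process | Socket | Process initiates a connection |
| `recv` | Socket | Process | Process receives network data |
| `send` | Process | Socket | Process sends network data |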
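
The node and edge types above map naturally onto a small data structure. The following is a minimal sketch in plain Python; the names `Node`, `Edge`, and `ProvenanceGraph` are hypothetical and chosen for illustration, and a library such as NetworkX could play the same role.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical minimal types. Real systems attach many more attributes
# (timestamps, user IDs, command lines, hashes, ...).
@dataclass(frozen=True)
class Node:
    kind: str  # "process", "file", or "socket"
    name: str  # e.g. "PID:process name", a file path, or "ip:port"


@dataclass(frozen=True)
class Edge:
    src: Node
    dst: Node
    op: str  # e.g. "read", "write", "execute", "fork", "connect"


class ProvenanceGraph:
    def __init__(self):
        self.out_edges = defaultdict(list)  # node -> edges leaving it
        self.in_edges = defaultdict(list)   # node -> edges entering it

    def add_edge(self, src, dst, op):
        edge = Edge(src, dst, op)
        self.out_edges[src].append(edge)
        self.in_edges[dst].append(edge)
        return edge

    def nodes(self):
        return set(self.out_edges) | set(self.in_edges)


bash = Node("process", "1234:bash")
passwd = Node("file", "/etc/passwd")
sock = Node("socket", "203.0.113.5:443")

g = ProvenanceGraph()
g.add_edge(bash, passwd, "read")     # Process -> File
g.add_edge(bash, sock, "connect")    # Process -> Socket
```

Keeping both outgoing and incoming edge indexes is a deliberate choice here: it makes traversal in either direction equally cheap.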

# A Simple Example

```mermaid
graph LR
    A["Network:attacker_ip"] -->|recv| B["Process:outlook.exe"]
    B -->|write| C["File:malware.exe"]
    C -->|execute| D["Process:malware.exe"]
    D -->|read| E["File:passwd"]
    D -->|connect| F["Network:C2_server"]
```

From a known malicious node, analysts can perform:

- Forward tracking to identify the impact scope
- Backward tracking to locate the attack entry point
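
Both kinds of tracking amount to graph reachability. Here is a minimal sketch on the example graph, assuming edges are followed exactly as drawn and ignoring timestamps; the helper names `build_adjacency` and `reachable` are made up for this illustration.

```python
from collections import defaultdict, deque

# Edges of the example graph as (source, operation, target) triples.
EDGES = [
    ("Network:attacker_ip", "recv", "Process:outlook.exe"),
    ("Process:outlook.exe", "write", "File:malware.exe"),
    ("File:malware.exe", "execute", "Process:malware.exe"),
    ("Process:malware.exe", "read", "File:passwd"),
    ("Process:malware.exe", "connect", "Network:C2_server"),
]


def build_adjacency(edges):
    """Index the edge list in both directions."""
    successors, predecessors = defaultdict(set), defaultdict(set)
    for src, _op, dst in edges:
        successors[src].add(dst)
        predecessors[dst].add(src)
    return successors, predecessors


def reachable(start, neighbours):
    """Breadth-first search: every node reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


successors, predecessors = build_adjacency(EDGES)
poi = "Process:malware.exe"              # the known malicious node

forward = reachable(poi, successors)     # impact scope
backward = reachable(poi, predecessors)  # candidate entry points
```

Starting from `Process:malware.exe`, the forward set contains `File:passwd` and `Network:C2_server`, while the backward set leads through `File:malware.exe` and `Process:outlook.exe` to `Network:attacker_ip`. A practical implementation would also compare event timestamps, so that an event is only treated as a cause of events that happened after it.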

# Why Provenance Graphs Are Needed

# Limitations of Traditional Detection Methods

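Traditional intrusion detection systems (IDS) typically match patterns against individual events, which leads to:

- Lack of context: each event is judged in isolation, without the chain of actions that led to it
- High false-positive rates: benign and malicious activity can look the same at the single-event level
- Difficulty detecting multi-stage APT attacks, whose individual steps may each look harmless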

# Advantages of Provenance Graphs

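- Causal correlation: events are linked through the entities they share, so related activity is analyzed together
- Attack reconstruction: tracing dependencies from an alert recovers the chain of steps in an attack
- Anomaly detection: deviations from normal behavior can be identified in the graph structure rather than in single events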

# Standard Datasets

# DARPA Transparent Computing (E3 / E5)

- Multi-day audit logs with labeled APT scenarios
- Records in the Common Data Model (CDM) format
- Widely used benchmark

# OpTC

- Enterprise-scale dataset (~1000 hosts)
- Billions of events
- Suitable for scalability evaluation

# NodLink (2023)

- Fine-grained node-level and edge-level annotations
- Improved labeling quality

# Construction Pipeline

Raw Logs → Parsing → Entity Identification → Relation Extraction → Graph Construction → Graph Compression
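
To make the stages concrete, here is a minimal sketch of the pipeline in plain Python. The log format (`timestamp subject operation object`) and the function names `parse`, `identify`, and `build_graph` are assumptions made for illustration; real audit sources are far richer, and real compression schemes take more care to preserve causality than the simple edge merging shown here.

```python
# Hypothetical, simplified log lines: "timestamp subject operation object".
RAW_LOGS = """\
100 proc:outlook.exe write file:/tmp/malware.exe
101 proc:outlook.exe write file:/tmp/malware.exe
102 proc:malware.exe read file:/etc/passwd
garbage line
103 proc:malware.exe connect sock:198.51.100.7:443
"""


def parse(line):
    """Parsing: split one line into fields, or return None if malformed."""
    parts = line.split()
    if len(parts) != 4 or not parts[0].isdigit():
        return None
    ts, subject, op, obj = parts
    return int(ts), subject, op, obj


def identify(entity):
    """Entity identification: map 'kind:name' to a (kind, name) node key."""
    kind, _, name = entity.partition(":")
    return kind, name


def build_graph(raw):
    """Relation extraction, graph construction, and a simple compression step."""
    edges = {}  # (src, op, dst) -> [first_ts, last_ts, count]
    for line in raw.splitlines():
        record = parse(line)
        if record is None:
            continue  # skip lines that do not parse
        ts, subject, op, obj = record
        key = (identify(subject), op, identify(obj))
        if key in edges:
            # Compression: merge a repeated event into the existing edge.
            edges[key][1] = ts
            edges[key][2] += 1
        else:
            edges[key] = [ts, ts, 1]
    return edges


graph = build_graph(RAW_LOGS)
```

In this sketch the two identical `write` events collapse into a single edge that remembers its first and last timestamp and an occurrence count, and the malformed line is dropped during parsing.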

Common tools:

- NetworkX
- Neo4j
- DGL
- PyTorch Geometric

# Detection Methods

# Rule-Based

SLEUTH, Morse, RapSheet
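
To give a flavor of what a rule over a provenance graph can look like, the sketch below propagates an "untrusted" tag along edges and raises an alert when tagged content is executed. This is a toy rule written for this guide, not the actual policy of any of the systems named above; treating every socket as untrusted and the name `detect` are simplifying assumptions.

```python
# Toy events in time order: (source, operation, target).
EVENTS = [
    ("sock:203.0.113.9:80", "recv", "proc:outlook.exe"),
    ("proc:outlook.exe", "write", "file:/tmp/malware.exe"),
    ("proc:explorer.exe", "read", "file:/home/u/notes.txt"),
    ("file:/tmp/malware.exe", "execute", "proc:malware.exe"),
]


def detect(events, untrusted_prefix="sock:"):
    """Propagate an 'untrusted' tag along edges; alert on tainted execution."""
    tainted, alerts = set(), []
    for src, op, dst in events:
        # Assumption: anything coming from a socket is untrusted.
        if src.startswith(untrusted_prefix) or src in tainted:
            tainted.add(dst)
            if op == "execute":
                alerts.append((src, dst))
    return alerts


alerts = detect(EVENTS)
```

On these events the rule flags the execution of `/tmp/malware.exe`, because the file was written by a process that had received network data, and it ignores the unrelated `explorer.exe` activity.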

# Statistical / Anomaly-Based

StreamSpot, ProvDetector, Unicorn

# GNN-Based

ThreaTrace, MAGIC, Kairos, Flash

# Embedding-Based

ShadeWatcher, ProvNinja

# LLM-Assisted (Emerging)

Graph-to-text reasoning and attack narrative generation

# Evaluation

Metrics:

- Precision
- Recall
- F1-score
- False Positive Rate
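
These four metrics all follow from the counts of true/false positives and negatives. A minimal sketch, with a hypothetical helper `detection_metrics` and made-up counts:

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute standard detection metrics from confusion-matrix counts."""

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero when a class is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)   # flagged items that are truly malicious
    recall = ratio(tp, tp + fn)      # malicious items that were flagged
    f1 = ratio(2 * precision * recall, precision + recall)
    fpr = ratio(fp, fp + tn)         # benign items that were flagged
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Made-up counts: 10 malicious nodes among 1000, 8 of them detected,
# plus 2 benign nodes flagged by mistake.
m = detection_metrics(tp=8, fp=2, tn=988, fn=2)
```

With these counts, precision and recall are both 0.8 while the false positive rate is roughly 0.002. Because benign entities vastly outnumber malicious ones in provenance data, even a small false positive rate can mean many false alerts in absolute terms, so it helps to report precision alongside it.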

Evaluation levels:

- Node-level
- Edge-level
- Graph-level
- Scenario-level

The Orthrus framework evaluates:

- Effectiveness
- Timeliness
- Scalability
- Robustness
- Generalizability

# Research Directions

- Real-time detection
- Cross-host attack analysis
- Adversarial robustness
- Concept drift adaptation
- Interpretability
- Efficient storage and compression

# Recommended Practice Path

1. Start with DARPA E3
2. Build graphs using Python + NetworkX
3. Visualize with Graphviz or Gephi
4. Reproduce a baseline method