# A Beginner’s Guide to Provenance Graphs


# What Is a Provenance Graph

# Basic Definition

A provenance graph, also known as a data lineage graph or causal graph, is a directed graph (often treated as a directed acyclic graph, or DAG) that records information flows and causal relationships among system entities. It is built from system-level events such as process creation, file reads and writes, and network communication, and gives a global view of “who did what to whom.”

In cybersecurity, provenance graphs are widely used for Advanced Persistent Threat (APT) detection, attack investigation, and forensic analysis. The core idea is to reconstruct attack chains and identify malicious behaviors by tracing causal dependencies among system entities.

# Basic Elements of the Graph

A provenance graph contains two core components:

# Nodes

Nodes represent system entities and typically fall into three categories:

  • Process: A running program instance in the operating system, identified by attributes such as PID and process name.
  • File: A file object on disk, identified by its file path.
  • Network Connection (Socket / NetFlow): A network communication endpoint, identified by IP address and port.

# Edges

Edges represent system calls or operations between entities. Common edge types include:

| Edge Type | Source | Target | Meaning |
|---|---|---|---|
| read | Process | File | Process reads a file |
| write | Process | File | Process writes to a file |
| execute | Process | File | Process executes a file |
| fork / clone | Process | Process | Process creates a child process |
| connect | Process | Socket | Process initiates a connection |
| recv | Socket | Process | Network data transmission (inbound) |
| send | Process | Socket | Network data transmission (outbound) |
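
To make the node and edge types concrete, here is a minimal sketch of how they might be represented in Python. The class names, the `key` field, the `timestamp` field, and the `is_valid` helper are all hypothetical choices made for illustration. Only the edge types and their source and target categories come from the table above.

```python
from dataclasses import dataclass
from enum import Enum


class NodeType(Enum):
    PROCESS = "process"
    FILE = "file"
    SOCKET = "socket"


@dataclass(frozen=True)
class Node:
    type: NodeType
    key: str  # identifying attributes, e.g. "1234:bash", "/etc/passwd", "10.0.0.5:443"


@dataclass(frozen=True)
class Edge:
    src: Node
    dst: Node
    op: str         # edge type, e.g. "read"
    timestamp: int  # hypothetical event time, not part of the table


# (source type, target type) expected for each edge type in the table.
EDGE_SCHEMA = {
    "read": (NodeType.PROCESS, NodeType.FILE),
    "write": (NodeType.PROCESS, NodeType.FILE),
    "execute": (NodeType.PROCESS, NodeType.FILE),
    "fork": (NodeType.PROCESS, NodeType.PROCESS),
    "clone": (NodeType.PROCESS, NodeType.PROCESS),
    "connect": (NodeType.PROCESS, NodeType.SOCKET),
    "recv": (NodeType.SOCKET, NodeType.PROCESS),
    "send": (NodeType.PROCESS, NodeType.SOCKET),
}


def is_valid(edge: Edge) -> bool:
    """Check that an edge connects the node types its edge type expects."""
    return EDGE_SCHEMA.get(edge.op) == (edge.src.type, edge.dst.type)
```

Frozen dataclasses are hashable, so the same `Node` can be used as a dictionary key or set member when many events refer to one entity.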

# A Simple Example

```mermaid
graph LR
    A["Network:attacker_ip"] -->|recv| B["Process:outlook.exe"]
    B -->|write| C["File:malware.exe"]
    C -->|execute| D["Process:malware.exe"]
    D -->|read| E["File:passwd"]
    D -->|connect| F["Network:C2_server"]
```

From a known malicious node, analysts can perform:

  • Forward tracking to identify the impact scope
  • Backward tracking to locate the attack entry point
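
Both kinds of tracking amount to graph reachability. The sketch below runs a breadth-first search over the edges of the example graph using only the standard library. The function names are hypothetical, and the sketch ignores event timestamps, which a real tracker would normally take into account. With NetworkX, `nx.descendants` and `nx.ancestors` on a `DiGraph` should give the same node sets.

```python
from collections import defaultdict, deque

# Edges copied from the example graph: (source, edge type, target).
EDGES = [
    ("Network:attacker_ip", "recv", "Process:outlook.exe"),
    ("Process:outlook.exe", "write", "File:malware.exe"),
    ("File:malware.exe", "execute", "Process:malware.exe"),
    ("Process:malware.exe", "read", "File:passwd"),
    ("Process:malware.exe", "connect", "Network:C2_server"),
]

successors = defaultdict(set)    # node -> nodes it points to
predecessors = defaultdict(set)  # node -> nodes pointing to it
for src, _op, dst in EDGES:
    successors[src].add(dst)
    predecessors[dst].add(src)


def reachable(start, neighbors):
    """Breadth-first search: every node reachable from start, excluding start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def forward_track(node):
    """Impact scope: everything downstream of the node."""
    return reachable(node, successors)


def backward_track(node):
    """Possible entry points: everything upstream of the node."""
    return reachable(node, predecessors)
```

Starting from `Process:malware.exe`, `forward_track` returns the password file and the C2 server, while `backward_track` walks back through `File:malware.exe` and `Process:outlook.exe` to `Network:attacker_ip`.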

# Why Provenance Graphs Are Needed

# Limitations of Traditional Detection Methods

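Traditional intrusion detection systems (IDS) largely rely on matching individual events against known patterns, which leads to:

  • Lack of context: each event is judged in isolation, without the events that caused it or followed from it
  • High false-positive rates: benign activity can match the same single-event patterns
  • Difficulty detecting multi-stage APT attacks, whose individual steps can each look harmless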

# Advantages of Provenance Graphs

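  • Causal correlation: events are linked through the entities they share, so related activity can be analyzed together
  • Attack reconstruction: tracing dependencies recovers the chain from entry point to impact
  • Anomaly detection: unusual structures or behaviors stand out against the graph of normal activity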

# Standard Datasets

# DARPA Transparent Computing (E3 / E5)

  • Multi-day audit logs with labeled APT scenarios
  • Released in the Common Data Model (CDM) format
  • Widely used benchmark

# OpTC

  • Enterprise-scale dataset (~1000 hosts)
  • Billions of events
  • Suitable for scalability evaluation
  • Fine-grained node-level and edge-level annotations
  • Improved labeling quality

# Construction Pipeline

```
Raw Logs → Parsing → Entity Identification → Relation Extraction → Graph Construction → Graph Compression
```

Common tools:

  • NetworkX
  • Neo4j
  • DGL
  • PyTorch Geometric
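
The pipeline stages can be sketched end to end in a few lines of standard-library Python. Everything specific here is an assumption made for illustration: the comma-separated log format (`timestamp,subject,operation,object`), the sample log lines, and the function names. Real audit logs such as CDM records are far richer, and merging repeated edges is only one simple compression heuristic among many.

```python
RAW_LOGS = """\
100,process:outlook.exe,write,file:malware.exe
101,process:outlook.exe,write,file:malware.exe
105,process:malware.exe,read,file:passwd
malformed line
110,process:malware.exe,connect,socket:C2_server
"""


def parse(raw):
    """Parsing: keep only lines that match the assumed four-field format."""
    events = []
    for line in raw.splitlines():
        fields = line.strip().split(",")
        if len(fields) != 4 or not fields[0].isdigit():
            continue  # skip lines that do not match the assumed format
        ts, subject, op, obj = fields
        events.append((int(ts), subject, op, obj))
    return events


def build_graph(events):
    """Entity identification, relation extraction, and graph construction."""
    nodes = set()
    edges = []  # (source, edge type, target, timestamp)
    for ts, subject, op, obj in events:
        nodes.update((subject, obj))
        edges.append((subject, op, obj, ts))
    return nodes, edges


def compress(edges):
    """Graph compression: merge repeated (source, type, target) edges,
    keeping the first timestamp, last timestamp, and occurrence count."""
    merged = {}
    for src, op, dst, ts in edges:
        key = (src, op, dst)
        if key in merged:
            first, last, count = merged[key]
            merged[key] = (min(first, ts), max(last, ts), count + 1)
        else:
            merged[key] = (ts, ts, 1)
    return merged


events = parse(RAW_LOGS)
nodes, edges = build_graph(events)
compressed = compress(edges)
```

The resulting node set and edge list can be loaded into any of the tools listed above, for example as nodes and attributed edges of a NetworkX directed graph.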

# Detection Methods

# Rule-Based

SLEUTH, Morse, RapSheet
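
To give a feel for the rule-based style, here is a toy "download and execute" rule written over a list of events. It is a simplified sketch and not the actual policy of SLEUTH, Morse, or RapSheet. The event tuples, the `Process:backup.exe` entry, and the function name are hypothetical, and edge directions follow the simple example earlier in this guide.

```python
EVENTS = [
    # (timestamp, source, edge type, target)
    (1, "Network:attacker_ip", "recv", "Process:outlook.exe"),
    (2, "Process:outlook.exe", "write", "File:malware.exe"),
    (3, "File:malware.exe", "execute", "Process:malware.exe"),
    (4, "Process:backup.exe", "write", "File:archive.zip"),
]


def download_and_execute(events):
    """Flag executions of files written by a process that received network data."""
    network_exposed = set()  # processes that have received network data
    untrusted_files = set()  # files written by such processes
    alerts = []
    for _ts, src, op, dst in sorted(events):  # process events in time order
        if op == "recv":
            network_exposed.add(dst)
        elif op == "write" and src in network_exposed:
            untrusted_files.add(dst)
        elif op == "execute" and src in untrusted_files:
            alerts.append((src, dst))
    return alerts
```

The rule needs three causally linked events to fire, which is the kind of context a single-event signature cannot express.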

# Statistical / Anomaly-Based

StreamSpot, ProvDetector, Unicorn
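
The common idea behind anomaly-based approaches is to learn what normal activity looks like and score deviations from it. The sketch below is a deliberately simple frequency-based illustration, not the algorithm of StreamSpot, ProvDetector, or Unicorn. The benign events, the scoring formula, and the function names are all assumptions.

```python
from collections import Counter

# Hypothetical benign training events: (process, edge type, object).
BENIGN_EVENTS = [
    ("outlook.exe", "read", "inbox.db"),
    ("outlook.exe", "read", "inbox.db"),
    ("outlook.exe", "connect", "mail_server"),
    ("backup.exe", "read", "inbox.db"),
]


def fit(events):
    """Count how often each event was seen during benign operation."""
    return Counter(events)


def anomaly_score(event, counts):
    """Score in (0, 1]: 1.0 for never-seen events, lower for frequent ones."""
    return 1.0 / (1 + counts[event])


counts = fit(BENIGN_EVENTS)
```

An event never seen in training, such as `("outlook.exe", "write", "malware.exe")`, scores 1.0, while the frequently observed inbox read scores lower.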

# GNN-Based

ThreaTrace, MAGIC, Kairos, Flash

# Embedding-Based

ShadeWatcher, ProvNinja

# LLM-Assisted (Emerging)

Graph-to-text reasoning and attack narrative generation


# Evaluation

Metrics:

  • Precision
  • Recall
  • F1-score
  • False Positive Rate
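
These four metrics all derive from the counts of true positives, false positives, false negatives, and true negatives. As a minimal sketch, assuming detections and ground truth are given as sets of item identifiers (nodes, edges, or graphs, depending on the evaluation level), they can be computed as follows. The function and parameter names are hypothetical.

```python
def detection_metrics(predicted, actual, all_items):
    """Compute precision, recall, F1-score, and false positive rate.

    predicted: set of items flagged as malicious
    actual:    set of items that are truly malicious
    all_items: set of every item that was evaluated
    """
    tp = len(predicted & actual)              # flagged and malicious
    fp = len(predicted - actual)              # flagged but benign
    fn = len(actual - predicted)              # malicious but missed
    tn = len(all_items - predicted - actual)  # benign and not flagged

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}
```

Because benign items vastly outnumber malicious ones in provenance data, even a small false positive rate can translate into a large absolute number of false alerts.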

Evaluation levels:

  • Node-level
  • Edge-level
  • Graph-level
  • Scenario-level

The Orthrus framework evaluates:

  • Effectiveness
  • Timeliness
  • Scalability
  • Robustness
  • Generalizability

# Research Directions

  • Real-time detection
  • Cross-host attack analysis
  • Adversarial robustness
  • Concept drift adaptation
  • Interpretability
  • Efficient storage and compression

Suggested first steps:

  1. Start with the DARPA E3 dataset
  2. Build graphs using Python + NetworkX
  3. Visualize with Graphviz or Gephi
  4. Reproduce a baseline method

Licensed under CC BY-NC-SA 4.0