Trace Anomaly Detection
A microservice trace anomaly detection tool based on Graph Neural Networks. It uses Graph Attention Networks (GAT) to detect anomalies in microservice trace data.
Features
Anomaly detection based on Graph Attention Network (GAT)
Support for structured and attributed modeling of microservice trace data
Automated model training and evaluation pipeline
Rich evaluation metrics and visualization charts
Support for multiple aggregation methods in prediction
Dependencies
The project requires Python 3.13+ and depends on the following main libraries:
PyTorch & PyTorch Geometric: Deep learning and graph neural network frameworks
Pandas & NumPy: Data processing
Scikit-learn: Machine learning evaluation
Matplotlib: Visualization
Typer: Command line interface
Installation
Install dependencies using uv:
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install project dependencies
uv sync
Data Format
Training and testing data should be in Parquet format and contain the following fields:
Required fields:
trace_id: Trace ID
span_id: Span ID
parent_span_id: Parent Span ID
span_name: Span name
primary_service: Primary service name
start_time: Start time
duration: Duration
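The span_id and parent_span_id fields are what make it possible to treat a trace as a graph. A minimal sketch of that idea, using only the standard library: spans are grouped by trace_id, each span becomes a node, and each parent/child pair becomes a directed edge. The function name build_trace_graphs and the sample records are hypothetical, and the project's own preprocessing may build its graphs differently.

```python
def build_trace_graphs(spans):
    """Group span records by trace_id and derive parent -> child edges.

    Assumption: a span whose parent_span_id is missing, empty, or not
    present in the same trace is treated as having no incoming edge.
    """
    graphs = {}
    # First pass: give every span a node index inside its own trace.
    for span in spans:
        graph = graphs.setdefault(span["trace_id"], {"index": {}, "edges": []})
        graph["index"][span["span_id"]] = len(graph["index"])
    # Second pass: add one (parent_index, child_index) edge per span.
    for span in spans:
        graph = graphs[span["trace_id"]]
        parent = span.get("parent_span_id")
        if parent in graph["index"]:
            child = graph["index"][span["span_id"]]
            graph["edges"].append((graph["index"][parent], child))
    return graphs


# Hypothetical sample records, for illustration only.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_span_id": None},
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a"},
    {"trace_id": "t1", "span_id": "c", "parent_span_id": "a"},
    {"trace_id": "t2", "span_id": "x", "parent_span_id": None},
]
graphs = build_trace_graphs(spans)
print(graphs["t1"]["edges"])
```

Index-based edges like these are the usual input shape for graph libraries, which is why the sketch maps span IDs to integers first.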
Aggregated feature fields:
span_count: Number of spans
total_duration: Total duration
avg_duration: Average duration
max_duration: Maximum duration
min_duration: Minimum duration
duration_std: Duration standard deviation
unique_services: Number of unique services
unique_spans: Number of unique spans
error_rate: Error rate
root_span: Root span name
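As an illustration of how the aggregated fields relate to the per-span fields, the sketch below derives most of them for a single trace. It is an assumption-laden example, not the project's code: the function name aggregate_trace_features is hypothetical, total_duration is taken as the sum of span durations, duration_std is the population standard deviation, unique_services is counted from primary_service, and error_rate is omitted because the required fields listed above contain no error or status column.

```python
from statistics import mean, pstdev


def aggregate_trace_features(spans):
    """Compute trace-level features from the span records of one trace."""
    durations = [s["duration"] for s in spans]
    # Assumption: the root span is the one without a parent_span_id.
    roots = [s for s in spans if not s.get("parent_span_id")]
    return {
        "span_count": len(spans),
        "total_duration": sum(durations),
        "avg_duration": mean(durations),
        "max_duration": max(durations),
        "min_duration": min(durations),
        "duration_std": pstdev(durations),
        "unique_services": len({s["primary_service"] for s in spans}),
        "unique_spans": len({s["span_name"] for s in spans}),
        "root_span": roots[0]["span_name"] if roots else None,
    }


# Hypothetical sample trace, for illustration only.
trace = [
    {"span_id": "a", "parent_span_id": None, "span_name": "GET /order",
     "primary_service": "gateway", "duration": 30},
    {"span_id": "b", "parent_span_id": "a", "span_name": "query",
     "primary_service": "orders", "duration": 20},
    {"span_id": "c", "parent_span_id": "a", "span_name": "query",
     "primary_service": "orders", "duration": 10},
]
features = aggregate_trace_features(trace)
print(features)
```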
Usage
Model Training
Train the anomaly detection model on the training data:
# Basic training command
uv run python cli.py train --data data/training_data.parquet
# Custom training parameters
uv run python cli.py train \
--data data/training_data.parquet \
--output models \
--epochs 150 \
--learning-rate 0.001 \
--latent-dim 32 \
--batch-size 2
Training parameter description:
--data: Training data file path (required)
--output: Model output directory (default: models)
--epochs: Number of training epochs (default: 100)
--learning-rate: Learning rate (default: 0.005)
--latent-dim: Latent space dimension (default: 16)
--batch-size: Batch size (default: 1)
When training completes, the following files are saved to the specified output directory:
model.pth: Trained neural network model
processor.joblib: Data preprocessor
config.joblib: Model configuration
trace_config.joblib: Data configuration
Model Evaluation
Evaluate anomaly detection performance on the test data:
# Basic evaluation command
uv run python cli.py evaluate
# Specify model and aggregation method
uv run python cli.py evaluate \
--model models \
--aggregation max \
--output-dir evaluation_results
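The --aggregation option accepts max, mean, and percentile_95. The README does not state which scores are aggregated, so the sketch below assumes a list of per-span anomaly scores for one trace being reduced to a single trace-level score. The function name aggregate_scores is hypothetical, and the percentile is computed with linear interpolation between closest ranks, which may differ from what the project does.

```python
import math


def aggregate_scores(scores, method="max"):
    """Reduce a non-empty list of scores to one value."""
    if not scores:
        raise ValueError("scores must not be empty")
    if method == "max":
        return max(scores)
    if method == "mean":
        return sum(scores) / len(scores)
    if method == "percentile_95":
        ordered = sorted(scores)
        # Fractional position of the 95th percentile in the sorted list.
        pos = 0.95 * (len(ordered) - 1)
        lo, hi = math.floor(pos), math.ceil(pos)
        # Interpolate linearly between the two neighbouring values.
        return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)
    raise ValueError(f"unknown aggregation method: {method}")


# Hypothetical scores: four ordinary spans and one outlier.
scores = [0.1, 0.2, 0.3, 0.4, 2.0]
results = {m: aggregate_scores(scores, m) for m in ("max", "mean", "percentile_95")}
print(results)
```

With a single outlier span, max reacts most strongly, mean dilutes the outlier across the trace, and percentile_95 sits between the two.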
Evaluation parameter description:
--model: Model directory path (default: models)
--aggregation: Aggregation method, options: max, mean, percentile_95 (default: max)
--output-dir: Evaluation results output directory (default: evaluation_results)
Test data structure requirements:
Evaluation Results
After evaluation is complete, the following files will be generated:
Main evaluation metrics:
Project Architecture
Model Principle
This project implements an anomaly detection method based on Graph Attention Networks:
Example Workflow