# Running the Snakemake Pipeline
This tutorial covers how to use NeuRodent’s Snakemake pipeline for large-scale, reproducible EEG analysis workflows.
## What is Snakemake?

Snakemake is a workflow management system for building reproducible, scalable data analyses. Key features:

- **Automatic dependency tracking**: Only recomputes what's needed when data or parameters change
- **Parallel execution**: Processes multiple samples simultaneously
- **Fault tolerance**: Resumes from where it left off after failures
- **Resource management**: Specify memory, CPU, and time requirements for each step
- **Cluster integration**: Seamlessly scales from laptops to HPC clusters
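To make this concrete, here is a minimal, hypothetical rule in Snakemake's Python-based DSL. The rule name, paths, and script are illustrative sketches, not NeuRodent's actual Snakefile:

```python
# Hypothetical rule: Snakemake infers the dependency graph from the
# input/output declarations, so if the raw recording changes, this rule
# and everything downstream of it is rerun automatically.
rule war_generation:
    input:
        "data/{animal}/raw_eeg.edf"  # illustrative raw-data layout
    output:
        "results/wars/{animal}.pkl"
    threads: 10  # resource requests used for local and cluster scheduling
    resources:
        mem_mb=70_000,
        time="3h",
    shell:
        "python scripts/generate_war.py {input} {output}"
```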
## Why Use the Snakemake Pipeline?

For experiments with many animals and days of recordings, manual analysis becomes impractical. The NeuRodent pipeline automates:

- Raw EEG data → Windowed Analysis Results (WARs)
- Quality filtering and artifact removal
- Diagnostic figure generation
- Statistical analyses and publication-ready plots
- All intermediate processing steps
## Installation

Ensure you have NeuRodent and Snakemake installed:

```bash
# Install both with uv
uv add neurodent snakemake

# Or with pip
pip install neurodent snakemake
```
## Pipeline Overview

The pipeline processes data through multiple stages:

```text
Raw EEG Data
    ↓
WAR Generation (windowed features)
    ↓
Quality Filtering (remove bad days/genotypes)
    ↓
Standardization (channel ordering)
    ↓
Fragment Filtering (temporal artifacts)
    ↓
Channel Filtering (spatial artifacts)
    ↓
WAR Flattening (aggregate across time)
    ↓
Zeitgeber Analysis (circadian features)
    ↓
Statistical Figures & Plots
```
Each stage produces intermediate files that are automatically cached. If you modify parameters or data, Snakemake only reruns affected steps.
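Continuing the hypothetical sketch from above, this caching falls out of the rule graph: a downstream stage declares the previous stage's output as its input, and Snakemake compares file timestamps to decide what actually needs to run:

```python
# Hypothetical downstream stage: consumes the cached WAR produced by the
# war_generation sketch earlier. If only this rule changes, the upstream
# pickle is reused from disk instead of being recomputed.
rule quality_filter:
    input:
        "results/wars/{animal}.pkl"
    output:
        "results/wars_quality_filtered/{animal}.pkl"
    shell:
        "python scripts/quality_filter.py {input} {output}"
```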
## Configuration

The pipeline requires two configuration files in the `config/` directory:

### 1. `config/config.yaml` - Analysis Parameters

This file defines paths, analysis settings, and computational resources:
```yaml
# Core directories
base_folder: "/path/to/neurodent/output"
data_parent_folder: "/path/to/raw/eeg/data"
temp_directory: "/path/to/temp/storage"

# Sample configuration
samples:
  samples_file: "config/samples.json"
  quality_filter:
    exclude_unknown_genotypes: true
    exclude_bad_animaldays: true

# Analysis parameters
analysis:
  war_generation:
    mode: "nest"
    lro_kwargs:
      mode: "bin"
      multiprocess_mode: "dask"
  fragment_filter_config:
    logrms_range:
      z_range: 3
    high_beta:
      max_beta_prop: 0.4
  channel_filter_config:
    lof:
      reject_lof_threshold: 2.5
    min_valid_channels: 3

# Resource requirements (adjust for your system)
cluster:
  war_generation:
    time: "3h"
    mem_mb: 70_000
    threads: 10
```
Key parameters:

- **Paths**: Set `base_folder`, `data_parent_folder`, and `temp_directory` for your system
- **Analysis**: Configure filtering thresholds, detection parameters, etc.
- **Resources**: Memory, time, and CPU requirements (used for cluster scheduling)
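Before launching a long run, it can be worth sanity-checking these paths from Python. A minimal sketch, assuming PyYAML is installed and the key names match the example above:

```python
# Sanity check: verify that the core directories in config.yaml exist.
# Assumes PyYAML; key names follow the example config above.
from pathlib import Path

import yaml

with open("config/config.yaml") as f:
    cfg = yaml.safe_load(f)

for key in ("base_folder", "data_parent_folder", "temp_directory"):
    path = Path(cfg[key])
    print(f"{key}: {path} [{'OK' if path.exists() else 'MISSING'}]")
```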
### 2. `config/samples.json` - Sample Metadata

This file defines which animals to process and their experimental metadata:
```json
{
  "data_parent_folder": "/path/to/raw/eeg/data",
  "data_folders_to_animal_ids": {
    "experiment1_cohort1": ["M1", "M2", "F1", "F2"],
    "experiment2_cohort2": ["M3", "M4", "F3"]
  },
  "GENOTYPE_ALIASES": {
    "MWT": ["M1", "M3"],
    "FWT": ["F1", "F3"],
    "MMut": ["M2", "M4"],
    "FMut": ["F2"]
  },
  "bad_channels": {
    "experiment1_cohort1 M1": {
      "M1 MWT Jan-08-2022": ["LHip", "RHip"]
    }
  },
  "bad_folder_animalday": [
    "experiment1_cohort1 M2"
  ]
}
```
Key sections:

- `data_folders_to_animal_ids`: Maps recording folders to animal IDs
- `GENOTYPE_ALIASES`: Groups animals by genotype for statistical comparisons
- `bad_channels`: Per-session bad channel lists for manual artifact rejection
- `bad_folder_animalday`: Animal/day combinations to exclude entirely from analysis
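Since this is plain JSON, it is easy to double-check before a run. A small sketch that lists each animal with its genotype, assuming the structure shown above:

```python
# Sketch: invert GENOTYPE_ALIASES to look up each animal's genotype.
# The structure follows the samples.json example above.
import json

with open("config/samples.json") as f:
    samples = json.load(f)

genotype_of = {
    animal: genotype
    for genotype, animals in samples["GENOTYPE_ALIASES"].items()
    for animal in animals
}

for folder, animals in samples["data_folders_to_animal_ids"].items():
    for animal in animals:
        print(folder, animal, genotype_of.get(animal, "UNKNOWN"))
```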
## Running the Pipeline

### Basic Execution

Run the pipeline on your local machine or compute node:

```bash
# Dry run to preview what will execute
snakemake --dry-run

# Execute with 4 parallel jobs
snakemake --cores 4

# Run a specific analysis stage
snakemake --cores 4 results/wars_quality_filtered/

# Force rerun of a specific rule and downstream steps
snakemake --cores 4 --forcerun war_generation
```
### Useful Commands

```bash
# Visualize the workflow
snakemake --dag | dot -Tpng > dag.png
snakemake --rulegraph | dot -Tpng > rulegraph.png

# Unlock the working directory after a crash
snakemake --unlock

# Delete all pipeline outputs (use with caution)
snakemake --delete-all-output

# Summarize output files and their status
snakemake --summary
```
### Monitoring Progress

Check the `logs/` directory for detailed output from each rule:

```bash
# View real-time logs
tail -f logs/war_generation_*.log

# Check the results directory
ls -lh results/
```
## Running on HPC Clusters
For large-scale analyses, you can submit pipeline jobs to an HPC cluster scheduler. Snakemake supports multiple schedulers including SLURM, SGE, PBS, LSF, and others.
### General Cluster Execution

Snakemake can interface with any job scheduler using profiles. The key steps are:

1. Define resource requirements in `config.yaml` (time, memory, threads)
2. Create a cluster profile for your scheduler
3. Submit the pipeline using the profile

Snakemake will then automatically:

- Submit each rule as a separate job
- Manage job dependencies
- Monitor job status
- Resubmit failed jobs (with `restart-times`)
### Example: SLURM Scheduler

Here's how to set up Snakemake with SLURM (adapt for your scheduler):

```bash
# Create a SLURM profile (one-time setup)
mkdir -p ~/.config/snakemake/slurm
cat > ~/.config/snakemake/slurm/config.yaml << 'EOF'
cluster: "sbatch --time={cluster.time} --mem={cluster.mem_mb} --cpus-per-task={threads} --job-name={rule} --output=logs/{rule}_%j.out"
jobs: 100
restart-times: 3
EOF

# Submit pipeline
snakemake --profile slurm
```
The pipeline reads resource requirements from your `config.yaml`:

```yaml
cluster:
  war_generation:
    time: "3h"
    mem_mb: 70_000
    threads: 10
```
And Snakemake translates these to scheduler-specific flags.
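One way this wiring can look inside a Snakefile is to forward the `cluster:` section into each rule's `threads` and `resources`. This is a hedged sketch; NeuRodent's actual Snakefile and profile may wire this differently (e.g. via `--cluster-config`):

```python
# Hypothetical sketch: pull per-rule settings out of config.yaml's
# cluster: section so the scheduler profile can substitute them into
# its submission command at job-submission time.
configfile: "config/config.yaml"

_war_res = config["cluster"]["war_generation"]

rule war_generation:
    input:
        "data/{animal}/raw_eeg.edf"
    output:
        "results/wars/{animal}.pkl"
    threads: _war_res["threads"]
    resources:
        mem_mb=_war_res["mem_mb"],
        time=_war_res["time"],
    shell:
        "python scripts/generate_war.py {input} {output}"
```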
### Monitoring Cluster Jobs

For SLURM:

```bash
# Check job queue
squeue -u $USER

# View job details
sacct --format=JobID,JobName,State,Elapsed,MaxRSS

# Monitor specific rule logs
tail -f logs/war_generation_*.out
```

For SGE:

```bash
# Check job queue
qstat -u $USER

# View job details
qacct -j <job_id>
```

For PBS:

```bash
# Check job queue
qstat -u $USER

# View job details
qstat -f <job_id>
```
### Other Schedulers

Snakemake supports many schedulers. See the Snakemake cluster documentation for:

- SGE (Sun Grid Engine)
- PBS/Torque
- LSF (Load Sharing Facility)
- Cloud executors (Google Cloud, AWS, Azure)

The general pattern is always the same: create a profile with your scheduler's submission command and resource mappings.
## Pipeline Outputs

The pipeline creates a structured results directory:

```text
results/
├── wars_quality_filtered/        # Initial WARs with quality filtering applied
├── wars_standardized/            # Channel-reordered WARs
├── wars_fragment_filtered/       # After temporal artifact removal
├── wars_flattened_manual/        # Aggregated with manual channel filtering
├── wars_flattened_lof/           # Aggregated with LOF channel filtering
├── diagnostic_figures/           # Per-animal QC plots
│   └── {animal}/
│       ├── unfiltered/           # Pre-filtering diagnostics
│       └── filtered/             # Post-filtering diagnostics
├── wars_zeitgeber/               # Circadian time-aligned features
├── zeitgeber_plots/              # Temporal heatmaps
├── relfreq_plots/                # Feature distributions
├── ep_figures/                   # Statistical group comparisons
├── ep_heatmaps/                  # Connectivity matrices
├── lof_evaluation/               # LOF threshold optimization
└── filtering_comparison_plots/   # Manual vs LOF comparison

logs/
└── {rule}_*.log                  # Execution logs for each rule
```
Key outputs:

- `.pkl` files: Python-pickled WARs with computed features
- `.json` files: Metadata and analysis parameters
- Figures: PNG/PDF diagnostic and publication plots
- CSV files: Exportable tabular data
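For quick interactive inspection, the pickled WARs can be loaded directly in Python. A minimal sketch (the filename is hypothetical, and the object type depends on NeuRodent's internals):

```python
# Minimal sketch: load one flattened WAR pickle for inspection.
# The filename is hypothetical; adjust to your results layout.
import pickle
from pathlib import Path

war_path = Path("results/wars_flattened_lof/M1.pkl")
with war_path.open("rb") as f:
    war = pickle.load(f)

# See what NeuRodent stored before writing custom analysis code
print(type(war))
print([name for name in dir(war) if not name.startswith("_")][:20])
```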
## Troubleshooting & Best Practices

### Common Issues

**Pipeline is locked:**

```bash
snakemake --unlock
```

**Need to rerun everything:**

```bash
snakemake --forceall --cores 4
```

**Jobs failing due to insufficient memory:**

Increase memory in `config.yaml` under the `cluster` section for the failing rule.

**Want to test changes without running the full pipeline:**

```bash
# Run just one rule
snakemake --cores 4 war_generation

# Or target a specific output
snakemake --cores 4 results/wars_quality_filtered/animal1/
```
### Best Practices

1. **Always dry-run first** to preview execution: `snakemake --dry-run`

2. **Test on a small subset before the full dataset** (see the sketch after this list):

   - Temporarily modify `samples.json` to include just 1-2 animals
   - Run locally to verify parameters
   - Expand to the full dataset once validated

3. **Monitor resource usage:**

   - Check logs for memory errors
   - Adjust `cluster` resources in `config.yaml` as needed
   - Most rules should complete within their time limits

4. **Keep intermediate files:**

   - Don't delete results directories - Snakemake uses them for caching
   - If rerunning from scratch, use `--forceall` instead of deleting

5. **Version control your configs:**

   - Track `config/config.yaml` and `config/samples.json` in git
   - Document parameter changes for reproducibility
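As referenced in practice 2, a small helper can generate a pilot version of the sample metadata without touching the original. A hedged sketch (key names follow the `samples.json` example above; point `samples_file` in `config.yaml` at the pilot file for the test run):

```python
# Hypothetical helper: write a pilot samples file containing only the
# first two animals of the first cohort listed in samples.json.
import json

with open("config/samples.json") as f:
    samples = json.load(f)

folder, animals = next(iter(samples["data_folders_to_animal_ids"].items()))
samples["data_folders_to_animal_ids"] = {folder: animals[:2]}

with open("config/samples_pilot.json", "w") as f:
    json.dump(samples, f, indent=2)

print(f"Pilot samples file written: {folder} -> {animals[:2]}")
```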
## Next Steps

Now that you understand the Snakemake pipeline:

- **Explore outputs**: Review diagnostic figures in `results/diagnostic_figures/` to assess data quality
- **Customize analysis**: Modify parameters in `config/config.yaml` for your experimental needs
- **Interactive analysis**: Load `.pkl` WAR files in Python for custom analyses
- **Learn more**: Check out other tutorials:
  - Windowed Analysis - Understanding feature extraction
  - Visualization - Creating custom plots
  - Basic Usage - Interactive Python workflow
## Additional Resources

- Snakemake Profiles - Community-maintained scheduler profiles