Running the Snakemake Pipeline#

This tutorial covers how to use NeuRodent’s Snakemake pipeline for large-scale, reproducible EEG analysis workflows.

What is Snakemake?#

Snakemake is a workflow management system that creates reproducible, scalable data analyses. Key features:

  • Automatic dependency tracking: Only recomputes what’s needed when data or parameters change

  • Parallel execution: Processes multiple samples simultaneously

  • Fault tolerance: Resumes from where it left off after failures

  • Resource management: Specify memory, CPU, and time requirements for each step

  • Cluster integration: Seamlessly scales from laptops to HPC clusters

Why Use the Snakemake Pipeline?#

For experiments with many animals and days of recordings, manual analysis becomes impractical. The NeuRodent pipeline automates:

  1. Raw EEG data → Windowed Analysis Results (WARs)

  2. Quality filtering and artifact removal

  3. Diagnostic figure generation

  4. Statistical analyses and publication-ready plots

  5. All intermediate processing steps

Installation#

Ensure you have NeuRodent and Snakemake installed:

# Install both with uv
uv add neurodent snakemake

# Or with pip
pip install neurodent snakemake

Pipeline Overview#

The pipeline processes data through multiple stages:

Raw EEG Data
    ↓
WAR Generation (windowed features)
    ↓
Quality Filtering (remove bad days/genotypes)
    ↓
Standardization (channel ordering)
    ↓
Fragment Filtering (temporal artifacts)
    ↓
Channel Filtering (spatial artifacts)
    ↓
WAR Flattening (aggregate across time)
    ↓
Zeitgeber Analysis (circadian features)
    ↓
Statistical Figures & Plots

Each stage produces intermediate files that are automatically cached. If you modify parameters or data, Snakemake only reruns affected steps.

Configuration#

The pipeline requires two configuration files in the config/ directory:

1. config/config.yaml - Analysis Parameters#

This file defines paths, analysis settings, and computational resources:

# Core directories
base_folder: "/path/to/neurodent/output"
data_parent_folder: "/path/to/raw/eeg/data"
temp_directory: "/path/to/temp/storage"

# Sample configuration
samples:
  samples_file: "config/samples.json"
  quality_filter:
    exclude_unknown_genotypes: true
    exclude_bad_animaldays: true

# Analysis parameters
analysis:
  war_generation:
    mode: "nest"
    lro_kwargs:
      mode: "bin"
      multiprocess_mode: "dask"
  
  fragment_filter_config:
    logrms_range:
      z_range: 3
    high_beta:
      max_beta_prop: 0.4
  
  channel_filter_config:
    lof:
      reject_lof_threshold: 2.5
      min_valid_channels: 3

# Resource requirements (adjust for your system)
cluster:
  war_generation:
    time: "3h"
    mem_mb: 70_000
    threads: 10

Key parameters:

  • Paths: Set base_folder, data_parent_folder, and temp_directory for your system

  • Analysis: Configure filtering thresholds, detection parameters, etc.

  • Resources: Memory, time, and CPU requirements (used for cluster scheduling)
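
If you want to sanity-check config/config.yaml before launching a run, a short Python snippet can load it and report the resolved paths and per-rule resource requests. This is a minimal, illustrative sketch (the script name is arbitrary); it assumes PyYAML is installed and that the file follows the layout shown above:

# check_config.py - quick sanity check of config/config.yaml (illustrative sketch)
from pathlib import Path

import yaml

with open("config/config.yaml") as f:
    config = yaml.safe_load(f)

# Verify that the core directories exist before submitting any jobs
for key in ("base_folder", "data_parent_folder", "temp_directory"):
    path = Path(config[key])
    print(f"{key}: {path} (exists: {path.exists()})")

# Print the per-rule resource requests used for cluster scheduling
for rule, resources in config.get("cluster", {}).items():
    print(f"{rule}: {resources}")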

2. config/samples.json - Sample Metadata#

This file defines which animals to process and their experimental metadata:

{
  "data_parent_folder": "/path/to/raw/eeg/data",
  "data_folders_to_animal_ids": {
    "experiment1_cohort1": ["M1", "M2", "F1", "F2"],
    "experiment2_cohort2": ["M3", "M4", "F3"]
  },
  "GENOTYPE_ALIASES": {
    "MWT": ["M1", "M3"],
    "FWT": ["F1", "F3"],
    "MMut": ["M2", "M4"],
    "FMut": ["F2"]
  },
  "bad_channels": {
    "experiment1_cohort1 M1": {
      "M1 MWT Jan-08-2022": ["LHip", "RHip"]
    }
  },
  "bad_folder_animalday": [
    "experiment1_cohort1 M2"
  ]
}

Key sections:

  • data_folders_to_animal_ids: Maps recording folders to animal IDs

  • GENOTYPE_ALIASES: Groups animals by genotype for statistical comparisons

  • bad_channels: Per-session bad channel lists for manual artifact rejection

  • bad_folder_animalday: Whole animals or animal-days to exclude from the analysis entirely
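
Mistakes in samples.json (a mistyped animal ID, an animal missing from GENOTYPE_ALIASES) often surface only once jobs start failing, so it can be worth validating the file up front. The snippet below is an illustrative sketch, assuming the structure shown above:

# check_samples.py - cross-check samples.json before a run (illustrative sketch)
import json

with open("config/samples.json") as f:
    samples = json.load(f)

# All animal IDs declared across recording folders
all_animals = {a for ids in samples["data_folders_to_animal_ids"].values() for a in ids}

# All animal IDs covered by a genotype group
genotyped = {a for ids in samples["GENOTYPE_ALIASES"].values() for a in ids}

print(f"{len(all_animals)} animals across {len(samples['data_folders_to_animal_ids'])} folders")
for genotype, ids in samples["GENOTYPE_ALIASES"].items():
    print(f"  {genotype}: {sorted(ids)}")

missing = all_animals - genotyped
if missing:
    print(f"Warning: no genotype assigned for {sorted(missing)}")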

Running the Pipeline#

Basic Execution#

Run the pipeline on your local machine or compute node:

# Dry run to preview what will execute
snakemake --dry-run

# Execute with 4 parallel jobs
snakemake --cores 4

# Run a specific analysis stage
snakemake --cores 4 results/wars_quality_filtered/

# Force rerun of a specific rule and downstream steps
snakemake --cores 4 --forcerun war_generation

Useful Commands#

# Visualize the workflow
snakemake --dag | dot -Tpng > dag.png
snakemake --rulegraph | dot -Tpng > rulegraph.png

# Unlock directory after crash
snakemake --unlock

# Remove all output files defined by the workflow (use with care)
snakemake --delete-all-output

# Summarize output files and whether they are up to date
snakemake --summary

Monitoring Progress#

Check the logs/ directory for detailed output from each rule:

# View real-time logs
tail -f logs/war_generation_*.log

# Check results directory
ls -lh results/

Running on HPC Clusters#

For large-scale analyses, you can submit pipeline jobs to an HPC cluster scheduler. Snakemake supports many schedulers, including SLURM, SGE, PBS, and LSF.

General Cluster Execution#

Snakemake can interface with any job scheduler using profiles. The key is to:

  1. Define resource requirements in config.yaml (time, memory, threads)

  2. Create a cluster profile for your scheduler

  3. Submit the pipeline using the profile

Snakemake will automatically:

  • Submit each rule as a separate job

  • Manage job dependencies

  • Monitor job status

  • Resubmit failed jobs (with restart-times)

Example: SLURM Scheduler#

Here’s how to set up Snakemake with SLURM (adapt for your scheduler):

# Create a SLURM profile (one-time setup)
mkdir -p ~/.config/snakemake/slurm

cat > ~/.config/snakemake/slurm/config.yaml << 'EOF'
cluster: "sbatch --time={cluster.time} --mem={cluster.mem_mb} --cpus-per-task={threads} --job-name={rule} --output=logs/{rule}_%j.out"
jobs: 100
restart-times: 3
EOF

# Submit pipeline
snakemake --profile slurm

The pipeline reads resource requirements from your config.yaml:

cluster:
  war_generation:
    time: "3h"
    mem_mb: 70_000
    threads: 10

Snakemake then translates these into the scheduler-specific flags defined in your profile.
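
To see concretely how the two pieces connect, the sketch below pulls the cluster entry for one rule out of config.yaml and fills in the same placeholders used by the profile's submission command. It is purely illustrative (the profile performs this substitution itself at submission time) and assumes PyYAML is installed:

# show_sbatch_flags.py - preview the flags a rule would be submitted with (illustrative)
import yaml

with open("config/config.yaml") as f:
    config = yaml.safe_load(f)

rule = "war_generation"
res = config["cluster"][rule]

# Mirrors the --time/--mem/--cpus-per-task placeholders in the profile above
print(
    f"sbatch --time={res['time']} --mem={res['mem_mb']} "
    f"--cpus-per-task={res['threads']} --job-name={rule}"
)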

Monitoring Cluster Jobs#

For SLURM:

# Check job queue
squeue -u $USER

# View job details
sacct --format=JobID,JobName,State,Elapsed,MaxRSS

# Monitor specific rule logs
tail -f logs/war_generation_*.out

For SGE:

# Check job queue
qstat -u $USER

# View job details
qacct -j <job_id>

For PBS:

# Check job queue
qstat -u $USER

# View job details
qstat -f <job_id>

Other Schedulers#

Snakemake supports many schedulers. See the Snakemake cluster documentation for:

  • SGE (Sun Grid Engine)

  • PBS/Torque

  • LSF (Load Sharing Facility)

  • Cloud executors (Google Cloud, AWS, Azure)

The general pattern is the same: create a profile with your scheduler’s submission command and resource mappings.

Pipeline Outputs#

The pipeline creates a structured results directory:

results/
├── wars_quality_filtered/       # Initial WARs with quality filtering applied
├── wars_standardized/            # Channel-reordered WARs
├── wars_fragment_filtered/       # After temporal artifact removal
├── wars_flattened_manual/        # Aggregated with manual channel filtering
├── wars_flattened_lof/           # Aggregated with LOF channel filtering
├── diagnostic_figures/           # Per-animal QC plots
│   └── {animal}/
│       ├── unfiltered/           # Pre-filtering diagnostics
│       └── filtered/             # Post-filtering diagnostics
├── wars_zeitgeber/               # Circadian time-aligned features
├── zeitgeber_plots/              # Temporal heatmaps
├── relfreq_plots/                # Feature distributions
├── ep_figures/                   # Statistical group comparisons
├── ep_heatmaps/                  # Connectivity matrices
├── lof_evaluation/               # LOF threshold optimization
└── filtering_comparison_plots/   # Manual vs LOF comparison

logs/
└── {rule}_*.log                  # Execution logs for each rule

Key outputs:

  • .pkl files: Python pickled WARs with computed features

  • .json files: Metadata and analysis parameters

  • Figures: PNG/PDF diagnostic and publication plots

  • CSV files: Exportable tabular data
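
The pickled WARs and JSON metadata load directly in Python for interactive follow-up analysis. A minimal sketch, assuming the results/ layout shown above (wars_flattened_lof is used as an example directory; adjust the paths to files that exist in your tree):

# load_outputs.py - open pipeline outputs for interactive analysis (illustrative sketch)
import json
import pickle
from pathlib import Path

results = Path("results")

# Load one flattened WAR (example directory from the tree above; adjust as needed)
war_files = sorted((results / "wars_flattened_lof").rglob("*.pkl"))
if war_files:
    with open(war_files[0], "rb") as f:
        war = pickle.load(f)
    print(f"Loaded {war_files[0]}: {type(war)}")

# Load any accompanying JSON metadata the same way
for meta_file in sorted((results / "wars_flattened_lof").rglob("*.json"))[:1]:
    with open(meta_file) as f:
        metadata = json.load(f)
    print(f"Metadata keys: {list(metadata)}")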

Troubleshooting & Best Practices#

Common Issues#

Pipeline is locked:

snakemake --unlock

Need to rerun everything:

snakemake --forceall --cores 4

Jobs failing due to insufficient memory: Increase mem_mb for the failing rule under the cluster section of config.yaml.

Want to test changes without full pipeline:

# Run just one rule
snakemake --cores 4 war_generation

# Or target specific output
snakemake --cores 4 results/wars_quality_filtered/animal1/

Best Practices#

  1. Always dry-run first to preview execution:

    snakemake --dry-run
    
  2. Test on small subset before full dataset:

    • Temporarily modify samples.json to include just 1-2 animals (see the helper sketch after this list)

    • Run locally to verify parameters

    • Expand to full dataset once validated

  3. Monitor resource usage:

    • Check logs for memory errors

    • Adjust cluster resources in config.yaml as needed

    • Most rules should complete within their time limits

  4. Keep intermediate files:

    • Don’t delete results directories; Snakemake checks existing outputs to decide what needs rerunning

    • If rerunning from scratch, use --forceall instead of deleting

  5. Version control your configs:

    • Track config/config.yaml and config/samples.json in git

    • Document parameter changes for reproducibility
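
For the small-subset test in point 2, one way to avoid hand-editing the full sample file is to derive a reduced copy from it. The helper below is an optional, illustrative sketch that assumes the samples.json structure shown earlier; the animal IDs in KEEP are placeholders for your own:

# make_test_samples.py - write a reduced samples.json for a smoke test (illustrative sketch)
import json

KEEP = {"M1", "F1"}  # hypothetical subset; replace with 1-2 animals from your own cohorts

with open("config/samples.json") as f:
    samples = json.load(f)

reduced = dict(samples)
reduced["data_folders_to_animal_ids"] = {
    folder: [a for a in ids if a in KEEP]
    for folder, ids in samples["data_folders_to_animal_ids"].items()
    if any(a in KEEP for a in ids)
}
reduced["GENOTYPE_ALIASES"] = {
    genotype: [a for a in ids if a in KEEP]
    for genotype, ids in samples["GENOTYPE_ALIASES"].items()
    if any(a in KEEP for a in ids)
}

with open("config/samples_test.json", "w") as f:
    json.dump(reduced, f, indent=2)

Point samples.samples_file in config/config.yaml at config/samples_test.json for the test run, then switch it back once the parameters are validated.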

Next Steps#

Now that you understand the Snakemake pipeline:

  • Explore outputs: Review diagnostic figures in results/diagnostic_figures/ to assess data quality

  • Customize analysis: Modify parameters in config/config.yaml for your experimental needs

  • Interactive analysis: Load .pkl WAR files in Python for custom analyses

  • Learn more: Check out the other tutorials in this documentation

Additional Resources#