Dataset Configuration#

NeuRodent supports multiple datasets and file formats through a flexible configuration system. This guide shows how to select and configure different datasets for your analyses.

Quick Start#

Switch datasets using an environment variable:

# Run with Sox5 binary dataset
NEURODENT_DATASET=sox5_bin uv run snakemake --profile your-profile

# Run with AP3B2 NWB dataset
NEURODENT_DATASET=ap3b2_nwb uv run snakemake --profile your-profile

# Run with AP3B2 RHD dataset
NEURODENT_DATASET=ap3b2_rhd uv run snakemake --profile your-profile

How It Works#

File structure:

config/
├── config.yaml               # Main config (shared settings)
├── config.local.yaml         # Local overrides (gitignored)
├── datasets/                 # Dataset-specific configs
│   ├── sox5_bin.yaml        # Sox5 project, binary format
│   ├── ap3b2_nwb.yaml       # AP3B2 project, NWB format
│   └── ap3b2_rhd.yaml       # AP3B2 project, RHD format
└── samples*.json            # Sample metadata files

How dataset selection works:

  1. Main config (config.yaml) contains shared settings for all datasets

  2. Local overrides (config.local.yaml) are merged if the file exists (can set active_dataset)

  3. Active dataset is resolved via NEURODENT_DATASET env var or the active_dataset key

  4. Dataset config is loaded from config/datasets/{active_dataset}.yaml

  5. Dataset config is deep-merged on top of the combined config

Merge order (later configs override earlier ones):

config.yaml → config.local.yaml → datasets/{active}.yaml
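The deep merge in step 5 can be sketched in a few lines of Python. This is a simplified illustration of the layering behavior, not NeuRodent's actual implementation, and the example keys are made up:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; override wins on conflicting keys."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Later layers override earlier ones:
# config.yaml -> config.local.yaml -> datasets/{active}.yaml
main_cfg = {"active_dataset": "sox5_bin",
            "analysis": {"war_generation": {"pattern": "{index}"}}}
local_cfg = {"active_dataset": "ap3b2_rhd"}
dataset_cfg = {"analysis": {"war_generation": {"lro_kwargs": {"mode": "si"}}}}

merged = deep_merge(deep_merge(main_cfg, local_cfg), dataset_cfg)
# Nested dicts are merged key-by-key rather than replaced wholesale,
# so the dataset config adds lro_kwargs without clobbering pattern.
```

Note that a deep merge preserves sibling keys at every level; a shallow `dict.update` would have replaced the whole `analysis` subtree.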

Switching Datasets#

There are three methods to select a dataset, each suited to a different use case.

Method 1: Environment Variable#

Best for: One-off runs, cluster jobs, and testing

Set NEURODENT_DATASET when invoking the pipeline:

NEURODENT_DATASET=ap3b2_rhd uv run snakemake --profile your-profile

Pros: No file edits, highest priority

Cons: Must be set for each command or shell session

Method 2: Edit config.yaml#

Best for: Setting team-wide default dataset

Edit config/config.yaml and change the active_dataset line:

# config/config.yaml
active_dataset: "ap3b2_rhd"  # Change this line

Pros: Set once, git-tracked

Cons: Affects all users who pull your changes

Method 3: Local Override#

Best for: Personal dataset preference

Edit config/config.local.yaml:

# config/config.local.yaml
active_dataset: "ap3b2_rhd"

Pros: Local-only (gitignored), doesn’t affect team

Cons: Can be forgotten

Priority Order#

If multiple methods are used, this is the priority:

  1. Environment variable (NEURODENT_DATASET) - highest priority

  2. Local config (config.local.yaml)

  3. Main config (config.yaml) - lowest priority
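The priority order amounts to a first-match lookup. A hypothetical sketch (the function name and arguments are illustrative, not NeuRodent's API):

```python
def resolve_active_dataset(env: dict, local_cfg: dict, main_cfg: dict) -> str:
    """Return the active dataset name, checking sources in priority order."""
    if env.get("NEURODENT_DATASET"):       # 1. environment variable
        return env["NEURODENT_DATASET"]
    if "active_dataset" in local_cfg:      # 2. config.local.yaml
        return local_cfg["active_dataset"]
    return main_cfg["active_dataset"]      # 3. config.yaml

# The env var wins even when both config files set active_dataset:
resolve_active_dataset({"NEURODENT_DATASET": "sox5_bin"},
                       {"active_dataset": "ap3b2_rhd"},
                       {"active_dataset": "ap3b2_nwb"})  # -> "sox5_bin"
```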

Verification#

Verify which dataset is active:

uv run snakemake --dry-run 2>&1 | head -20

Expected output (shows the dataset-specific overrides applied):

[ok] Using dataset: sox5_bin
  Config file: config/datasets/sox5_bin.yaml

  Dataset configuration overrides:
    samples:
      samples_file: "config/samples.json"
    analysis:
      war_generation:
        ...

Adding New Datasets#

Step 1: Create Samples JSON#

The samples JSON file defines the animals in your experiment and their metadata. Create config/samples_mydata.json using the unified animals list format:

{
    "data_root": "/path/to/your/data",
    "animals": [
        {"id": "M1", "gene": "WT", "sex": "M"},
        {"id": "F3", "gene": "KO", "sex": "F"}
    ]
}

Animals Parameter Reference#

Each entry in the animals list is a dictionary. The following table describes all available parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | Yes | Unique identifier for the animal (e.g. "M1", "AP3B2homo-240-M"). Used as the primary key throughout the pipeline. |
| gene | string | Yes | Genotype label (e.g. "WT", "KO", "Het"). Used to auto-generate GENOTYPE_ALIASES and for downstream grouping. |
| sex | string | Yes | Sex of the animal (e.g. "M", "F", "Male", "Female"). |
| manual_datetime | string | No | Recording start datetime. Accepts any standard datetime format, including ISO 8601 (e.g. "2025-05-10T10:00:00") and "YYYY-MM-DD HH:MM:SS". Use this when automatic datetime parsing from filenames is unreliable or when files lack embedded timestamps. See Manual Datetimes. |
| datetimes_are_start | bool | No | Whether manual_datetime represents the recording start time (true, default) or end time (false). See Manual Datetimes. |
| bad_channels | list or dict | No | Channels to exclude from analysis. A list marks channels bad across all sessions (["LHip", "RHip"]); a dict gives per-session bad channels ({"Session1": ["LHip"], "Session2": ["RMot"]}). See Bad Channels for details. |
| pattern | string | No | Per-animal file discovery pattern, overriding the global pattern in the dataset config. Supports {data_root}, {animal}, and {index} placeholders (e.g. "{data_root}/custom/{animal}_{index}.rhd"). |
| lro_kwargs | dict | No | Per-animal keyword arguments passed to LongRecordingOrganizer, overriding the global lro_kwargs. Useful when an animal's files require different loading parameters (e.g. {"mode": "si", "extract_func": "read_intan"}). |
| day_parse_kwargs | dict | No | Per-animal keyword arguments for day/date parsing from filenames, overriding the global day_parse_kwargs (e.g. {"date_patterns": [["\\d{6}", "%y%m%d"]]}). |

Any additional keys (beyond those listed above) are passed through to ANIMAL_METADATA and are available for custom downstream processing.
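Because missing required fields only surface once the pipeline runs, it can be worth sanity-checking a samples file up front. The helper below is hypothetical (not part of NeuRodent), but shows the checks implied by the table above:

```python
import json

REQUIRED_FIELDS = ("id", "gene", "sex")

def validate_samples(samples: dict) -> list:
    """Return a list of human-readable problems found in a samples dict."""
    problems = []
    if "data_root" not in samples:
        problems.append("missing top-level data_root")
    seen_ids = set()
    for i, animal in enumerate(samples.get("animals", [])):
        for field in REQUIRED_FIELDS:
            if field not in animal:
                problems.append(f"animals[{i}]: missing required field {field!r}")
        if animal.get("id") in seen_ids:
            problems.append(f"animals[{i}]: duplicate id {animal['id']!r}")
        seen_ids.add(animal.get("id"))
    return problems

samples = json.loads(
    '{"data_root": "/data", "animals": ['
    '{"id": "M1", "gene": "WT", "sex": "M"}, {"id": "M1", "gene": "KO"}]}'
)
validate_samples(samples)
# flags the duplicate id and the missing sex field
```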

Top-Level Config Keys#

In addition to animals, the samples JSON supports these top-level keys:

| Key | Description |
| --- | --- |
| data_root | Required. Root path containing the raw data directories. |
| LR_ALIASES | Mapping of "L"/"R" labels to channel indices (e.g. {"L": ["0","1","2"], "R": ["5","6","7"]}). |
| CHNAME_ALIASES | Mapping of brain-region abbreviations to channel indices (e.g. {"Aud": ["0","5"], "Hip": ["2","7"]}). |
| GENOTYPE_ALIASES | Explicit genotype → animal ID mapping. If omitted, it is auto-generated from each animal's gene field. |
| bad_channels | Legacy top-level bad-channel dict (see Bad Channels). Prefer per-animal bad_channels in the animals list. |
| joint_sessions | Sessions where multiple animals were recorded simultaneously. |
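The auto-generation of GENOTYPE_ALIASES from the gene fields can be pictured as a simple grouping step. This is an illustrative sketch, not the pipeline's exact code:

```python
from collections import defaultdict

def build_genotype_aliases(animals: list) -> dict:
    """Group animal ids by their gene label."""
    aliases = defaultdict(list)
    for animal in animals:
        aliases[animal["gene"]].append(animal["id"])
    return dict(aliases)

animals = [
    {"id": "M1", "gene": "WT", "sex": "M"},
    {"id": "F3", "gene": "KO", "sex": "F"},
    {"id": "M2", "gene": "WT", "sex": "M"},
]
build_genotype_aliases(animals)  # -> {"WT": ["M1", "M2"], "KO": ["F3"]}
```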

Bad Channels#

Bad channels can be specified per animal in two ways.

Channels bad across all sessions — use a list:

{
    "animals": [
        {
            "id": "M1", "gene": "WT", "sex": "M",
            "bad_channels": ["LHip", "RHip"]
        }
    ]
}

This is the simplest approach when the same channels are consistently noisy for a given animal.

Per-session bad channels — use a dict mapping session identifiers to channel lists:

{
    "animals": [
        {
            "id": "M1", "gene": "WT", "sex": "M",
            "bad_channels": {
                "Session1": ["LHip", "RHip"],
                "Session2": ["LHip", "RHip", "LMot"]
            }
        }
    ]
}

You can also combine both approaches: use the list form for channels that are bad across all sessions and per-session entries for channels that are only bad in specific recordings. When both session-wide entries (stored under _all) and per-session entries are present, the pipeline merges them automatically.
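One way to picture that merge is below. The _all key and helper are illustrative assumptions based on the behavior described above, not NeuRodent's actual code:

```python
def resolve_bad_channels(spec, session: str) -> list:
    """Resolve the bad channels for one session from a list-or-dict spec."""
    if isinstance(spec, list):           # list form: bad in every session
        return list(spec)
    merged = list(spec.get("_all", []))  # session-wide entries
    for ch in spec.get(session, []):     # plus this session's entries
        if ch not in merged:
            merged.append(ch)
    return merged

resolve_bad_channels(["LHip", "RHip"], "Session1")
# -> ["LHip", "RHip"]
resolve_bad_channels({"_all": ["LHip"], "Session2": ["RMot"]}, "Session2")
# -> ["LHip", "RMot"]
```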

Manual Datetimes#

Recording timestamps are specified via the manual_datetime field on each animal entry. By default this is the recording start time.

{
    "animals": [
        {
            "id": "M1", "gene": "WT", "sex": "M",
            "manual_datetime": "2025-05-10T10:00:00"
        }
    ]
}

This manual approach avoids the complexity and fragility of automatic datetime parsing from heterogeneous filename formats. The value is parsed by dateutil.parser.parse, so any standard format is accepted:

  • ISO 8601: "2025-05-10T10:00:00" (recommended)

  • Spaced: "2025-05-10 10:00:00"

  • Date only: "2025-05-10" (midnight assumed)

Start vs. end time. By default manual_datetime is treated as the recording start time. To indicate an end time instead, set datetimes_are_start to false on the same animal entry:

{
    "animals": [
        {
            "id": "M1", "gene": "WT", "sex": "M",
            "manual_datetime": "2025-05-10T22:00:00",
            "datetimes_are_start": false
        }
    ]
}

Alternatively, datetimes_are_start can be set inside the lro_kwargs dict on the animal entry or in the global lro_kwargs of the dataset config YAML.
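Conceptually, an end-time entry lets the start be recovered once the recording duration is known. A stdlib sketch (the function name and duration argument are illustrative, and fromisoformat stands in for the more permissive dateutil parser):

```python
from datetime import datetime, timedelta

def resolve_start(manual_datetime: str, datetimes_are_start: bool = True,
                  duration_s: float = 0.0) -> datetime:
    """Derive the recording start from a manually specified datetime."""
    ts = datetime.fromisoformat(manual_datetime)
    # If the manual datetime marks the end, step back by the duration.
    return ts if datetimes_are_start else ts - timedelta(seconds=duration_s)

# A 12-hour recording that ended at 22:00 started at 10:00:
resolve_start("2025-05-10T22:00:00", datetimes_are_start=False,
              duration_s=12 * 3600)  # -> datetime(2025, 5, 10, 10, 0)
```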

Full Example#

A complete samples JSON file with all available parameters:

{
    "data_root": "/mnt/data/project",
    "animals": [
        {"id": "AM3", "gene": "WT", "sex": "Male"},
        {"id": "AM5", "gene": "Het", "sex": "Male",
         "bad_channels": ["LHip", "RHip"]},
        {"id": "AP3B2homo-240-M", "gene": "HOMO", "sex": "Male",
         "pattern": "{data_root}/PortA-*PortB-*/{animal}*_ColMajor_{index}.rhd",
         "manual_datetime": "2025-11-27T15:39:05",
         "lro_kwargs": {"mode": "si"},
         "bad_channels": {
             "Session_Nov27": ["LMot"],
             "Session_Nov28": ["LMot", "RAud"]
         }}
    ],
    "LR_ALIASES": {"L": ["0", "1", "2", "3", "4"],
                   "R": ["5", "6", "7", "8", "9"]},
    "CHNAME_ALIASES": {"Aud": ["0", "5"], "Vis": ["1", "6"],
                       "Hip": ["2", "7"], "Bar": ["3", "8"],
                       "Mot": ["4", "9"]}
}

Step 2: Create Dataset Config#

Create config/datasets/mydata_nwb.yaml:

# My Data - NWB Format

samples:
  samples_file: "config/samples_mydata.json"

analysis:
  war_generation:
    pattern: "{animal}/{session}/{index}.nwb"
    lro_kwargs:
      mode: "si"
      extract_func: "read_nwb_recording"

You can override any config parameter using the same hierarchy as the main config.

Step 3: Use the New Dataset#

NEURODENT_DATASET=mydata_nwb uv run snakemake --profile your-profile

Session-Specific Configuration#

Some datasets contain mixed file formats that require different processing parameters per recording session. Use the overrides.by_session section in your dataset config to override any parameter on a per-session basis.

Example: Dataset with both EDF and RHD formats

# config/datasets/mixed_format.yaml
samples:
  samples_file: "config/samples_mixed.json"

analysis:
  war_generation:
    pattern: "{index}"
    lro_kwargs:
      mode: "si"

overrides:
  by_session:
    "Session_EDF":
      "analysis.war_generation.pattern": "{index}.EDF"
      "analysis.war_generation.lro_kwargs.mode": "mne"
      "analysis.war_generation.lro_kwargs.extract_func": "read_raw_edf"
    "Session_RHD":
      "analysis.war_generation.pattern": "{index}.rhd"
      "analysis.war_generation.lro_kwargs.extract_func": "read_intan"
      "analysis.war_generation.lro_kwargs.stream_id": "0"

How it works:

  • Use dotted paths to specify which parameter to override (e.g., "analysis.war_generation.pattern")

  • Can override any config parameter, not just war_generation settings

  • Overrides are applied via deep merge before session processing

  • Falls back to global config if no override specified for a session

Session-specific settings override the dataset config for that session only. This allows processing mixed formats in a single pipeline run while keeping all animals in unified analysis outputs.
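The dotted-path mechanics can be sketched as follows. This is a simplified illustration of the per-session merge, not the pipeline's exact implementation:

```python
import copy

def apply_session_overrides(config: dict, overrides: dict) -> dict:
    """Apply {"a.b.c": value} dotted-path overrides onto a nested config."""
    cfg = copy.deepcopy(config)
    for dotted, value in overrides.items():
        node = cfg
        *parents, leaf = dotted.split(".")
        for key in parents:
            node = node.setdefault(key, {})  # walk/create intermediate dicts
        node[leaf] = value
    return cfg

base = {"analysis": {"war_generation": {"pattern": "{index}",
                                        "lro_kwargs": {"mode": "si"}}}}
edf = apply_session_overrides(base, {
    "analysis.war_generation.pattern": "{index}.EDF",
    "analysis.war_generation.lro_kwargs.mode": "mne",
})
# base is untouched; edf carries the EDF-specific pattern and mode
```

Because only the named leaves are replaced, sibling settings (here, any other lro_kwargs) fall through from the global config, matching the fallback behavior described above.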

Common Use Cases#

Team Default Dataset#

# Edit config.yaml: active_dataset: "sox5_bin"
git add config/config.yaml
git commit -m "Set default to Sox5"
git push

Personal Dataset Preference#

# Edit config.local.yaml: active_dataset: "ap3b2_nwb"
# This is gitignored - won't affect team

Cluster Batch Script#

#!/bin/bash
#SBATCH --job-name=neurodent
#SBATCH --time=24:00:00

export NEURODENT_DATASET=sox5_bin
uv run snakemake --profile slurm

Parallel Analysis#

# Terminal 1
NEURODENT_DATASET=sox5_bin uv run snakemake --profile slurm

# Terminal 2
NEURODENT_DATASET=ap3b2_nwb uv run snakemake --profile slurm

Troubleshooting#

Dataset Not Found#

Error:

FileNotFoundError: Dataset config file not found:
config/datasets/mydata.yaml

Solutions:

  • Check spelling (case-sensitive)

  • Verify file exists: ls config/datasets/

  • Create the missing dataset config

Wrong Dataset Active#

Check priority order:

echo $NEURODENT_DATASET        # Check env var
grep active_dataset config/config.local.yaml
grep active_dataset config/config.yaml

# Clear env var if needed
unset NEURODENT_DATASET

See also: Snakemake Pipeline Setup for pipeline and SLURM configuration.