Dataset Configuration#
NeuRodent supports multiple datasets and file formats through a flexible configuration system. This guide shows how to select and configure different datasets for your analyses.
Quick Start#
Switch datasets using the NEURODENT_DATASET environment variable:
# Run with Sox5 binary dataset
NEURODENT_DATASET=sox5_bin uv run snakemake --profile your-profile
# Run with AP3B2 NWB dataset
NEURODENT_DATASET=ap3b2_nwb uv run snakemake --profile your-profile
# Run with AP3B2 RHD dataset
NEURODENT_DATASET=ap3b2_rhd uv run snakemake --profile your-profile
How It Works#
File structure:
config/
├── config.yaml # Main config (shared settings)
├── config.local.yaml # Local overrides (gitignored)
├── datasets/ # Dataset-specific configs
│ ├── sox5_bin.yaml # Sox5 project, binary format
│ ├── ap3b2_nwb.yaml # AP3B2 project, NWB format
│ └── ap3b2_rhd.yaml # AP3B2 project, RHD format
└── samples*.json # Sample metadata files
How dataset selection works:
1. The main config (`config.yaml`) contains shared settings for all datasets.
2. Local overrides (`config.local.yaml`) are merged if the file exists (can set `active_dataset`).
3. The active dataset is resolved via the `NEURODENT_DATASET` env var or the `active_dataset` key.
4. The dataset config is loaded from `config/datasets/{active_dataset}.yaml`.
5. The dataset config is deep-merged on top of the combined config.
Merge order (later configs override earlier ones):
config.yaml → config.local.yaml → datasets/{active}.yaml
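This deep merge can be sketched in Python. The helper below is illustrative rather than NeuRodent's actual implementation, and the config values are made up:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# config.yaml -> config.local.yaml -> datasets/{active}.yaml
main = {"active_dataset": "sox5_bin",
        "analysis": {"war_generation": {"mode": "bin"}}}
local = {"active_dataset": "ap3b2_nwb"}
dataset = {"analysis": {"war_generation": {"mode": "si"}}}

config = deep_merge(deep_merge(main, local), dataset)
print(config["active_dataset"])                      # ap3b2_nwb (local wins)
print(config["analysis"]["war_generation"]["mode"])  # si (dataset config wins)
```

Note that nested dicts are merged key by key, so a dataset config only needs to specify the keys it changes.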
Switching Datasets#
There are three methods to select a dataset, each suited for different use cases.
Method 1: Environment Variable (Recommended for Cluster)#
Best for: Switching datasets per job, CI/CD pipelines, cluster batch jobs
NEURODENT_DATASET=ap3b2_rhd uv run snakemake --profile your-profile
Pros: No file editing, easy to switch, perfect for scripts
Cons: Must specify for every command
Method 2: Edit config.yaml#
Best for: Setting team-wide default dataset
Edit config/config.yaml and change the active_dataset line:
# config/config.yaml
active_dataset: "ap3b2_rhd" # Change this line
Pros: Set once, git-tracked
Cons: Affects all users who pull your changes
Method 3: Local Override#
Best for: Personal dataset preference
Edit config/config.local.yaml:
# config/config.local.yaml
active_dataset: "ap3b2_rhd"
Pros: Local-only (gitignored), doesn’t affect team
Cons: Can be forgotten
Priority Order#
If multiple methods are used, this is the priority:
1. Environment variable (`NEURODENT_DATASET`) - highest priority
2. Local config (`config.local.yaml`)
3. Main config (`config.yaml`) - lowest priority
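The resolution order can be sketched as a simple fallback chain. The function name is hypothetical; only the priority order comes from this guide:

```python
import os

def resolve_active_dataset(main_cfg: dict, local_cfg: dict) -> str:
    """Env var beats config.local.yaml, which beats config.yaml (sketch)."""
    env = os.environ.get("NEURODENT_DATASET")
    if env:
        return env
    return local_cfg.get("active_dataset") or main_cfg["active_dataset"]

os.environ["NEURODENT_DATASET"] = "ap3b2_rhd"
print(resolve_active_dataset({"active_dataset": "sox5_bin"},
                             {"active_dataset": "ap3b2_nwb"}))  # ap3b2_rhd
```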
Verification#
Verify which dataset is active:
uv run snakemake --dry-run 2>&1 | head -20
Expected output (shows the dataset-specific overrides applied):
[ok] Using dataset: sox5_bin
Config file: config/datasets/sox5_bin.yaml
Dataset configuration overrides:
samples:
samples_file: "config/samples.json"
analysis:
war_generation:
...
Adding New Datasets#
Step 1: Create Samples JSON#
The samples JSON file defines the animals in your experiment and their metadata.
Create config/samples_mydata.json using the unified animals list format:
{
"data_root": "/path/to/your/data",
"animals": [
{"id": "M1", "gene": "WT", "sex": "M"},
{"id": "F3", "gene": "KO", "sex": "F"}
]
}
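Before pointing the pipeline at a new samples file, a quick sanity check can catch missing required keys. This is a hypothetical helper, not part of NeuRodent; it only checks the fields documented in this guide:

```python
import json

REQUIRED = {"id", "gene", "sex"}  # required per-animal keys from this guide

def check_samples(path: str) -> list[dict]:
    """Load a samples JSON and verify required top-level and animal keys."""
    with open(path) as f:
        samples = json.load(f)
    assert "data_root" in samples, "missing required top-level key: data_root"
    for animal in samples["animals"]:
        missing = REQUIRED - animal.keys()
        assert not missing, f"animal {animal.get('id')} missing {missing}"
    return samples["animals"]
```

Running it on `config/samples_mydata.json` before a dry run is cheaper than debugging a failed pipeline job.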
Animals Parameter Reference#
Each entry in the `animals` list is a dictionary. The following table describes all available parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| `id` | string | Yes | Unique identifier for the animal (e.g. `"M1"`). |
| `gene` | string | Yes | Genotype label (e.g. `"WT"`, `"KO"`). |
| `sex` | string | Yes | Sex of the animal (e.g. `"M"`, `"F"`). |
| `manual_datetime` | string | No | Recording start datetime. Accepts any standard datetime format, including ISO 8601 (e.g. `"2025-05-10T10:00:00"`). |
| `datetimes_are_start` | bool | No | Whether `manual_datetime` is the recording start time (`true`, the default) or the end time (`false`). |
| `bad_channels` | list or dict | No | Channels to exclude from analysis. Accepts two formats: a list (bad in all sessions) or a per-session dict. See Bad Channels for details. |
| `pattern` | string | No | Per-animal file discovery pattern, overriding the global pattern. |
| `lro_kwargs` | dict | No | Per-animal keyword arguments, overriding the global `lro_kwargs` of the dataset config. |
|  | dict | No | Per-animal keyword arguments for day/date parsing from filenames, overriding the global setting. |
Any additional keys (beyond those listed above) are passed through to
ANIMAL_METADATA and are available for custom downstream processing.
Top-Level Config Keys#
In addition to `animals`, the samples JSON supports these top-level keys:

| Key | Description |
|---|---|
| `data_root` | Required. Root path containing the raw data directories. |
| `LR_ALIASES` | Mapping of left/right hemisphere labels to channel indices (e.g. `{"L": ["0", "1"], "R": ["5", "6"]}`). |
| `CHNAME_ALIASES` | Mapping of brain-region abbreviations to channel indices (e.g. `{"Hip": ["2", "7"]}`). |
|  | Explicit genotype → animal ID mapping. If omitted, it is auto-generated from each animal's `gene`. |
|  | Legacy top-level bad-channel dict (see Bad Channels). Prefer per-animal `bad_channels`. |
|  | Sessions where multiple animals were recorded simultaneously. |
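The auto-generated genotype → animal ID mapping is just a grouping of IDs by each animal's `gene` label; a sketch (the helper name is hypothetical):

```python
from collections import defaultdict

def genotype_mapping(animals: list[dict]) -> dict[str, list[str]]:
    """Group animal IDs by their 'gene' label (sketch of the auto-generation)."""
    mapping: dict[str, list[str]] = defaultdict(list)
    for animal in animals:
        mapping[animal["gene"]].append(animal["id"])
    return dict(mapping)

animals = [
    {"id": "M1", "gene": "WT", "sex": "M"},
    {"id": "F3", "gene": "KO", "sex": "F"},
    {"id": "M2", "gene": "WT", "sex": "M"},
]
print(genotype_mapping(animals))  # {'WT': ['M1', 'M2'], 'KO': ['F3']}
```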
Bad Channels#
Bad channels can be specified per animal in two ways.
Channels bad across all sessions — use a list:
{
"animals": [
{
"id": "M1", "gene": "WT", "sex": "M",
"bad_channels": ["LHip", "RHip"]
}
]
}
This is the simplest approach when the same channels are consistently noisy for a given animal.
Per-session bad channels — use a dict mapping session identifiers to channel lists:
{
"animals": [
{
"id": "M1", "gene": "WT", "sex": "M",
"bad_channels": {
"Session1": ["LHip", "RHip"],
"Session2": ["LHip", "RHip", "LMot"]
}
}
]
}
You can also combine both approaches: use the list for channels that are broadly bad across sessions and per-session entries for channels that are only bad in specific recordings. When both `_all` entries (produced from the list form) and per-session entries are present, the pipeline merges them automatically.
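The merge described above can be sketched as follows. This is an illustrative helper under the assumptions stated in this guide (list form applies everywhere; dict form may mix an `"_all"` key with per-session keys), not NeuRodent's actual code:

```python
def session_bad_channels(bad_channels, session: str) -> list[str]:
    """Resolve the bad-channel list for one session (sketch)."""
    if isinstance(bad_channels, list):
        # A plain list applies to every session
        return list(bad_channels)
    always_bad = bad_channels.get("_all", [])
    session_bad = bad_channels.get(session, [])
    # Merge, preserving order while dropping duplicates
    return list(dict.fromkeys([*always_bad, *session_bad]))

spec = {"_all": ["LHip"], "Session2": ["LHip", "LMot"]}
print(session_bad_channels(spec, "Session2"))  # ['LHip', 'LMot']
print(session_bad_channels(spec, "Session1"))  # ['LHip']
```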
Manual Datetimes#
Recording timestamps are specified via the manual_datetime field on
each animal entry. By default this is the recording start time.
{
"animals": [
{
"id": "M1", "gene": "WT", "sex": "M",
"manual_datetime": "2025-05-10T10:00:00"
}
]
}
This manual approach avoids the complexity and fragility of automatic
datetime parsing from heterogeneous filename formats. The value is parsed
by dateutil.parser.parse, so any standard format is accepted:
- ISO 8601: `"2025-05-10T10:00:00"` (recommended)
- Spaced: `"2025-05-10 10:00:00"`
- Date only: `"2025-05-10"` (midnight assumed)
Start vs. end time. By default manual_datetime is treated as the
recording start time. To indicate an end time instead, set
datetimes_are_start to false on the same animal entry:
{
"animals": [
{
"id": "M1", "gene": "WT", "sex": "M",
"manual_datetime": "2025-05-10T22:00:00",
"datetimes_are_start": false
}
]
}
Alternatively, datetimes_are_start can be set inside the lro_kwargs
dict on the animal entry or in the global lro_kwargs of the dataset
config YAML.
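The start/end semantics can be sketched as below. NeuRodent parses with `dateutil.parser.parse`; this sketch uses the stdlib `fromisoformat` (which covers the ISO case only) and assumes a hypothetical known recording duration:

```python
from datetime import datetime, timedelta

def recording_start(manual_datetime: str, datetimes_are_start: bool,
                    duration: timedelta) -> datetime:
    """Derive the recording start time from manual_datetime (sketch)."""
    t = datetime.fromisoformat(manual_datetime)
    # If the timestamp marks the end, subtract the recording duration
    return t if datetimes_are_start else t - duration

# End-time entry from the example above, assuming a 12-hour recording
start = recording_start("2025-05-10T22:00:00", False, timedelta(hours=12))
print(start.isoformat())  # 2025-05-10T10:00:00
```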
Full Example#
A complete samples JSON file with all available parameters:
{
"data_root": "/mnt/data/project",
"animals": [
{"id": "AM3", "gene": "WT", "sex": "Male"},
{"id": "AM5", "gene": "Het", "sex": "Male",
"bad_channels": ["LHip", "RHip"]},
{"id": "AP3B2homo-240-M", "gene": "HOMO", "sex": "Male",
"pattern": "{data_root}/PortA-*PortB-*/{animal}*_ColMajor_{index}.rhd",
"manual_datetime": "2025-11-27T15:39:05",
"lro_kwargs": {"mode": "si"},
"bad_channels": {
"Session_Nov27": ["LMot"],
"Session_Nov28": ["LMot", "RAud"]
}}
],
"LR_ALIASES": {"L": ["0", "1", "2", "3", "4"],
"R": ["5", "6", "7", "8", "9"]},
"CHNAME_ALIASES": {"Aud": ["0", "5"], "Vis": ["1", "6"],
"Hip": ["2", "7"], "Bar": ["3", "8"],
"Mot": ["4", "9"]}
}
Step 2: Create Dataset Config#
Create config/datasets/mydata_nwb.yaml:
# My Data - NWB Format
samples:
samples_file: "config/samples_mydata.json"
analysis:
war_generation:
pattern: "{animal}/{session}/{index}.nwb"
lro_kwargs:
mode: "si"
extract_func: "read_nwb_recording"
You can override any config parameter using the same hierarchy as the main config.
Step 3: Use the New Dataset#
NEURODENT_DATASET=mydata_nwb uv run snakemake --profile your-profile
Session-Specific Configuration#
Some datasets contain mixed file formats that require different processing parameters per recording session. Use the overrides.by_session section in your dataset config to override any parameter on a per-session basis.
Example: Dataset with both EDF and RHD formats
# config/datasets/mixed_format.yaml
samples:
samples_file: "config/samples_mixed.json"
analysis:
war_generation:
pattern: "{index}"
lro_kwargs:
mode: "si"
overrides:
by_session:
"Session_EDF":
"analysis.war_generation.pattern": "{index}.EDF"
"analysis.war_generation.lro_kwargs.mode": "mne"
"analysis.war_generation.lro_kwargs.extract_func": "read_raw_edf"
"Session_RHD":
"analysis.war_generation.pattern": "{index}.rhd"
"analysis.war_generation.lro_kwargs.extract_func": "read_intan"
"analysis.war_generation.lro_kwargs.stream_id": "0"
How it works:
- Use dotted paths to specify which parameter to override (e.g., `"analysis.war_generation.pattern"`)
- Any config parameter can be overridden, not just war_generation settings
- Overrides are applied via deep merge before session processing
- Sessions without an override fall back to the global config
Session-specific settings override the dataset config for that session only. This allows processing mixed formats in a single pipeline run while keeping all animals in unified analysis outputs.
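Applying a dotted-path override amounts to walking the nested config and setting the leaf key. A sketch with a hypothetical helper name, using the override values from the EDF example above:

```python
import copy

def apply_session_overrides(config: dict, overrides: dict) -> dict:
    """Apply {'a.b.c': value} overrides onto a nested config dict (sketch)."""
    merged = copy.deepcopy(config)  # leave the global config untouched
    for dotted, value in overrides.items():
        node = merged
        *parents, leaf = dotted.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return merged

base = {"analysis": {"war_generation": {"pattern": "{index}",
                                        "lro_kwargs": {"mode": "si"}}}}
edf = {"analysis.war_generation.pattern": "{index}.EDF",
       "analysis.war_generation.lro_kwargs.mode": "mne"}

cfg = apply_session_overrides(base, edf)
print(cfg["analysis"]["war_generation"]["pattern"])             # {index}.EDF
print(cfg["analysis"]["war_generation"]["lro_kwargs"]["mode"])  # mne
print(base["analysis"]["war_generation"]["pattern"])            # {index}
```

The deep copy matters: each session gets its own merged view while the global config stays unchanged for sessions without overrides.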
Common Use Cases#
Team Default Dataset#
# Edit config.yaml: active_dataset: "sox5_bin"
git add config/config.yaml
git commit -m "Set default to Sox5"
git push
Personal Dataset Preference#
# Edit config.local.yaml: active_dataset: "ap3b2_nwb"
# This is gitignored - won't affect team
Cluster Batch Script#
#!/bin/bash
#SBATCH --job-name=neurodent
#SBATCH --time=24:00:00
export NEURODENT_DATASET=sox5_bin
uv run snakemake --profile slurm
Parallel Analysis#
# Terminal 1
NEURODENT_DATASET=sox5_bin uv run snakemake --profile slurm
# Terminal 2
NEURODENT_DATASET=ap3b2_nwb uv run snakemake --profile slurm
Troubleshooting#
Dataset Not Found#
Error:
FileNotFoundError: Dataset config file not found:
config/datasets/mydata.yaml
Solutions:
- Check spelling (case-sensitive)
- Verify the file exists: `ls config/datasets/`
- Create the missing dataset config
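Equivalently, valid values for `NEURODENT_DATASET` are the YAML file stems under `config/datasets/`. A small sketch (the helper is hypothetical; the directory argument lets it run anywhere):

```python
from pathlib import Path

def available_datasets(config_dir: str = "config/datasets") -> list[str]:
    """List dataset names that NEURODENT_DATASET can be set to (sketch)."""
    return sorted(p.stem for p in Path(config_dir).glob("*.yaml"))
```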
Wrong Dataset Active#
Check priority order:
echo $NEURODENT_DATASET # Check env var
grep active_dataset config/config.local.yaml
grep active_dataset config/config.yaml
# Clear env var if needed
unset NEURODENT_DATASET
See also: Snakemake Pipeline Setup for pipeline setup and SLURM configuration.