Dataset Configuration
=====================

NeuRodent supports multiple datasets and file formats through a flexible configuration system. This guide shows how to select and configure different datasets for your analyses.

Quick Start
-----------

**Switch datasets using environment variable:**

.. code-block:: bash

   # Run with Sox5 binary dataset
   NEURODENT_DATASET=sox5_bin uv run snakemake --profile your-profile

   # Run with AP3B2 NWB dataset
   NEURODENT_DATASET=ap3b2_nwb uv run snakemake --profile your-profile

   # Run with AP3B2 RHD dataset
   NEURODENT_DATASET=ap3b2_rhd uv run snakemake --profile your-profile

How It Works
------------

**File structure:**

.. code-block:: text

   config/
   ├── config.yaml            # Main config (shared settings)
   ├── config.local.yaml      # Local overrides (gitignored)
   ├── datasets/              # Dataset-specific configs
   │   ├── sox5_bin.yaml      # Sox5 project, binary format
   │   ├── ap3b2_nwb.yaml     # AP3B2 project, NWB format
   │   └── ap3b2_rhd.yaml     # AP3B2 project, RHD format
   └── samples*.json          # Sample metadata files

**How dataset selection works:**

1. Main config (``config.yaml``) contains shared settings for all datasets
2. Local overrides (``config.local.yaml``) are merged if the file exists (can set ``active_dataset``)
3. Active dataset is resolved via the ``NEURODENT_DATASET`` env var or the ``active_dataset`` key
4. Dataset config is loaded from ``config/datasets/{active_dataset}.yaml``
5. Dataset config is deep-merged on top of the combined config

**Merge order** (later configs override earlier ones)::

   config.yaml → config.local.yaml → datasets/{active}.yaml

Switching Datasets
------------------

There are three methods to select a dataset, each suited for different use cases.

Method 1: Environment Variable (Recommended for Cluster)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Best for:** switching datasets per job, CI/CD pipelines, cluster batch jobs

.. code-block:: bash

   NEURODENT_DATASET=ap3b2_rhd uv run snakemake --profile your-profile

**Pros:** no file editing, easy to switch, perfect for scripts

**Cons:** must be specified for every command

Method 2: Edit config.yaml
^^^^^^^^^^^^^^^^^^^^^^^^^^

**Best for:** setting a team-wide default dataset

Edit ``config/config.yaml`` and change the ``active_dataset`` line:

.. code-block:: yaml

   # config/config.yaml
   active_dataset: "ap3b2_rhd"  # Change this line

**Pros:** set once, git-tracked

**Cons:** affects all users who pull your changes

Method 3: Local Override
^^^^^^^^^^^^^^^^^^^^^^^^

**Best for:** personal dataset preference

Edit ``config/config.local.yaml``:

.. code-block:: yaml

   # config/config.local.yaml
   active_dataset: "ap3b2_rhd"

**Pros:** local-only (gitignored), doesn't affect the team

**Cons:** easy to forget it is set

Priority Order
--------------

If multiple methods are used, this is the priority:

1. **Environment variable** (``NEURODENT_DATASET``) - **highest priority**
2. **Local config** (``config.local.yaml``)
3. **Main config** (``config.yaml``) - **lowest priority**

Verification
------------

Verify which dataset is active:

.. code-block:: bash

   uv run snakemake --dry-run 2>&1 | head -20

Expected output (shows the dataset-specific overrides applied):

.. code-block:: text

   [ok] Using dataset: sox5_bin
        Config file: config/datasets/sox5_bin.yaml

   Dataset configuration overrides:
     samples:
       samples_file: "config/samples.json"
     analysis:
       war_generation:
         ...

Adding New Datasets
-------------------

Step 1: Create Samples JSON
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The samples JSON file defines the animals in your experiment and their metadata. Create ``config/samples_mydata.json`` using the unified ``animals`` list format:

.. code-block:: json

   {
     "data_root": "/path/to/your/data",
     "animals": [
       {"id": "M1", "gene": "WT", "sex": "M"},
       {"id": "F3", "gene": "KO", "sex": "F"}
     ]
   }
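To catch typos before launching a long pipeline run, a samples file like the one above can be checked with a short standalone script. This is an illustrative sketch, not part of NeuRodent: ``validate_samples`` is a hypothetical helper, and the keys it checks (``data_root``, plus ``id``/``gene``/``sex`` per animal) are the required fields documented in the parameter reference below.

.. code-block:: python

   # Standalone sketch: sanity-check a samples dict before running the pipeline.
   # validate_samples is hypothetical, not a NeuRodent API.
   REQUIRED_ANIMAL_KEYS = {"id", "gene", "sex"}  # required fields per the parameter reference

   def validate_samples(cfg):
       """Return a list of problems found in a samples dict (empty list = looks OK)."""
       problems = []
       if "data_root" not in cfg:
           problems.append("missing required top-level key 'data_root'")
       seen_ids = set()
       for i, animal in enumerate(cfg.get("animals", [])):
           missing = REQUIRED_ANIMAL_KEYS - animal.keys()
           if missing:
               problems.append(f"animals[{i}]: missing {sorted(missing)}")
           if animal.get("id") in seen_ids:
               problems.append(f"animals[{i}]: duplicate id {animal.get('id')!r}")
           seen_ids.add(animal.get("id"))
       return problems

   # The minimal samples dict shown above passes; an incomplete one does not:
   good = {
       "data_root": "/path/to/your/data",
       "animals": [
           {"id": "M1", "gene": "WT", "sex": "M"},
           {"id": "F3", "gene": "KO", "sex": "F"},
       ],
   }
   print(validate_samples(good))  # -> []
   print(validate_samples({"animals": [{"id": "M1"}]}))

In practice you would ``json.load`` the file first; the validation logic is the same either way.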
.. _animals-parameter-reference:

Animals Parameter Reference
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each entry in the ``animals`` list is a dictionary. The following table describes all available parameters:

.. list-table::
   :header-rows: 1
   :widths: 20 15 10 55

   * - Parameter
     - Type
     - Required
     - Description
   * - ``id``
     - string
     - Yes
     - Unique identifier for the animal (e.g. ``"M1"``, ``"AP3B2homo-240-M"``). Used as the primary key throughout the pipeline.
   * - ``gene``
     - string
     - Yes
     - Genotype label (e.g. ``"WT"``, ``"KO"``, ``"Het"``). Used to auto-generate ``GENOTYPE_ALIASES`` and for downstream grouping.
   * - ``sex``
     - string
     - Yes
     - Sex of the animal (e.g. ``"M"``, ``"F"``, ``"Male"``, ``"Female"``).
   * - ``manual_datetime``
     - string
     - No
     - Recording **start** datetime. Accepts any standard datetime format including ISO 8601 (e.g. ``"2025-05-10T10:00:00"``) and ``"YYYY-MM-DD HH:MM:SS"``. Use this when automatic datetime parsing from filenames is unreliable or when files lack embedded timestamps. See :ref:`manual-datetimes`.
   * - ``datetimes_are_start``
     - bool
     - No
     - Whether ``manual_datetime`` represents the recording **start** time (``true``, default) or **end** time (``false``). See :ref:`manual-datetimes`.
   * - ``bad_channels``
     - list or dict
     - No
     - Channels to exclude from analysis. Accepts two formats:

       * **List** — channels bad across *all* sessions: ``["LHip", "RHip"]``
       * **Dict** — per-session bad channels: ``{"Session1": ["LHip"], "Session2": ["RMot"]}``

       See :ref:`bad-channels` for details.
   * - ``pattern``
     - string
     - No
     - Per-animal file discovery pattern, overriding the global ``pattern`` in the dataset config. Supports ``{data_root}``, ``{animal}``, and ``{index}`` placeholders (e.g. ``"{data_root}/custom/{animal}_{index}.rhd"``).
   * - ``lro_kwargs``
     - dict
     - No
     - Per-animal keyword arguments passed to ``LongRecordingOrganizer``, overriding the global ``lro_kwargs``. Useful when an animal's files require different loading parameters (e.g. ``{"mode": "si", "extract_func": "read_intan"}``).
   * - ``day_parse_kwargs``
     - dict
     - No
     - Per-animal keyword arguments for day/date parsing from filenames, overriding the global ``day_parse_kwargs`` (e.g. ``{"date_patterns": [["\\d{6}", "%y%m%d"]]}``).

Any additional keys (beyond those listed above) are passed through to ``ANIMAL_METADATA`` and are available for custom downstream processing.

Top-Level Config Keys
~~~~~~~~~~~~~~~~~~~~~

In addition to ``animals``, the samples JSON supports these top-level keys:

.. list-table::
   :header-rows: 1
   :widths: 25 55

   * - Key
     - Description
   * - ``data_root``
     - **Required.** Root path containing the raw data directories.
   * - ``LR_ALIASES``
     - Mapping of ``"L"``/``"R"`` labels to channel indices (e.g. ``{"L": ["0","1","2"], "R": ["5","6","7"]}``).
   * - ``CHNAME_ALIASES``
     - Mapping of brain-region abbreviations to channel indices (e.g. ``{"Aud": ["0","5"], "Hip": ["2","7"]}``).
   * - ``GENOTYPE_ALIASES``
     - Explicit genotype → animal ID mapping. If omitted, it is auto-generated from each animal's ``gene`` field.
   * - ``bad_channels``
     - Legacy top-level bad-channel dict (see :ref:`bad-channels`). Prefer per-animal ``bad_channels`` in the ``animals`` list.
   * - ``joint_sessions``
     - Sessions where multiple animals were recorded simultaneously.

.. _bad-channels:

Bad Channels
~~~~~~~~~~~~

Bad channels can be specified per animal in two ways.

**Channels bad across all sessions** — use a list:

.. code-block:: json

   {
     "animals": [
       {
         "id": "M1",
         "gene": "WT",
         "sex": "M",
         "bad_channels": ["LHip", "RHip"]
       }
     ]
   }

This is the simplest approach when the same channels are consistently noisy for a given animal.

**Per-session bad channels** — use a dict mapping session identifiers to channel lists:
.. code-block:: json

   {
     "animals": [
       {
         "id": "M1",
         "gene": "WT",
         "sex": "M",
         "bad_channels": {
           "Session1": ["LHip", "RHip"],
           "Session2": ["LHip", "RHip", "LMot"]
         }
       }
     ]
   }

You can also combine both approaches: use the list for channels that are broadly bad across sessions and add per-session entries for channels that are only bad in specific recordings. When both ``_all`` (from a list) and per-session entries are present, the pipeline merges them automatically.

.. _manual-datetimes:

Manual Datetimes
~~~~~~~~~~~~~~~~

Recording timestamps are specified via the ``manual_datetime`` field on each animal entry. **By default this is the recording start time.**

.. code-block:: json

   {
     "animals": [
       {
         "id": "M1",
         "gene": "WT",
         "sex": "M",
         "manual_datetime": "2025-05-10T10:00:00"
       }
     ]
   }

This manual approach avoids the complexity and fragility of automatic datetime parsing from heterogeneous filename formats. The value is parsed by ``dateutil.parser.parse``, so any standard format is accepted:

* ISO 8601: ``"2025-05-10T10:00:00"`` (recommended)
* Spaced: ``"2025-05-10 10:00:00"``
* Date only: ``"2025-05-10"`` (midnight assumed)

**Start vs. end time.** By default ``manual_datetime`` is treated as the recording **start** time. To indicate an **end** time instead, set ``datetimes_are_start`` to ``false`` on the same animal entry:

.. code-block:: json

   {
     "animals": [
       {
         "id": "M1",
         "gene": "WT",
         "sex": "M",
         "manual_datetime": "2025-05-10T22:00:00",
         "datetimes_are_start": false
       }
     ]
   }

Alternatively, ``datetimes_are_start`` can be set inside the ``lro_kwargs`` dict on the animal entry or in the global ``lro_kwargs`` of the dataset config YAML.

Full Example
~~~~~~~~~~~~

A complete samples JSON file with all available parameters:

.. code-block:: json

   {
     "data_root": "/mnt/data/project",
     "animals": [
       {"id": "AM3", "gene": "WT", "sex": "Male"},
       {"id": "AM5", "gene": "Het", "sex": "Male",
        "bad_channels": ["LHip", "RHip"]},
       {"id": "AP3B2homo-240-M", "gene": "HOMO", "sex": "Male",
        "pattern": "{data_root}/PortA-*PortB-*/{animal}*_ColMajor_{index}.rhd",
        "manual_datetime": "2025-11-27T15:39:05",
        "lro_kwargs": {"mode": "si"},
        "bad_channels": {
          "Session_Nov27": ["LMot"],
          "Session_Nov28": ["LMot", "RAud"]
        }}
     ],
     "LR_ALIASES": {"L": ["0", "1", "2", "3", "4"], "R": ["5", "6", "7", "8", "9"]},
     "CHNAME_ALIASES": {"Aud": ["0", "5"], "Vis": ["1", "6"], "Hip": ["2", "7"],
                        "Bar": ["3", "8"], "Mot": ["4", "9"]}
   }

Step 2: Create Dataset Config
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Create ``config/datasets/mydata_nwb.yaml``:

.. code-block:: yaml

   # My Data - NWB Format
   samples:
     samples_file: "config/samples_mydata.json"

   analysis:
     war_generation:
       pattern: "{animal}/{session}/{index}.nwb"
       lro_kwargs:
         mode: "si"
         extract_func: "read_nwb_recording"

You can override **any** config parameter using the same hierarchy as the main config.

Step 3: Use the New Dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   NEURODENT_DATASET=mydata_nwb uv run snakemake --profile your-profile

Session-Specific Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Some datasets contain mixed file formats that require different processing parameters per recording session. Use the ``overrides.by_session`` section in your dataset config to override any parameter on a per-session basis.

**Example:** Dataset with both EDF and RHD formats
.. code-block:: yaml

   # config/datasets/mixed_format.yaml
   samples:
     samples_file: "config/samples_mixed.json"

   analysis:
     war_generation:
       pattern: "{index}"
       lro_kwargs:
         mode: "si"

   overrides:
     by_session:
       "Session_EDF":
         "analysis.war_generation.pattern": "{index}.EDF"
         "analysis.war_generation.lro_kwargs.mode": "mne"
         "analysis.war_generation.lro_kwargs.extract_func": "read_raw_edf"
       "Session_RHD":
         "analysis.war_generation.pattern": "{index}.rhd"
         "analysis.war_generation.lro_kwargs.extract_func": "read_intan"
         "analysis.war_generation.lro_kwargs.stream_id": "0"

**How it works:**

- Use dotted paths to specify which parameter to override (e.g. ``"analysis.war_generation.pattern"``)
- Can override **any** config parameter, not just ``war_generation`` settings
- Overrides are applied via deep merge before session processing
- Falls back to the global config if no override is specified for a session

Session-specific settings override the dataset config for that session only. This allows processing mixed formats in a single pipeline run while keeping all animals in unified analysis outputs.

Common Use Cases
----------------

Team Default Dataset
^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Edit config.yaml: active_dataset: "sox5_bin"
   git add config/config.yaml
   git commit -m "Set default to Sox5"
   git push

Personal Dataset Preference
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Edit config.local.yaml: active_dataset: "ap3b2_nwb"
   # This is gitignored - won't affect the team

Cluster Batch Script
^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=neurodent
   #SBATCH --time=24:00:00

   export NEURODENT_DATASET=sox5_bin
   uv run snakemake --profile slurm

Parallel Analysis
^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Terminal 1
   NEURODENT_DATASET=sox5_bin uv run snakemake --profile slurm

   # Terminal 2
   NEURODENT_DATASET=ap3b2_nwb uv run snakemake --profile slurm

Troubleshooting
---------------

Dataset Not Found
^^^^^^^^^^^^^^^^^

**Error:**

.. code-block:: text

   FileNotFoundError: Dataset config file not found: config/datasets/mydata.yaml

**Solutions:**

- Check spelling (dataset names are case-sensitive)
- Verify the file exists: ``ls config/datasets/``
- Create the missing dataset config

Wrong Dataset Active
^^^^^^^^^^^^^^^^^^^^

**Check priority order:**

.. code-block:: bash

   echo $NEURODENT_DATASET                        # Check env var
   grep active_dataset config/config.local.yaml
   grep active_dataset config/config.yaml

   # Clear env var if needed
   unset NEURODENT_DATASET

See also: :doc:`snakemake_setup` for pipeline setup and SLURM configuration.
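The priority order those checks walk through (environment variable, then ``config.local.yaml``, then ``config.yaml``) can also be sketched in a few lines of Python. This is an illustrative sketch only; ``resolve_active_dataset`` is a hypothetical helper, not a NeuRodent API, and it takes plain dicts rather than reading files so the precedence is easy to see.

.. code-block:: python

   # Sketch of the documented precedence: env var > config.local.yaml > config.yaml.
   # resolve_active_dataset is hypothetical, shown only to illustrate the rule.
   def resolve_active_dataset(env, local_cfg, main_cfg):
       """Return the active dataset name, honoring the documented priority order."""
       return (
           env.get("NEURODENT_DATASET")
           or local_cfg.get("active_dataset")
           or main_cfg.get("active_dataset")
       )

   # The environment variable wins even when both config files set a dataset:
   print(resolve_active_dataset(
       {"NEURODENT_DATASET": "ap3b2_rhd"},
       {"active_dataset": "ap3b2_nwb"},
       {"active_dataset": "sox5_bin"},
   ))  # -> ap3b2_rhd

   # With no env var, config.local.yaml takes precedence over config.yaml:
   print(resolve_active_dataset(
       {},
       {"active_dataset": "ap3b2_nwb"},
       {"active_dataset": "sox5_bin"},
   ))  # -> ap3b2_nwb

If a run picks up an unexpected dataset, comparing these three sources in this order usually locates the culprit.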