Tutorial#

Getting started#

In this example we will make a workflow to smooth bold scans from a bids dataset.

We will start by creating a simple rule, then make this more generalizable in each step. To begin with, this is the command we are using to smooth a bold scan.

fslmaths ../bids/sub-001/func/sub-001_task-rest_run-1_bold.nii.gz -s 2.12 results/sub-001/func/sub-001_task-rest_run-1_fwhm-5mm_bold.nii.gz

This command performs smoothing with a sigma=2.12 Gaussian kernel (equivalent to 5mm FWHM, with sigma=fwhm/2.355), and saves the smoothed file as results/sub-001/func/sub-001_task-rest_run-1_fwhm-5mm_bold.nii.gz.

Installation#

Start by making a new directory:

$ mkdir snakebids-tutorial
$ cd snakebids-tutorial

Check your python version to make sure you have at least version 3.7 or higher:

$ python --version
Python 3.10.0

Make a new virtual environment:

$ python -m venv .venv
$ source .venv/bin/activate

And use pip to install snakebids:

$ pip install snakebids

In our example, we’ll be using the fslmaths tool from FSL. If you want to actually run the workflow, you’ll need to have FSL installed. This is not actually necessary to follow along the tutorial however, as we can use “dry runs” to see what snakemake would do if FSL were installed.

Getting the dataset#

We will be running the tutorial on a test dataset consisting only of empty files. We won’t actually be able to run our workflow on it (fslmaths will fail), but as mentioned above, we can use dry runs to see would would normally happen.

If you wish to follow along using the same dataset, currently the easiest way is to start by cloning snakebids:

$ git clone https://github.com/khanlab/snakebids.git

Then copy the following directory:

$ cp -r snakebids/docs/tutorial/bids ./data

It’s also perfectly possible (and probably better!) to try the tutorial on your own dataset. Just adjust any paths below so that they match your data!

Part I: Snakemake#

Step 0: a basic non-generic workflow#

In this rule, we start by creating a rule that is effectively hard-coding the paths for input and output to re-create the command as above.

In this rule we have an input: section for input files, a params: section for non-file parameters, and an output: section for output files. The shell section is used to build the shell command, and can refer to the input, output or params using curly braces. Note that any of these can also be named inputs, but we have only used this for the sigma parameter in this case.

Snakefile#
1rule smooth:
2    input:
3        'data/sub-001/func/sub-001_task-rest_run-1_bold.nii.gz'
4    params:
5        sigma = '2.12'
6    output:
7        'results/sub-001/func/sub-001_task-rest_run-1_fwhm-5mm_bold.nii.gz'
8    shell:
9        'fslmaths {input} -s {params.sigma} {output}'

With this rule in our Snakefile, we can then run snakemake -np to execute a dry-run of the workflow. Here the -n specifies dry-run, and the -p prints any shell commands that are to be executed.

When we invoke snakemake, it uses the first rule in the snakefile as the target rule. The target rule is what snakemake uses as a starting point to determine what other rules need to be run in order to generate the inputs. We’ll learn a bit more about that in the next step.

So far, we just have a fancy way of specifying the exact same command we started with, so there is no added benefit (yet). But we will soon add to this rule to make it more generalizable.

Step 1: adding wildcards#

First step to make the workflow generalizeable is to replace the hard-coded identifiers (e.g. the subject, task and run) with wildcards.

In the Snakefile, we can replace sub-001 with sub-{subject}, and so forth for task and run. Now the rule is generic for any subject, task, or run.

Snakefile#
1rule smooth:
2    input:
3        'data/sub-{subject}/func/sub-{subject}_task-{task}_run-{run}_bold.nii.gz'
4    params:
5        sigma = '2.12'
6    output:
7        'results/sub-{subject}/func/sub-{subject}_task-{task}_run-{run}_fwhm-5mm_bold.nii.gz'
8    shell:
9        'fslmaths {input} -s {params.sigma} {output}'

However, if we try to execute (dry-run) the workflow as before, we get an error. This is because the target rule now has wildcards in it. So snakemake is unable to determine what rules need to be run to generate the inputs, since the wildcards can take any value.

So for the time being, we will make use of the snakemake command-line argument to specify targets, and specify the file we want generated from the command-line, by running:

$ snakemake -np results/sub-001/func/sub-001_task-rest_run-1_fwhm-5mm_bold.nii.gz

We can now even try running this for another subject by changing the target file.

$ snakemake -np results/sub-002/func/sub-002_task-rest_run-1_fwhm-5mm_bold.nii.gz

Try using a subject that doesn’t exist in our bids dataset, what happens?

Now, try changing the output smoothing value, e.g. fwhm-10mm, and see what happens. As expected the command still uses a smoothing value of 2.12, since that has been hard-coded, but we will see how to rectify this in the next step.

Step 2: adding a params function#

As we noted, the sigma parameter needs to be computed from the FWHM. We can use a function to do this. Functions can be used for any input or params, and must take wildcards as an input argument, which provides a mechanism to pass the wildcards (determined from the output file) to the function.

We can thus define a simple function that returns a string representing FWHM/2.355 as follows:

Snakefile#
1def calc_sigma_from_fwhm(wildcards):
2    return f'{float(wildcards.fwhm)/2.355:0.2f}'

Note 1: We now have to make the fwhm in the output filename a wildcard, so that it can be passed to the function (via the wildcards object).

Note 2: We have to convert the fwhm to float, since all wildcards are always strings (since they are parsed from the output filename).

Once we have this function, we can replace the hardcoded 2.12 with the name of the function:

Snakefile#
6        'data/sub-{subject}/func/sub-{subject}_task-{task}_run-{run}_bold.nii.gz'
7    params:
8        sigma = calc_sigma_from_fwhm
9    output:

Here is the full Snakefile:

Snakefile#
 1def calc_sigma_from_fwhm(wildcards):
 2    return f'{float(wildcards.fwhm)/2.355:0.2f}'
 3
 4rule smooth:
 5    input:
 6        'data/sub-{subject}/func/sub-{subject}_task-{task}_run-{run}_bold.nii.gz'
 7    params:
 8        sigma = calc_sigma_from_fwhm
 9    output:
10        'results/sub-{subject}/func/sub-{subject}_task-{task}_run-{run}_fwhm-{fwhm}mm_bold.nii.gz'
11    shell:
12        'fslmaths {input} -s {params.sigma} {output}'

Now try running the workflow again, with fwhm-5 as well as fwhm-10.

Step 3: adding a target rule#

Now we have a generic rule, but it is pretty tedious to have to type out the filename of each target from the command-line in order to use it.

This is where target rules come in. If you recall from earlier, the first rule in a workflow is interpreted as the target rule, so we just need to add a dummy rule to the Snakefile that has all the target files as inputs. It is a dummy rule since it doesn’t have any outputs or any command to run itself, but snakemake will take these input files, and determine if any other rules in the workflow can generate them (considering any wildcards too).

In this case, we have a BIDS dataset with two runs (run-1, run-2), and suppose we wanted to compute smoothing with several different FWHM kernels (5,10,15,20). We can thus make a target rule that has all these resulting filenames as inputs.

A very useful function in snakemake is expand(). It is a way to perform array expansion to create lists of strings (input filenames).

Snakefile#
 1rule all:
 2    input:
 3        expand(
 4            'results/sub-{subject}/func/sub-{subject}_task-{task}_run-{run}_fwhm-{fwhm}mm_bold.nii.gz',
 5            subject='001',
 6            task='rest',
 7            run=[1,2],
 8            fwhm=[5,10,15,20]
 9        )
10
11

Now, we don’t need to specify any targets from the command-line, and can just run:

$ snakemake -np

The entire Snakefile for reference is:

Snakefile#
 1rule all:
 2    input:
 3        expand(
 4            'results/sub-{subject}/func/sub-{subject}_task-{task}_run-{run}_fwhm-{fwhm}mm_bold.nii.gz',
 5            subject='001',
 6            task='rest',
 7            run=[1,2],
 8            fwhm=[5,10,15,20]
 9        )
10
11
12def calc_sigma_from_fwhm(wildcards):
13    return f'{float(wildcards.fwhm)/2.355:0.2f}'
14
15rule smooth:
16    input:
17        'data/sub-{subject}/func/sub-{subject}_task-{task}_run-{run}_bold.nii.gz'
18    params:
19        sigma = calc_sigma_from_fwhm,
20    output:
21        'results/sub-{subject}/func/sub-{subject}_task-{task}_run-{run}_fwhm-{fwhm}mm_bold.nii.gz'
22    shell:
23        'fslmaths {input} -s {params.sigma} {output}'

Step 4: adding a config file#

We have a functional workflow, but suppose you need to configure or run it on another bids dataset with different subjects, tasks, runs, or you want to run it for different smoothing values. You have to actually modify your workflow in order to do this.

It is a better practice instead to keep your configuration variables separate from the actual workflow. Snakemake supports this by allowing for a separate config file (can be YAML or JSON, here we will use YAML), where we can store any dataset specific configuration. Then to apply it for a new purpose, you can simply update the config file.

To do this, we simply add a line to our workflow:

Snakefile#
1configfile: 'config.yml'
2
3rule all:
4    input:

Snakemake will then handle reading it in, and making the configuration variables available via dictionary called config.

In our config file, we will add variables for everything in the target rule expand():

config.yaml#
 1subjects:
 2  - '001'
 3
 4tasks:
 5  - rest
 6
 7runs:
 8  - 1
 9  - 2
10
11fwhm:
12  - 5
13  - 10
14  - 15
15  - 20
16
17
18in_bold: 'data/sub-{subject}/func/sub-{subject}_task-{task}_run-{run}_bold.nii.gz'

In our Snakefile, we then need to replace these hardcoded values with config[key]. The entire updated Snakefile is shown here:

Snakefile#
 1configfile: 'config.yml'
 2
 3rule all:
 4    input:
 5        expand(
 6            'results/sub-{subject}/func/sub-{subject}_task-{task}_run-{run}_fwhm-{fwhm}mm_bold.nii.gz',
 7            subject=config['subjects'],
 8            task=config['tasks'],
 9            run=config['runs'],
10            fwhm=config['fwhm']
11        )
12
13
14def calc_sigma_from_fwhm(wildcards):
15    return f'{float(wildcards.fwhm)/2.355:0.2f}'
16
17rule smooth:
18    input:
19        config['in_bold']
20    params:
21        sigma = calc_sigma_from_fwhm
22    output:
23        'results/sub-{subject}/func/sub-{subject}_task-{task}_run-{run}_fwhm-{fwhm}mm_bold.nii.gz'
24    shell:
25        'fslmaths {input} -s {params.sigma} {output}'

After these changes, the workflow should still run just like the last step, but now you can make any changes via the config file.

Part II: Snakebids#

Now that we have a fully functioning and generic Snakemake workflow, let’s see what Snakebids can add.

Step 5: the bids() function#

The first thing we can make use of is the bids() function. This provides an easy way to generate bids filenames. This is especially useful when defining output files in your workflow and you have many bids entities.

In our existing workflow, this was our output file:

Snakefile#
22    output:
23        'results/sub-{subject}/func/sub-{subject}_task-{task}_run-{run}_fwhm-{fwhm}mm_bold.nii.gz'

To create the same path using bids(), we just need to specify the root directory (results), all the bids tags (subject, task, run, fwhm), and the suffix (which includes the extension):

Snakefile#
31    output:
32        bids(
33            root='results',
34            subject='{subject}',
35            task='{task}',
36            run='{run}',
37            fwhm='{fwhm}',
38            suffix='bold.nii.gz',
39        )

Note

To make a snakemake wildcard, we wrapped the 'value' in curly braces (e.g. '{value}').

Note

The entities you supply in the bids() function do not have to be in the BIDS specification, e.g. fwhm is not in the spec. But if you do use entities that are in the BIDS specification, snakebids will ensure they go in the correct order (e.g. sub_*-ses_*-run_*...). Non-standard entities will be added to the end of the path, just before the suffix, in the order they were defined.

The Snakefile with the output filename replaced (in both rules) is below:

Snakefile#
 1from snakebids import bids
 2
 3configfile: 'config.yml'
 4
 5rule all:
 6    input:
 7        expand(
 8            bids(
 9                root='results',
10                subject='{subject}',
11                task='{task}',
12                run='{run}',
13                fwhm='{fwhm}',
14                suffix='bold.nii.gz'
15            ),
16            subject=config['subjects'],
17            task=config['tasks'],
18            run=config['runs'],
19            fwhm=config['fwhm'],
20        )
21
22
23def calc_sigma_from_fwhm(wildcards):
24    return f'{float(wildcards.fwhm)/2.355:0.2f}'
25
26rule smooth:
27    input:
28        config['in_bold']
29    params:
30        sigma = calc_sigma_from_fwhm
31    output:
32        bids(
33            root='results',
34            subject='{subject}',
35            task='{task}',
36            run='{run}',
37            fwhm='{fwhm}',
38            suffix='bold.nii.gz',
39        )
40    shell:
41        'fslmaths {input} -s {params.sigma} {output}'

Step 6: parsing the BIDS dataset#

So far, we have had to manually enter the path to input bold file in the config file, and also specify what subjects, tasks, and runs we want processed. Can’t we use the fact that we have a BIDS dataset to automate this a bit more?

With Snakemake, there are ways to glob the files to figure out what wildcards are present (e.g. glob_wildcards()), however, this is not so straightforward with BIDS, since filenames in BIDS often have optional components. E.g. some datasets may have a ses tag/sub-directory, and others do not. Also there are often optional user-defined values, such as the acq tag, that a workflow in most cases should ignore. Thus, the input that we use in our workflow, in_bold, that has wildcards to be generic, would need to be altered for any given BIDS dataset, along with the workflow itself, making this automated BIDS parsing difficult within Snakemake.

Snakebids lets you parse a bids dataset (using pybids under the hood) using a configfile that contains the required wildcards, along with data structures that specify all the wildcard values for all the subjects. This, in combination with the bids() function, can allow one to make snakemake workflows that are compatible with any general bids dataset.

To add this parsing to the workflow, we call the generate_inputs() function before our rules are defined, and pass along some configuration data to specify the location of the bids directory (bids_dir) and the inputs we want to parse for the workflow (pybids_inputs). The function returns a BidsDataset, which we’ll assign to a variable called inputs:

Snakefile#
1from snakebids import bids, generate_inputs
2
3configfile: 'config.yml'
4
5inputs = generate_inputs(
6    bids_dir=config['bids_dir'],
7    pybids_inputs=config['pybids_inputs'],
8)
9

Note

Snakebids has transitioned to a new format for generate_inputs(). To get access to the old dict style return, the use_bids_inputs parameter must be set to False. A tutorial for the old syntax can be found on the v0.5.0 docs.

The config variables we need pre-defined are as follows::

config.yml#
 1bids_dir: 'data'
 2
 3fwhm:
 4  - 5
 5  - 10
 6  - 15
 7  - 20
 8
 9pybids_inputs:
10  bold:
11    filters:
12      suffix: 'bold'
13      extension: '.nii.gz'
14      datatype: 'func'
15    wildcards:
16      - subject
17      - session
18      - acquisition
19      - task
20      - run

The pybids_inputs dict defines what types of inputs the workflow can make use of (i.e. the top-level keys, bold in this case), and for each input, how to filter for them (i.e. the filters dict), and what BIDS entities to replace with wildcards in the snakemake workflow (i.e. the wildcards dict).

Note

The filters dict is passed directly to the get() function in pybids, and thus is quite customizable.

Note

Entries in the wildcards list do not have to be in your bids dataset, but if they are, then they will be converted into wildcards (i.e. task-{task}) if they are in the filenames. The names for these also correspond with pybids (e.g. acquisition maps to acq).

The BidsDataset class returned by generate_inputs() summarizes the wildcards found in your bids dataset and has parameters to plug those wildcards into your workflow. To investigate it, add a print statement following generate_inputs(...):

Snakefile#
 1from snakebids import bids, generate_inputs
 2
 3configfile: 'config.yml'
 4
 5inputs = generate_inputs(
 6    bids_dir=config['bids_dir'],
 7    pybids_inputs=config['pybids_inputs'],
 8)
 9
10print(inputs)
11

Run the workflow:

$ snakemake -nq
BidsDataset({
    "bold": BidsComponent(
        name="bold",
        path="/path/to/data/sub-{subject}/func/sub-{subject}_task-{task}_run-{run}_bold.nii.gz",
        zip_lists={
            "subject": ["001",  "001" ],
            "task":    ["rest", "rest"],
            "run":     ["1",    "2"   ],
        },
    ),
})

As you can see, BidsDataset is just a special kind of dict. Its keys refer to the names of the input types you specified in the config file (in pybids_inputs). You can test this by running print(list(inputs.keys)). Each value contains an object summarizing that input type. We refer to these input types, and the objects that describe them, as BidsComponents.

Each BidsComponent has three primary attributes. .name is the name of the component, this will be the same as the dictionary key in the dataset. .path is the generic path of the component. Note the wildcards: {subject}, {task}, and {run}. These wildcards can be substituted for values that will uniquely define each specific path. .zip_lists contains these unique values. It’s a simple dict whose keys are bids entities and whose values are lists of entity-values. Note the tabular format that printed in your console: each of the columns of this “table” correspond to the entity-values of one specific file.

Notice that inputs['bold'].path is the same as the path we wrote under in_bold: in our config.yaml file in step 4. In fact, we can go ahead and replace config['in_bold'] in our Snakemake file with inputs['bold'].path and delete in_bold from config.yaml.

Snakefile#
32rule smooth:
33    input:
34        inputs['bold'].path
35    params:

Step 7: using input wildcards#

BidsComponent.path already grants us a lot of flexibility, but we can still do more! In addition to the three main attributes of BidsComponents already described, the class offers a number of special properties we can use in our workflows. First, we’ll look at BidsComponent.wildcards. This is a dict that maps each entity to the brace-wrapped {wildcards} we specified in pybids_config. If you printed this value in our test workflow, it would look like this:

inputs['bold'].wildcards == {
    'subject': '{subject}',
    'task': '{task}',
    'run': '{run}'
}

This is super useful when combined with bids(), as we can use the keyword expansion **inputs["<input_name>"].wildcards to set all the wildcard parameters to the bids() function. Thus, we can make our workflow even more general, by replacing this:

Snakefile#
32rule smooth:
33    input:
34        inputs['bold'].path
35    params:
36        sigma = calc_sigma_from_fwhm
37    output:
38        bids(
39            root='results',
40            subject='{subject}',
41            task='{task}',
42            run='{run}',
43            fwhm='{fwhm}',
44            suffix='bold.nii.gz'
45        )
46    shell:
47        'fslmaths {input} -s {params.sigma} {output}'

with this:

Snakefile#
27rule smooth:
28    input:
29        inputs['bold'].path
30    params:
31        sigma = calc_sigma_from_fwhm
32    output:
33        bids(
34            root='results',
35            fwhm='{fwhm}',
36            suffix='bold.nii.gz',
37            **inputs['bold'].wildcards
38        )
39    shell:
40        'fslmaths {input} -s {params.sigma} {output}'

This effectively ensures that any bids entities from the input filenames (that are listed as pybids wildcards) get carried over to the output filenames. Note that we still have the ability to add on additional entities, such as fwhm here, and set the root directory and suffix.

Finally, we can use our BidsComponents to easily expand over the entity values found in our dataset using BidsComponent.expand(). This method gets used instead of the snakemake expand() function:

Snakefile#
10rule all:
11    input:
12        inputs['bold'].expand(
13            bids(
14                root='results',
15                fwhm='{fwhm}',
16                suffix='bold.nii.gz',
17                **inputs['bold'].wildcards
18            ),
19            fwhm=config['fwhm'],
20        )
21
22

Note

BidsComponent.expand() still uses snakemake’s expand() under the hood, but applies extra logic to ensure only entity groups actually found in your dataset are used. If need to expand over additional wildcards, just add them as keyword args. They’ll expand over every possible combination, just like snakemake’s expand().

For reference, here is the updated config file and Snakefile after these changes:

config.yml#
 1bids_dir: 'data'
 2
 3fwhm:
 4  - 5
 5  - 10
 6  - 15
 7  - 20
 8
 9pybids_inputs:
10  bold:
11    filters:
12      suffix: 'bold'
13      extension: '.nii.gz'
14      datatype: 'func'
15    wildcards:
16      - subject
17      - session
18      - acquisition
19      - task
20      - run
Snakefile#
 1from snakebids import bids, generate_inputs
 2
 3configfile: 'config.yml'
 4
 5inputs = generate_inputs(
 6    bids_dir=config['bids_dir'],
 7    pybids_inputs=config['pybids_inputs'],
 8)
 9
10rule all:
11    input:
12        inputs['bold'].expand(
13            bids(
14                root='results',
15                fwhm='{fwhm}',
16                suffix='bold.nii.gz',
17                **inputs['bold'].wildcards
18            ),
19            fwhm=config['fwhm'],
20        )
21
22
23def calc_sigma_from_fwhm(wildcards):
24    return f'{float(wildcards.fwhm)/2.355:0.2f}'
25
26
27rule smooth:
28    input:
29        inputs['bold'].path
30    params:
31        sigma = calc_sigma_from_fwhm
32    output:
33        bids(
34            root='results',
35            fwhm='{fwhm}',
36            suffix='bold.nii.gz',
37            **inputs['bold'].wildcards
38        )
39    shell:
40        'fslmaths {input} -s {params.sigma} {output}'

Step 8: creating a command-line executable#

Now that we have pybids parsing to dynamically configure our workflow inputs based on our BIDS dataset, we are ready to turn our workflow into a BIDS App. BIDS Apps are command-line apps with a standardized interface (e.g. three required positional arguments: bids_directory, output_directory, and analysis_level).

We do this in snakebids by creating a python script containing a bidsapp.app built with the Snakemake integration plugin SnakemakeBidsApp. An example of this run.py script is shown below.

run.py#
 1#!/usr/bin/env python3
 2from pathlib import Path
 3
 4from snakebids import bidsapp, plugins
 5
 6app = bidsapp.app(
 7    [
 8        plugins.SnakemakeBidsApp(Path(__file__).resolve().parent),
 9    ]
10)
11
12if __name__ == "__main__":
13    app.run()

This creates a bidsapp with all the standard arguments. The usage will look something like this:

./run.py INPUT_DATASET OUTPUT_DATASET [participant|group] [--derivatives] [--participant-label LABEL...]

A more complete description of bidsapp usage can be found at Running Snakebids.

Additional argument can be added using the config.yml. For instance, we can turn our fwhm setting into a CLI parameter with the following:

config.yml#
24parse_args:
25  --fwhm:
26    help: >
27      Set the full-width-half-maximum values that should be used for smoothing.
28    type: int
29    nargs: +
30    default:
31      - 5
32      - 10
33      - 15
34      - 20

Snakebids uses the argparse module, and each entry in this parse_args dict becomes a call to add_argument() from argparse.ArgumentParser. When you run the workflow, snakebids adds the named argument values to the config dict, so your workflow can make use of it as if you had manually added the variable to your configfile. See parse_args for more details.

In the above example, snakebids will automatically insert a key called fwhm into the snakemake config containing the values provided from the command line (or the default, if no values are provided).

The analysis_level positional argument can also be modified in the config. The available levels are in an analysis_levels list in the config. Specific snakemake targets can be mapped to these levels in the targets_by_analysis_level dict:

config.yml#
17analysis_levels:
18 - participant
19
20targets_by_analysis_level:
21  participant:
22    - ''  # if '', then the first rule is run

Note: since we specified a '' for the target rule, no target rule will be specified, so snakemake will just default to the first rule in the workflow.

Make the run.py script executable (chmod a+x run.py) and try running it now.

The updated configfile is here (the Snakefile did not change in this step):

config.yml#
 1bids_dir: 'data'
 2
 3
 4pybids_inputs:
 5  bold:
 6    filters:
 7      suffix: 'bold'
 8      extension: '.nii.gz'
 9      datatype: 'func'
10    wildcards:
11      - subject
12      - session
13      - acquisition
14      - task
15      - run
16
17analysis_levels:
18 - participant
19
20targets_by_analysis_level:
21  participant:
22    - ''  # if '', then the first rule is run
23
24parse_args:
25  --fwhm:
26    help: >
27      Set the full-width-half-maximum values that should be used for smoothing.
28    type: int
29    nargs: +
30    default:
31      - 5
32      - 10
33      - 15
34      - 20