Any deep learning (DL) pipeline involves the following steps:

  1. Data preparation
  2. Split data into training, validation, and testing
  3. Customize the training parameters

A detailed data flow diagram is presented in this link.

GaNDLF addresses all of these, and the information is divided as described in the following section.
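
As a rough illustration of step 2, the sketch below splits a list of subject IDs into training, validation, and testing cohorts. It is a generic example and not GaNDLF's own splitting logic; the function name `split_subjects`, the default fractions, and the fixed seed are assumptions chosen for illustration.

```python
import random


def split_subjects(subject_ids, val_fraction=0.15, test_fraction=0.15, seed=42):
    """Shuffle subject IDs and split them into train/validation/test lists.

    Hypothetical helper for illustration; not part of GaNDLF.
    """
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_test = int(len(ids) * test_fraction)
    n_val = int(len(ids) * val_fraction)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test


if __name__ == "__main__":
    subjects = [f"Patient_{i:03d}" for i in range(1, 21)]
    train, val, test = split_subjects(subjects)
    print(len(train), len(val), len(test))
```

Splitting by subject rather than by image keeps all scans of one subject in the same cohort.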


Please follow the installation instructions to install GaNDLF. When the installation is complete, you should end up with a shell prompt that looks like the following, which indicates that the GaNDLF virtual environment has been activated:

(venv_gandlf) $> ### subsequent commands go here


Preparing the Data

Anonymize Data

A major reason to anonymize data is to ensure that trained models do not inadvertently encode protected health information [1,2]. GaNDLF can anonymize single images or a collection of images using the gandlf_anonymizer script. It can be used as follows:

# continue from previous shell
(venv_gandlf) $> python gandlf_anonymizer \
  # -h, --help         show help message and exit
  -c ./samples/config_anonymizer.yaml \ # anonymizer configuration - needs to be a valid YAML (check syntax using
  -i ./input_dir_or_file \ # input directory containing series of images to anonymize or a single image
  -o ./output_dir_or_file # output directory to save anonymized images or a single output image file


Cleanup/Harmonize/Curate Data

It is highly recommended that the dataset you want to train/infer on has been harmonized. The following requirements should be considered:

Recommended tools for tackling all aforementioned curation and annotation tasks:


Offline Patch Extraction (for histology images only)

GaNDLF can be used to convert a Whole Slide Image (WSI) with or without a corresponding label map to patches/tiles using GaNDLF’s integrated patch miner, which would need the following files:

  1. A configuration file that dictates how the patches/tiles will be extracted. A sample configuration to extract patches is presented here. The options that can be defined in the configuration are as follows:
    • scale: scale at which operations such as tissue mask calculation happen; defaults to 16.
    • patch_size: defines the size of the patches to extract, should be a tuple type of integers (e.g., [256,256]) or a string containing patch size in microns (e.g., [100m,100m]).
    • num_patches: defines the number of patches to extract; use -1 to mine until exhaustion.
    • value_map: mapping RGB values in label image to integer values for training; defaults to None.
    • read_type: either random or sequential (latter is more efficient); defaults to random.
    • overlap_factor: Portion of patches that are allowed to overlap (0->1); defaults to 0.0.
    • num_workers: number of workers (note that this does not scale according to the number of threads available on your machine) to use for patch extraction; defaults to 1.
  2. A CSV file with the following columns:
    • SubjectID: the ID of the subject for the WSI
    • Channel_0: the WSI file
    • Label: (optional) the label map file
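
A minimal sketch of writing such a CSV with Python's standard `csv` module is shown here. The column names come from the list above; the function name `write_patch_miner_csv` is hypothetical and not part of GaNDLF, and any paths you pass in are your own.

```python
import csv


def write_patch_miner_csv(rows, output_csv, with_labels=True):
    """Write a patch-miner input CSV with the columns described above.

    `rows` is an iterable of (subject_id, wsi_path, label_path) tuples;
    label_path is ignored when with_labels is False.
    Hypothetical helper for illustration; not part of GaNDLF.
    """
    header = ["SubjectID", "Channel_0"] + (["Label"] if with_labels else [])
    with open(output_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for subject_id, wsi_path, label_path in rows:
            row = [subject_id, wsi_path]
            if with_labels:
                row.append(label_path)  # optional label map column
            writer.writerow(row)
    return header
```

Calling it with one tuple per slide and an output path such as ./exp_patchMiner/input.csv produces a file that can be passed to the -i option of the patch miner.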

Once these files are present, the patch miner can be run using the following command:

# continue from previous shell
(venv_gandlf) $> python gandlf_patchMiner \
  # -h, --help         show help message and exit
  -c ./exp_patchMiner/config.yaml \ # patch extraction configuration - needs to be a valid YAML (check syntax using
  -i ./exp_patchMiner/input.csv \ # data in CSV format 
  -o ./exp_patchMiner/output_dir/ # output directory


Running preprocessing before training/inference (optional)

Running preprocessing before training/inference is optional, but recommended. It will significantly reduce the computational footprint during training/inference at the expense of larger storage requirements. To run preprocessing before training/inference you can use the following command, which will save the processed data in ./experiment_0/output_dir/ with a new data CSV and the corresponding model configuration:

# continue from previous shell
(venv_gandlf) $> python gandlf_preprocess \
  # -h, --help         show help message and exit
  -c ./experiment_0/model.yaml \ # model configuration - needs to be a valid YAML (check syntax using
  -i ./experiment_0/train.csv \ # data in CSV format 
  -o ./experiment_0/output_dir/ # output directory


Constructing the Data CSV

This application can leverage multiple channels/modalities for training while using a multi-class segmentation file. The expected format is shown as an example in samples/sample_train.csv and needs to be structured with the following header format (which shows a CSV with N subjects, each having X channels/modalities that need to be processed):


Using the gandlf_constructCSV application

To make the process of creating the CSV easier, we have provided a utility application called gandlf_constructCSV. This script works when the data is arranged in the following format (example shown of the data directory arrangement from the Brain Tumor Segmentation (BraTS) Challenge):

└───Patient_001 # this is constructed from the ${PatientID} header of CSV
│   │ Patient_001_brain_t1.nii.gz
│   │ Patient_001_brain_t1ce.nii.gz
│   │ Patient_001_brain_t2.nii.gz
│   │ Patient_001_brain_flair.nii.gz
│   │ Patient_001_seg.nii.gz # optional for segmentation tasks
└───Patient_002 # this is constructed from the ${PatientID} header of CSV
│   │ Patient_002_brain_t1.nii.gz
│   │ Patient_002_brain_t1ce.nii.gz
│   │ Patient_002_brain_t2.nii.gz
│   │ Patient_002_brain_flair.nii.gz
│   │ Patient_002_seg.nii.gz # optional for segmentation tasks
└───JaneDoe # this is constructed from the ${PatientID} header of CSV
│   │ randomFileName_0_t1.nii.gz # the string identifier needs to be the same for each modality
│   │ randomFileName_1_t1ce.nii.gz
│   │ randomFileName_2_t2.nii.gz
│   │ randomFileName_3_flair.nii.gz
│   │ randomFileName_seg.nii.gz # optional for segmentation tasks

The following command shows how the script works:

# continue from previous shell
(venv_gandlf) $> python gandlf_constructCSV \
  # -h, --help         show help message and exit
  -i $DATA_DIRECTORY \ # this is the main data directory
  -c _t1.nii.gz,_t1ce.nii.gz,_t2.nii.gz,_flair.nii.gz \ # an example image identifier for 4 structural brain MR sequences for BraTS, and can be changed based on your data
  -l _seg.nii.gz \ # an example label identifier - not needed for regression/classification, and can be changed based on your data
  -o ./experiment_0/train_data.csv # output CSV to be used for training
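
To make the matching rule concrete, here is a rough Python sketch of how files in such a directory tree could be matched to channels by their suffix identifiers. It is not the implementation of gandlf_constructCSV; the function name `collect_rows` and the output column names (`SubjectID`, `Channel_N`, `Label`) are assumptions for illustration.

```python
import os


def collect_rows(data_dir, channel_ids, label_id=None):
    """Build one CSV-like row per subject folder by matching file suffixes.

    Hypothetical helper for illustration; not the gandlf_constructCSV code.
    """
    suffixes = list(channel_ids) + ([label_id] if label_id else [])
    header = ["SubjectID"] + [f"Channel_{i}" for i in range(len(channel_ids))]
    if label_id:
        header.append("Label")
    rows = []
    for subject in sorted(os.listdir(data_dir)):
        subject_dir = os.path.join(data_dir, subject)
        if not os.path.isdir(subject_dir):
            continue  # only sub-folders are treated as subjects
        files = sorted(os.listdir(subject_dir))
        row = [subject]
        for suffix in suffixes:
            matches = [name for name in files if name.endswith(suffix)]
            # Leave the cell empty when no file carries this identifier.
            row.append(os.path.join(subject_dir, matches[0]) if matches else "")
        rows.append(row)
    return header, rows
```

Because matching is done on the end of the file name, the part before the identifier (for example randomFileName_0) can differ between modalities, as in the JaneDoe folder above.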



Customize the Training

GaNDLF requires a YAML-based configuration that controls various aspects of the training/inference process. Multiple sample configurations are provided that users can take as a baseline for further customization. A list of the available samples is presented as follows:



Running multiple experiments (optional)

  1. The gandlf_configGenerator script can be used to generate a grid of configurations for tuning the hyperparameters of a baseline configuration that works for your dataset and problem.
  2. Use a strategy file (an example is shown in samples/config_generator_strategy.yaml).
  3. Provide the baseline configuration which has enabled you to successfully train a model for 1 epoch for your dataset and problem at hand (regardless of the efficacy).
  4. Run the following command:
# continue from previous shell
(venv_gandlf) $> python gandlf_configGenerator \
  # -h, --help         show help message and exit
  -c ./samples/config_all_options.yaml \ # baseline configuration
  -s ./samples/config_generator_strategy.yaml \ # strategy file
  -o ./all_experiments/ # output directory
  5. For example, to generate 4 configurations that leverage unet and resunet architectures for learning rates of [0.1,0.01], you can use the following strategy file:
     architecture: [unet, resunet],
     learning_rate: [0.1, 0.01]
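
The grid in this example is the Cartesian product of the listed values, which is why two architectures and two learning rates give 4 configurations. The sketch below illustrates that expansion with `itertools.product`; the function name `expand_grid` is hypothetical and this is not the gandlf_configGenerator code.

```python
import itertools


def expand_grid(strategy):
    """Return every combination of the listed hyperparameter values.

    Hypothetical helper for illustration; not the gandlf_configGenerator code.
    """
    keys = list(strategy)
    value_lists = [strategy[key] for key in keys]
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]


if __name__ == "__main__":
    strategy = {"architecture": ["unet", "resunet"], "learning_rate": [0.1, 0.01]}
    for config in expand_grid(strategy):
        print(config)
```

Each extra hyperparameter multiplies the number of generated configurations, so the grid grows quickly as more options are added.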


Running GaNDLF (Training/Inference)

You can use the following code snippet to run GaNDLF:

# continue from previous shell
(venv_gandlf) $> python gandlf_run \
  ## -h, --help         show help message and exit
  ## -v, --version      Show program's version number and exit.
  -c ./experiment_0/model.yaml \ # model configuration - needs to be a valid YAML (check syntax using
  -i ./experiment_0/train.csv \ # data in CSV format 
  -m ./experiment_0/model_dir/ \ # model directory (i.e., the `modeldir`) where the output of the training will be stored, created if not present
  -t True \ # True == train, False == inference
  -d cuda # ensure CUDA_VISIBLE_DEVICES env variable is set for GPU device, use 'cpu' for CPU workloads
  # -rt , --reset # [optional] completely resets the previous run by deleting `modeldir`
  # -rm , --resume # [optional] resume previous training by only keeping model dict in `modeldir`

Special notes for Inference for Histology images


Parallelize the Training

Multi-GPU training

GaNDLF enables relatively straightforward multi-GPU training. Simply set the CUDA_VISIBLE_DEVICES environment variable to the list of GPUs you want to use, and pass cuda as the device to the gandlf_run script. For example, if you want to use GPUs 0, 1, and 2, you would set CUDA_VISIBLE_DEVICES=0,1,2 and pass -d cuda to the gandlf_run script.
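
If you launch training from a Python wrapper, the same setup could be sketched as follows. The helper `build_gandlf_run` is hypothetical; it only assembles the gandlf_run command and options shown earlier in this guide together with an environment that has CUDA_VISIBLE_DEVICES set, and it does not start anything by itself.

```python
import os


def build_gandlf_run(gpu_ids, config, data_csv, model_dir, train=True):
    """Return (command, env) for a multi-GPU gandlf_run invocation.

    Hypothetical helper for illustration; it only assembles the command
    and environment and does not launch the training job.
    """
    env = dict(os.environ)
    # Comma-separated list of GPU indices, e.g. "0,1,2"
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(gpu) for gpu in gpu_ids)
    command = [
        "python", "gandlf_run",
        "-c", config,      # model configuration
        "-i", data_csv,    # data in CSV format
        "-m", model_dir,   # model directory
        "-t", str(train),  # True == train, False == inference
        "-d", "cuda",      # device
    ]
    return command, env
```

The returned pair can be handed to `subprocess.run(command, env=env, check=True)` from within the activated GaNDLF environment.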

Distributed training

Distributed training is a more difficult problem to address, since there are multiple ways to configure a high-performance computing cluster (SLURM, OpenHPC, Kubernetes, and so on). Owing to this variety, GaNDLF allows multiple training jobs to be submitted in a relatively straightforward manner using the command-line interface of each site’s configuration. Simply populate the parallel_compute_command in the configuration with the specific command to run before the training job, and GaNDLF will use this string to submit the training job.


Expected Output(s)


Once your model is trained, you should see the following output:

# continue from previous shell
(venv_gandlf) $> ls ./experiment_0/model_dir/
data_${cohort_type}.csv  # data CSV used for the different cohorts, which can be either training/validation/testing
data_${cohort_type}.pkl  # same as above, but in pickle format
logs_${cohort_type}.csv  # logs for the different cohorts that contain the various metrics, which can be either training/validation/testing
${architecture_name}_best.pth.tar # the best model in native PyTorch format
${architecture_name}_latest.pth.tar # the latest model in native PyTorch format
${architecture_name}_initial.pth.tar # the initial model in native PyTorch format
${architecture_name}_initial.{onnx/xml/bin} # [optional] if ${architecture_name} is supported, the graph-optimized best model in ONNX format
# other files dependent on if training/validation/testing output was enabled in configuration
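
A quick way to confirm that a run produced these files is sketched below. The helper `missing_outputs` is hypothetical, and it assumes that ${cohort_type} expands to the literal strings training, validation, and testing; adjust the names if your model directory uses different ones.

```python
import os


def missing_outputs(model_dir, architecture_name, cohorts=("training", "validation")):
    """List expected output files that are absent from model_dir.

    Hypothetical helper for illustration; the cohort strings are assumptions.
    """
    expected = [
        f"{architecture_name}_{tag}.pth.tar" for tag in ("best", "latest", "initial")
    ]
    for cohort in cohorts:
        expected += [f"data_{cohort}.csv", f"data_{cohort}.pkl", f"logs_{cohort}.csv"]
    present = set(os.listdir(model_dir))
    return [name for name in expected if name not in present]
```

An empty list means every expected file was found; the optional ONNX export is not checked because it depends on whether the architecture is supported.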



Plot the final results

Once training is finished, GaNDLF can collect all the statistics from the final models for the testing and validation datasets and plot them. The gandlf_collectStats script can be used for plotting:

# continue from previous shell
(venv_gandlf) $> python gandlf_collectStats \
  -m /path/to/trained/models \  # directory which contains testing and validation models
  -o ./experiment_0/output_dir_stats/  # output directory to save stats and plot


M3D-CAM usage

The integration of the M3D-CAM library into GaNDLF enables the generation of attention maps for 3D/2D images in the validation epoch for classification and segmentation tasks. To activate M3D-CAM you just need to add the following parameter to the config:

  backend: "gcam",
  layer: "auto"

You can choose one of the following backends:

Optionally one can also change the name of the layer for which the attention maps should be generated. The default behavior is auto which chooses the last convolutional layer.

All generated attention maps can be found in the experiment’s output directory. Link to the original repository:


Post-Training Model Optimization

If you have a model previously trained using GaNDLF that you wish to run graph optimizations on, you can use the gandlf_optimizeModel script to do so. The following command shows how it works:

# continue from previous shell
(venv_gandlf) $> python gandlf_optimizeModel \
  -m /path/to/trained/${architecture_name}_best.pth.tar \  # path to the trained model file to optimize
  -c ./experiment_0/config_used_to_train.yaml  # the config file used to train the model

If ${architecture_name} is supported, the optimized model will get generated in the model directory, with the name ${architecture_name}_optimized.onnx.



GaNDLF provides the ability to deploy models into easy-to-share, easy-to-use formats – users of your model do not even need to install GaNDLF. Currently, Docker images are supported (which can be converted to Apptainer/Singularity format). These images meet the MLCube interface. This allows your algorithm to be used in a consistent manner with other machine learning tools.

The resulting image contains your specific version of GaNDLF (including any custom changes you have made) and your trained model and configuration. This ensures that upstream changes to GaNDLF will not break compatibility with your model.

To deploy a model, simply run the gandlf_deploy command after training a model. You will need the Docker engine installed to build Docker images. This will create the image and, for MLCubes, generate an MLCube directory complete with an mlcube.yaml specifications file, along with the workspace directory copied from a pre-existing template.

# continue from previous shell
(venv_gandlf) $> python gandlf_deploy \
  ## -h, --help         show help message and exit
  -c ./experiment_0/model.yaml \ # Configuration to bundle with the model (you can recover it with gandlf_recoverConfig first if needed)
  -m ./experiment_0/model_dir/ \ # model directory (i.e., modeldir)
  --target docker \ # the target platform (--help will show all available targets)
  --mlcube-root ./my_new_mlcube_dir \ # Directory containing mlcube.yaml (used to configure your image base)
  -o ./output_dir # Output directory where a new mlcube.yaml file to be distributed with your image will be created


Running with Docker

The usage of GaNDLF remains generally the same even from Docker, but there are a few extra considerations.

Once you have pulled the GaNDLF image, it will have a tag, such as cbica/gandlf:latest-cpu. Run the following command to list your images and ensure GaNDLF is present:

docker image ls

You can invoke docker run with the appropriate tag to run GaNDLF:

docker run -it --rm --name gandlf cbica/gandlf:latest-cpu ${gandlf command and parameters go here!}

Remember that arguments/options for Docker itself go before the image tag, while the command and arguments for GaNDLF go after the image tag. For more details and options, see the Docker run documentation.

However, most commands that require files or directories as input or output will fail, because the container, by default, cannot read or write files on your machine for security reasons. To fix this, you need to mount specific locations in the filesystem.


Mounting Input and Output

The container is basically a filesystem of its own. To make your data available to the container, you will need to mount in files and directories. Generally, it is useful to mount at least an input directory (as read-only) and an output directory. See the Docker bind mount instructions for more information.

For example, you might run:

docker run -it --rm --name gandlf --volume /home/researcher/gandlf_input:/input:ro --volume /home/researcher/gandlf_output:/output cbica/gandlf:latest-cpu [command and args go here]
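
When such commands are generated from a script, the ordering rule (Docker options before the image tag, GaNDLF command after it) is easy to get wrong. The following sketch assembles the argument list in the right order; the helper name `docker_run_command` and its parameters are hypothetical and chosen only for illustration.

```python
def docker_run_command(image, gandlf_args, volumes=(), name="gandlf"):
    """Assemble a `docker run` argument list.

    Docker options go before the image tag; the GaNDLF command goes after it.
    `volumes` is an iterable of (host_path, container_path, read_only) tuples.
    Hypothetical helper for illustration.
    """
    command = ["docker", "run", "-it", "--rm", "--name", name]
    for host_path, container_path, read_only in volumes:
        spec = f"{host_path}:{container_path}" + (":ro" if read_only else "")
        command += ["--volume", spec]
    command.append(image)         # everything after this is run inside the container
    command += list(gandlf_args)  # GaNDLF command and its arguments
    return command
```

Joining the returned list with spaces reproduces the form of the command above, and the list can also be passed directly to `subprocess.run`.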

Remember that the process running in the container only considers the filesystem inside the container, which is structured differently from that of your host machine. Therefore, you will need to give paths relative to the mount point destination. Additionally, any paths used internally by GaNDLF will refer to locations inside the container. This means that data CSVs produced by the gandlf_constructCSV script will need to be made from the container and with input in the same locations. Expanding on our last example:

docker run -it --rm --name dataprep \
  --volume /home/researcher/gandlf_input:/input:ro \ # input data is mounted as read-only
  --volume /home/researcher/gandlf_output:/output \ # output data is mounted as read-write
  cbica/gandlf:latest-cpu \ # change to appropriate docker image tag
  gandlf_constructCSV \ # standard construct CSV API starts
  --inputDir /input/data \
  --outputFile /output/data.csv \
  --channelsID _t1.nii.gz \
  --labelID _seg.nii.gz

The previous command will generate a data CSV file that you can safely edit outside the container (such as by adding a ValueToPredict column). Then, you can refer to the same file when running again:

docker run -it --rm --name training \
  --volume /home/researcher/gandlf_input:/input:ro \ # input data is mounted as read-only
  --volume /home/researcher/gandlf_output:/output \ # output data is mounted as read-write
  cbica/gandlf:latest-cpu \ # change to appropriate docker image tag
  gandlf_run --train True \ # standard training API starts
  --config /input/config.yml \
  --inputdata /output/data.csv \
  --modeldir /output/model
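
The translation between host paths and the paths seen inside the container can be sketched as below, using the two mounts from the examples above. The helper `to_container_path` is hypothetical and assumes POSIX-style paths.

```python
from pathlib import PurePosixPath


def to_container_path(host_path, mounts):
    """Translate a host path into the path seen inside the container.

    `mounts` maps host directories to container mount points.
    Raises ValueError if the path is not under any mounted directory.
    Hypothetical helper for illustration; assumes POSIX-style paths.
    """
    path = PurePosixPath(host_path)
    for host_dir, container_dir in mounts.items():
        try:
            relative = path.relative_to(PurePosixPath(host_dir))
        except ValueError:
            continue  # not under this mount, try the next one
        return str(PurePosixPath(container_dir) / relative)
    raise ValueError(f"{host_path} is not inside any mounted directory")


if __name__ == "__main__":
    mounts = {
        "/home/researcher/gandlf_input": "/input",
        "/home/researcher/gandlf_output": "/output",
    }
    print(to_container_path("/home/researcher/gandlf_output/data.csv", mounts))
```

A path that is not under any mount raises an error, which mirrors the fact that the container cannot see it at all.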


Special Case for Training

If you want to train on an existing model that is inside the GaNDLF container (such as in an MLCube container created by gandlf_deploy), the output will be written to a location embedded inside the container. Since you cannot mount something into that spot without overwriting the model, you can instead use the built-in docker cp command to extract the model afterward. For example, you can fine-tune a model on your own data using the following commands as a starting point:

# Run training on your new data
docker run --name gandlf_training -v /my/input/data:/input mlcommons/gandlf-pretrained:0.0.1 gandlf_run -m /embedded_model/ [...] # Do not include "--rm" option!
# Copy the finetuned model out of the container, to a location on the host
docker cp gandlf_training:/embedded_model /home/researcher/extracted_model
# Now you can remove the container to clean up
docker rm -f gandlf_training

Enabling GPUs

Some special arguments need to be passed to Docker to enable it to use your GPU. With Docker version 19.03 or later, you can use docker run --gpus all to expose all GPUs to the container. See the NVIDIA Docker documentation for more details.

If using CUDA, GaNDLF also expects the environment variable CUDA_VISIBLE_DEVICES to be set. To use the same settings as your host machine, simply add -e CUDA_VISIBLE_DEVICES to your docker run command. For example:

docker run --gpus all -e CUDA_VISIBLE_DEVICES -it --rm --name gandlf cbica/gandlf:latest-cuda113 gandlf_run --device cuda [...]

This can be replicated for ROCm on AMD GPUs by following the instructions to set up the ROCm Container Toolkit.



GaNDLF, and GaNDLF-created models, may be distributed as an MLCube. This involves distributing an mlcube.yaml file. That file can be specified when using the MLCube runners. The runner will perform many aspects of configuring your container for you. Currently, only the mlcube_docker runner is supported.

See the MLCube documentation for more details.
