For any DL pipeline, the following flow needs to be performed: prepare and harmonize the data, construct the data CSV, customize the training configuration, run training/inference, and optionally collect statistics, optimize, and deploy the resulting model. A detailed data flow diagram is presented at this link.
GaNDLF addresses all of these steps, and the information is divided as described in the following sections.
Please follow the installation instructions to install GaNDLF. When the installation is complete, you should end up with a shell prompt that looks like the following, which indicates that the GaNDLF virtual environment has been activated:
(venv_gandlf) $> ### subsequent commands go here
A major reason why one would want to anonymize data is to ensure that trained models do not inadvertently encode protected health information [1,2]. GaNDLF can anonymize single images or collections of images using the gandlf_anonymizer script, which can be used as follows:
# continue from previous shell
(venv_gandlf) $> python gandlf_anonymizer \
# -h, --help show help message and exit
-c ./samples/config_anonymizer.yaml \ # anonymizer configuration - needs to be a valid YAML (check syntax using https://yamlchecker.com/)
-i ./input_dir_or_file \ # input directory containing series of images to anonymize or a single image
-o ./output_dir_or_file # output directory to save anonymized images or a single output image file
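For example, anonymizing a single image file would look like the following sketch (the input/output paths are illustrative):

# continue from previous shell
(venv_gandlf) $> python gandlf_anonymizer \
-c ./samples/config_anonymizer.yaml \ # anonymizer configuration
-i ./input_data/subject_001.dcm \ # a single input image
-o ./output_data/subject_001_anon.dcm # the single anonymized output image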
It is highly recommended that the dataset you want to train/infer on has been harmonized. The following requirements should be considered:
Recommended tools for tackling all aforementioned curation and annotation tasks:
GaNDLF can be used to convert a Whole Slide Image (WSI), with or without a corresponding label map, to patches/tiles using GaNDLF's integrated patch miner, which needs the following files:

1. A configuration YAML with the following parameters:
   - scale: the scale at which operations such as tissue mask calculation happen; defaults to 16
   - patch_size: defines the size of the patches to extract; should be a tuple of integers (e.g., [256,256]) or a string containing the patch size in microns (e.g., [100m,100m])
   - num_patches: defines the number of patches to extract; use -1 to mine until exhaustion
   - value_map: mapping of RGB values in the label image to integer values for training; defaults to None
   - read_type: either random or sequential (the latter is more efficient); defaults to random
   - overlap_factor: the portion of patches that are allowed to overlap (0 to 1); defaults to 0.0
   - num_workers: the number of workers to use for patch extraction (note that this does not scale according to the number of threads available on your machine); defaults to 1
2. A CSV with the following headers:
   - SubjectID: the ID of the subject for the WSI
   - Channel_0: the WSI file
   - Label: (optional) the label map file

Once these files are present, the patch miner can be run using the following command:
# continue from previous shell
(venv_gandlf) $> python gandlf_patchMiner \
# -h, --help show help message and exit
-c ./exp_patchMiner/config.yaml \ # patch extraction configuration - needs to be a valid YAML (check syntax using https://yamlchecker.com/)
-i ./exp_patchMiner/input.csv \ # data in CSV format
-o ./exp_patchMiner/output_dir/ # output directory
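For reference, a minimal ./exp_patchMiner/config.yaml for the above run might look like the following sketch, built from the parameters and defaults listed above (value_map is omitted, since it defaults to None):

scale: 16 # tissue mask calculated at 1/16 of the full resolution
patch_size: [256, 256] # extract 256x256 pixel patches
num_patches: -1 # mine until exhaustion
read_type: random # 'sequential' is more efficient
overlap_factor: 0.0 # no overlap allowed between patches
num_workers: 1

A matching ./exp_patchMiner/input.csv (with illustrative paths) would then be:

SubjectID,Channel_0,Label
001,/full/path/001/wsi.tiff,/full/path/001/label_map.png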
Running preprocessing before training/inference is optional but recommended; it will significantly reduce the computational footprint during training/inference at the expense of larger storage requirements. To run preprocessing, use the following command, which will save the processed data in ./experiment_0/output_dir/ with a new data CSV and the corresponding model configuration:
# continue from previous shell
(venv_gandlf) $> python gandlf_preprocess \
# -h, --help show help message and exit
-c ./experiment_0/model.yaml \ # model configuration - needs to be a valid YAML (check syntax using https://yamlchecker.com/)
-i ./experiment_0/train.csv \ # data in CSV format
-o ./experiment_0/output_dir/ # output directory
This application can leverage multiple channels/modalities for training while using a multi-class segmentation file. The expected format is shown as an example in samples/sample_train.csv and needs to be structured with the following header format (which shows a CSV with N
subjects, each having X
channels/modalities that need to be processed):
SubjectID,Channel_0,Channel_1,...,Channel_X,Label
001,/full/path/001/0.nii.gz,/full/path/001/1.nii.gz,...,/full/path/001/X.nii.gz,/full/path/001/segmentation.nii.gz
002,/full/path/002/0.nii.gz,/full/path/002/1.nii.gz,...,/full/path/002/X.nii.gz,/full/path/002/segmentation.nii.gz
...
N,/full/path/N/0.nii.gz,/full/path/N/1.nii.gz,...,/full/path/N/X.nii.gz,/full/path/N/segmentation.nii.gz
Notes:

- Channel can be substituted with Modality or Image.
- Label can be substituted with Mask or Segmentation, and is used to specify the annotation file for segmentation models.
- For classification/regression, add a column called ValueToPredict. Currently, we support only a single value prediction per model.
- Only a single Label or ValueToPredict header should be passed.
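For example, a classification/regression CSV for the same subjects would replace the Label column with ValueToPredict (the paths and values shown are illustrative):

SubjectID,Channel_0,Channel_1,ValueToPredict
001,/full/path/001/0.nii.gz,/full/path/001/1.nii.gz,0
002,/full/path/002/0.nii.gz,/full/path/002/1.nii.gz,1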
To make the process of creating the CSV easier, we have provided a utility application called gandlf_constructCSV. This script works when the data is arranged in the following format (the example shows the data directory arrangement from the Brain Tumor Segmentation (BraTS) Challenge):
$DATA_DIRECTORY
│
└───Patient_001 # this is constructed from the ${PatientID} header of CSV
│ │ Patient_001_brain_t1.nii.gz
│ │ Patient_001_brain_t1ce.nii.gz
│ │ Patient_001_brain_t2.nii.gz
│ │ Patient_001_brain_flair.nii.gz
│ │ Patient_001_seg.nii.gz # optional for segmentation tasks
│
└───Patient_002 # this is constructed from the ${PatientID} header of CSV
│ │ Patient_002_brain_t1.nii.gz
│ │ Patient_002_brain_t1ce.nii.gz
│ │ Patient_002_brain_t2.nii.gz
│ │ Patient_002_brain_flair.nii.gz
│ │ Patient_002_seg.nii.gz # optional for segmentation tasks
│
└───JaneDoe # this is constructed from the ${PatientID} header of CSV
│ │ randomFileName_0_t1.nii.gz # the string identifier needs to be the same for each modality
│ │ randomFileName_1_t1ce.nii.gz
│ │ randomFileName_2_t2.nii.gz
│ │ randomFileName_3_flair.nii.gz
│ │ randomFileName_seg.nii.gz # optional for segmentation tasks
│
...
The following command shows how the script works:
# continue from previous shell
(venv_gandlf) $> python gandlf_constructCSV \
# -h, --help show help message and exit
-i $DATA_DIRECTORY \ # this is the main data directory
-c _t1.nii.gz,_t1ce.nii.gz,_t2.nii.gz,_flair.nii.gz \ # an example image identifier for 4 structural brain MR sequences for BraTS, and can be changed based on your data
-l _seg.nii.gz \ # an example label identifier - not needed for regression/classification, and can be changed based on your data
-o ./experiment_0/train_data.csv # output CSV to be used for training
Notes:

- For classification/regression, add a column called ValueToPredict. Currently, we support only a single value prediction per model.
- SubjectID or PatientName is used to ensure that the randomized split is done per-subject rather than per-image.

GaNDLF requires a YAML-based configuration that controls various aspects of the training/inference process. There are multiple samples for users to use as a baseline for further customization. A list of the available samples is presented as follows:
Notes:

- The gandlf_configGenerator script can be used to generate a grid of configurations for tuning the hyperparameters of a baseline configuration that works for your dataset and problem.
- Provide a baseline configuration which has been tested to work for at least 1 epoch for your dataset and problem at hand (regardless of the efficacy).

# continue from previous shell
(venv_gandlf) $> python gandlf_configGenerator \
# -h, --help show help message and exit
-c ./samples/config_all_options.yaml \ # baseline configuration
-s ./samples/config_generator_strategy.yaml \ # strategy file
-o ./all_experiments/ # output directory
For example, to generate 4 configurations (2 architectures × 2 learning rates) that leverage unet and resunet architectures for learning rates of [0.1,0.01], you can use the following strategy file:
model:
{
architecture: [unet, resunet],
}
learning_rate: [0.1, 0.01]
You can use the following code snippet to run GaNDLF:
# continue from previous shell
(venv_gandlf) $> python gandlf_run \
## -h, --help show help message and exit
## -v, --version Show program's version number and exit.
-c ./experiment_0/model.yaml \ # model configuration - needs to be a valid YAML (check syntax using https://yamlchecker.com/)
-i ./experiment_0/train.csv \ # data in CSV format
-m ./experiment_0/model_dir/ \ # model directory (i.e., the `modeldir`) where the output of the training will be stored, created if not present
-t True \ # True == train, False == inference
-d cuda # ensure CUDA_VISIBLE_DEVICES env variable is set for GPU device, use 'cpu' for CPU workloads
# -rt , --reset # [optional] completely resets the previous run by deleting `modeldir`
# -rm , --resume # [optional] resume previous training by only keeping model dict in `modeldir`
Notes:

- For radiology images, set the modality key in the configuration to rad. This will ensure the histology-specific pipelines are not triggered.
- For histology images, modality should be kept as histo.

GaNDLF enables relatively straightforward multi-GPU training. Simply set the CUDA_VISIBLE_DEVICES environment variable to the list of GPUs you want to use, and pass cuda as the device to the gandlf_run script. For example, if you want to use GPUs 0, 1, and 2, you would set CUDA_VISIBLE_DEVICES=0,1,2 and pass -d cuda to the gandlf_run script.
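For example, the training invocation from above becomes the following sketch (the flags are identical to the earlier gandlf_run example):

# continue from previous shell
(venv_gandlf) $> export CUDA_VISIBLE_DEVICES=0,1,2 # expose GPUs 0, 1, and 2
(venv_gandlf) $> python gandlf_run \
-c ./experiment_0/model.yaml \ # model configuration
-i ./experiment_0/train.csv \ # data in CSV format
-m ./experiment_0/model_dir/ \ # model directory
-t True \ # training mode
-d cuda # PyTorch will see the three GPUs listed above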
Distributed training is a more difficult problem to address, since there are multiple ways to configure a high-performance computing cluster (SLURM, OpenHPC, Kubernetes, and so on). Owing to this discrepancy, we have ensured that GaNDLF allows multiple training jobs to be submitted in a relatively straightforward manner using the command-line interface of each site's configuration. Simply populate the parallel_compute_command in the configuration with the specific command to run before the training job, and GaNDLF will use this string to submit the training job.
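For example, a site scheduling jobs through SLURM might populate the field as shown below; the exact submission command depends entirely on your cluster setup, so treat this value as an illustrative assumption:

parallel_compute_command: 'srun -p gpu --gres=gpu:1'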
Once your model is trained, you should see the following output:
# continue from previous shell
(venv_gandlf) $> ls ./experiment_0/model_dir/
data_${cohort_type}.csv # data CSV used for the different cohorts, which can be either training/validation/testing
data_${cohort_type}.pkl # same as above, but in pickle format
logs_${cohort_type}.csv # logs for the different cohorts that contain the various metrics, which can be either training/validation/testing
${architecture_name}_best.pth.tar # the best model in native PyTorch format
${architecture_name}_latest.pth.tar # the latest model in native PyTorch format
${architecture_name}_initial.pth.tar # the initial model in native PyTorch format
${architecture_name}_initial.{onnx/xml/bin} # [optional] if ${architecture_name} is supported, the graph-optimized best model in ONNX format
# other files dependent on if training/validation/testing output was enabled in configuration
Notes:

- The modeldir is used to save the model and outputs if a separate outputdir is not passed to gandlf_run.
- The outputs of inference are saved in outputdir or modeldir as a CSV file.

After the testing/validation training is finished, GaNDLF enables the collection of all the statistics from the final models for the testing and validation datasets and plots them. The gandlf_collectStats script can be used for plotting:
# continue from previous shell
(venv_gandlf) $> python gandlf_collectStats \
-m /path/to/trained/models \ # directory which contains testing and validation models
-o ./experiment_0/output_dir_stats/ # output directory to save stats and plot
The integration of the M3D-CAM library into GaNDLF enables the generation of attention maps for 3D/2D images in the validation epoch for classification and segmentation tasks. To activate M3D-CAM you just need to add the following parameter to the config:
medcam:
{
backend: "gcam",
layer: "auto"
}
You can choose one of the following backends:

- Grad-CAM (gcam)
- Guided Backpropagation (gbp)
- Guided Grad-CAM (ggcam)
- Grad-CAM++ (gcampp)

Optionally, one can also change the name of the layer for which the attention maps should be generated. The default behavior is auto, which chooses the last convolutional layer.
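For example, to target a specific layer instead of relying on auto, change the layer value in the config; the layer name below is purely illustrative and depends on your model architecture:

medcam:
{
  backend: "gcam",
  layer: "model.decoder.final_conv" # hypothetical layer name; keep "auto" if unsure
}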
All generated attention maps can be found in the experiment’s output directory. Link to the original repository: github.com/MECLabTUDA/M3d-Cam
If you have a model previously trained using GaNDLF that you wish to run graph optimizations on, you can use the gandlf_optimizeModel script to do so. The following command shows how it works:
# continue from previous shell
(venv_gandlf) $> python gandlf_optimizeModel \
-m /path/to/trained/${architecture_name}_best.pth.tar \ # the specific trained model file to optimize
-c ./experiment_0/config_used_to_train.yaml # the config file used to train the model
If ${architecture_name}
is supported, the optimized model will get generated in the model directory, with the name ${architecture_name}_optimized.onnx
.
GaNDLF provides the ability to deploy models into easy-to-share, easy-to-use formats – users of your model do not even need to install GaNDLF. Currently, Docker images are supported (which can be converted to Apptainer/Singularity format). These images meet the MLCube interface. This allows your algorithm to be used in a consistent manner with other machine learning tools.
The resulting image contains your specific version of GaNDLF (including any custom changes you have made) and your trained model and configuration. This ensures that upstream changes to GaNDLF will not break compatibility with your model.
To deploy a model, simply run the gandlf_deploy
command after training a model. You will need the Docker engine installed to build Docker images. This will create the image and, for MLCubes, generate an MLCube directory complete with an mlcube.yaml
specifications file, along with the workspace directory copied from a pre-existing template.
# continue from previous shell
(venv_gandlf) $> python gandlf_deploy \
## -h, --help show help message and exit
-c ./experiment_0/model.yaml \ # Configuration to bundle with the model (you can recover it with gandlf_recoverConfig first if needed)
-m ./experiment_0/model_dir/ \ # model directory (i.e., modeldir)
--target docker \ # the target platform (--help will show all available targets)
--mlcube-root ./my_new_mlcube_dir \ # Directory containing mlcube.yaml (used to configure your image base)
-o ./output_dir # Output directory where a new mlcube.yaml file to be distributed with your image will be created
The usage of GaNDLF remains generally the same even from Docker, but there are a few extra considerations.
Once you have pulled the GaNDLF image, it will have a tag, such as cbica/gandlf:latest-cpu
. Run the following command to list your images and ensure GaNDLF is present:
docker image ls
You can invoke docker run
with the appropriate tag to run GaNDLF:
docker run -it --rm --name gandlf cbica/gandlf:latest-cpu ${gandlf command and parameters go here!}
Remember that arguments/options for Docker itself go before the image tag, while the command and arguments for GaNDLF go after the image tag. For more details and options, see the Docker run documentation.
However, most commands that require files or directories as input or output will fail, because the container, by default, cannot read or write files on your machine for security reasons. To fix this, you need to mount specific locations in the filesystem.
The container is basically a filesystem of its own. To make your data available to the container, you will need to mount files and directories into it. Generally, it is useful to mount at least an input directory (as read-only) and an output directory. See the Docker bind mount instructions for more information.
For example, you might run:
docker run -it --rm --name gandlf --volume /home/researcher/gandlf_input:/input:ro --volume /home/researcher/gandlf_output:/output cbica/gandlf:latest-cpu [command and args go here]
Remember that the process running in the container only considers the filesystem inside the container, which is structured differently from that of your host machine. Therefore, you will need to give paths relative to the mount point destination. Additionally, any paths used internally by GaNDLF will refer to locations inside the container. This means that data CSVs produced by the gandlf_constructCSV
script will need to be made from the container and with input in the same locations. Expanding on our last example:
docker run -it --rm --name dataprep \
--volume /home/researcher/gandlf_input:/input:ro \ # input data is mounted as read-only
--volume /home/researcher/gandlf_output:/output \ # output data is mounted as read-write
cbica/gandlf:latest-cpu \ # change to appropriate docker image tag
gandlf_constructCSV \ # standard construct CSV API starts
--inputDir /input/data \
--outputFile /output/data.csv \
--channelsID _t1.nii.gz \
--labelID _seg.nii.gz
The previous command will generate a data CSV file that you can safely edit outside the container (such as by adding a ValueToPredict
column). Then, you can refer to the same file when running again:
docker run -it --rm --name training \
--volume /home/researcher/gandlf_input:/input:ro \ # input data is mounted as read-only
--volume /home/researcher/gandlf_output:/output \ # output data is mounted as read-write
cbica/gandlf:latest-cpu \ # change to appropriate docker image tag
gandlf_run --train True \ # standard training API starts
--config /input/config.yml \
--inputdata /output/data.csv \
--modeldir /output/model
Considering that you want to train on an existing model that is inside the GaNDLF container (such as in an MLCube container created by gandlf_deploy
), the output will be to a location embedded inside the container. Since you cannot mount something into that spot without overwriting the model, you can instead use the built-in docker cp
command to extract the model afterward. For example, you can fine-tune a model on your own data using the following commands as a starting point:
# Run training on your new data
docker run --name gandlf_training -v /my/input/data:/input mlcommons/gandlf-pretrained:0.0.1 gandlf_run -m /embedded_model/ [...] # Do not include "--rm" option!
# Copy the finetuned model out of the container, to a location on the host
docker cp gandlf_training:/embedded_model /home/researcher/extracted_model
# Now you can remove the container to clean up
docker rm -f gandlf_training
Some special arguments need to be passed to Docker to enable it to use your GPU. With Docker version > 19.03, you can use docker run --gpus all to expose all GPUs to the container. See the NVIDIA Docker documentation for more details.
If using CUDA, GaNDLF also expects the environment variable CUDA_VISIBLE_DEVICES
to be set. To use the same settings as your host machine, simply add -e CUDA_VISIBLE_DEVICES
to your docker run command. For example:
docker run --gpus all -e CUDA_VISIBLE_DEVICES -it --rm --name gandlf cbica/gandlf:latest-cuda113 gandlf_run --device cuda [...]
This can be replicated for AMD GPUs with ROCm by following the instructions to set up the ROCm Container Toolkit.
GaNDLF, and GaNDLF-created models, may be distributed as an MLCube. This involves distributing an mlcube.yaml
file. That file can be specified when using the MLCube runners. The runner will perform many aspects of configuring your container for you. Currently, only the mlcube_docker
runner is supported.
See the MLCube documentation for more details.
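For example, a hypothetical invocation using the mlcube_docker runner could look like the following; the task name and directory are illustrative assumptions, so consult the generated mlcube.yaml for the tasks your MLCube actually defines:

mlcube run --mlcube ./my_new_mlcube_dir --task infer --platform docker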