MNIST¶
The MNIST dataset is a collection of 60,000 handwritten digits widely used for training statistical, Machine Learning (ML) and Deep Learning (DL) models. The MNIST MLCube example demonstrates how data scientists, ML and DL researchers and developers can distribute their ML projects (including training, validation and inference code) as MLCube cubes. MLCube establishes a standard for packaging user workloads and provides a unified command line interface. In addition, MLCube provides a number of reference runners - Python packages that can run cubes on different platforms, including Docker and Singularity.
A data scientist has been working on a machine learning project. The goal is to train a simple neural network to classify a collection of 60,000 small images into 10 classes.
MNIST training code¶
Training an ML model is a process involving multiple steps such as getting data, analyzing and cleaning data,
splitting it into train/validation/test data sets, running hyper-parameter optimization experiments and performing final
model testing. MNIST is a relatively small and well studied dataset that provides a standard train/test split. In this simple
example a developer needs to implement two steps - (1) downloading data and (2) training a model. We'll call these steps
tasks. Each task requires several parameters, such as the URL of the data set that we need to download, the location on a
local disk where the data set will be serialized, and the path to a directory that will contain training artifacts such as log
files, training snapshots and ML models. We can characterize these two tasks in the following way:
- Data Download task:
  - Inputs: None. We'll assume the download URL is defined in the source code.
  - Outputs: Directory to serialize the data set (data_dir) and directory to serialize log files (log_dir).
- Training task:
  - Inputs: Directory with the MNIST data set (data_dir) and training hyper-parameters defined in a file (parameters_file).
  - Outputs: Directory to store training results (model_dir) and directory to store log files (log_dir).
We have intentionally made all input/output parameters file system artifacts. By doing so, we support
reproducibility: instead of command line arguments that can easily be lost, we store them in files. There are many
different ways to implement the MNIST example. For simplicity, we assume the following:
- We use one Python file.
- The task name (download, train) is a positional command line parameter.
- Both tasks write logs, so it makes sense to add a parameter accepting a directory for log files.
- The download task accepts an additional data directory parameter.
- The train task accepts such parameters as data and model directories and a path to a file with hyper-parameters.
- Configurable hyper-parameters are: (1) optimizer name, (2) number of training epochs and (3) global batch size.
Our implementation could then look like this: parse the command line and identify the task. If it is download, call a
function that downloads the data set. If it is train, train a model. This is a single-entrypoint implementation
where we run one script and ask it to perform various tasks. We run our script (mnist.py) in the following way:
python mnist.py download --data_dir=PATH --log_dir=PATH
python mnist.py train --data_dir=PATH --log_dir=PATH --model_dir=PATH --parameters_file=PATH
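The training script itself is not shown in this document, so the following is a minimal sketch of what such a single-entrypoint mnist.py could look like. It is an illustration rather than the example's actual code: it assumes TensorFlow/Keras, NumPy and PyYAML are available (they would be listed in requirements.txt), uses the standard Keras MNIST archive URL, and trains a tiny fully-connected classifier:
# mnist.py - hypothetical single-entrypoint sketch, not the example's actual code.
# Assumes TensorFlow/Keras, NumPy and PyYAML (listed in requirements.txt).
import argparse
import logging
import os

import numpy as np
import tensorflow as tf
import yaml


def setup_logging(log_dir: str, task: str) -> None:
    """Write a per-task log file into log_dir."""
    os.makedirs(log_dir, exist_ok=True)
    logging.basicConfig(filename=os.path.join(log_dir, f"mlcube_mnist_{task}.log"), level=logging.INFO)


def download(data_dir: str) -> None:
    """Download the MNIST data set into data_dir (the URL is fixed in the source code)."""
    os.makedirs(data_dir, exist_ok=True)
    path = tf.keras.utils.get_file(
        fname=os.path.join(os.path.abspath(data_dir), "mnist.npz"),
        origin="https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz",
    )
    logging.info("MNIST data set saved to %s", path)


def train(data_dir: str, parameters_file: str, model_dir: str) -> None:
    """Train a small classifier using hyper-parameters loaded from parameters_file."""
    with open(parameters_file) as stream:
        params = yaml.safe_load(stream)  # optimizer, train_epochs, batch_size

    with np.load(os.path.join(data_dir, "mnist.npz")) as data:
        x_train, y_train = data["x_train"] / 255.0, data["y_train"]

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer=params.get("optimizer", "adam"),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(x_train, y_train,
              batch_size=params.get("batch_size", 32),
              epochs=params.get("train_epochs", 5))

    os.makedirs(model_dir, exist_ok=True)
    model.save(os.path.join(model_dir, "mnist_model.keras"))
    logging.info("Model saved to %s", model_dir)


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("task", choices=["download", "train"])
    parser.add_argument("--data_dir", required=True)
    parser.add_argument("--log_dir", required=True)
    parser.add_argument("--model_dir")
    parser.add_argument("--parameters_file")
    args = parser.parse_args()

    setup_logging(args.log_dir, args.task)
    if args.task == "download":
        download(args.data_dir)
    else:
        train(args.data_dir, args.parameters_file, args.model_dir)


if __name__ == "__main__":
    main()
The command lines shown above map directly onto the positional task argument and the --data_dir, --log_dir, --model_dir and --parameters_file flags parsed in this sketch.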
MLCube implementation¶
Packaging our MNIST training script as an MLCube is done in several steps. We will be using a directory-based
cube, where a directory is structured in a certain way and contains specific files that make it MLCube compliant.
We need to create an empty directory on a local disk. Let's assume we call it mnist, and we'll use
{MLCUBE_ROOT} to denote the full path to this directory. This is called the cube root directory. At this point the
directory is empty:
mnist/
Build location¶
The cube directory has a sub-directory called build ({MLCUBE_ROOT}/build) that stores project source files,
resources required for training, and other files needed to recreate the run time environment (such as requirements.txt, Docker and Singularity
recipes, etc.). We need to create the build directory and copy two files into it: mnist.py, which implements training, and
requirements.txt, which lists Python dependencies. By doing so, we are enforcing reproducibility. A developer of this
cube wants to make it easy to run their training workload in a great variety of environments, including universities,
commercial companies, and HPC-friendly organizations such as national labs. One way to achieve this is to use a container
runtime such as Docker or Singularity. So, we'll provide both a Docker file and a Singularity recipe and put them into the
build directory as well, making this directory a build context. The cube directory now looks like:
mnist/
  build/
    mnist.py
    requirements.txt
    Dockerfile
    Singularity.recipe
MLCube definition file¶
At this point we are ready to create a cube definition file - the file that makes a folder an
MLCube folder. It is a YAML file named mlcube.yaml, located in the cube root directory, that provides information
such as the cube's name, author and version. The most important section is the one that lists the tasks implemented in
this cube:
schema_version: 1.0.0 # We use MLSpec library to validate YAML definition files. This is the
schema_type: mlcube_root # specification of the schema that this file must be consistent with.
name: mnist # Name of this cube.
author: MLPerf Best Practices Working Group # A developer of the cube.
version: 0.1.0 # MLCube version.
mlcube_spec_version: 0.1.0                  # Version of the MLCube specification this cube conforms to.
tasks: # Tasks are defined in external YAML files located in tasks folder.
- 'tasks/download.yaml' # "Download data set" task definition file.
- 'tasks/train.yaml' # "Training a model" task definition file.
mnist/
  build/       {mnist.py, requirements.txt, Dockerfile, Singularity.recipe}
  mlcube.yaml
Task definition file¶
The cube definition file references two tasks defined in the tasks subdirectory. Each YAML file there defines one
task supported by the cube. Task files are named after the tasks they define. We need to create a tasks directory and two files
inside that directory - download.yaml and train.yaml.
Each task file defines the input and output specification of its task. The download task (download.yaml) is defined as follows:
schema_version: 1.0.0        # Task schema definition. Leave these two fields as is.
schema_type: mlcube_task
inputs: []                   # Since this task does not have any inputs, the section is empty.
outputs:                     # This task produces two artifacts - downloaded data and log files.
  - name: data_dir           # This parameter accepts a path to a directory where the data set will be serialized.
    type: directory          # We explicitly specify that this is a directory.
  - name: log_dir            # This parameter accepts a path to a directory for the log files this task writes.
    type: directory          # We explicitly specify that this is a directory.
These parameters correspond to the command line introduced above:
python mnist.py download --data_dir=PATH --log_dir=PATH
The train task (train.yaml) is defined in the following way:
schema_version: 1.0.0        # Task schema definition. Leave these two fields as is.
schema_type: mlcube_task
inputs:                      # These are the task inputs.
  - name: data_dir           # This parameter accepts a path to the directory with the MNIST data set.
    type: directory          # We explicitly specify that this is a directory.
  - name: parameters_file    # A file containing training hyper-parameters.
    type: file               # This is a file.
outputs:                     # These are the task outputs.
  - name: log_dir            # This parameter accepts a path to a directory for the log files this task writes.
    type: directory          # We explicitly specify that this is a directory.
  - name: model_dir          # Path to a directory where training artifacts are stored.
    type: directory          # We explicitly specify that this is a directory.
These parameters correspond to the command line introduced above:
python mnist.py train --data_dir=PATH --log_dir=PATH --model_dir=PATH --parameters_file=PATH
mnist/
  build/       {mnist.py, requirements.txt, Dockerfile, Singularity.recipe}
  tasks/       {download.yaml, train.yaml}
  mlcube.yaml
Workspace¶
The workspace is a directory inside the cube (workspace) where, by default, input/output file system artifacts are
stored. There are multiple reasons to have one. One is to formally have a default place for data sets, configuration
files, log files, etc. Having all these artifacts in one place makes it simpler to run cubes on remote hosts and then
sync results back to users' local machines.
We need to be able to provide a collection of hyper-parameters and to formally define a directory to store logs, models and
the MNIST data set. To do so, we create the directory tree workspace/parameters, and then create a file
(default.parameters.yaml) with the following content:
optimizer: "adam"
train_epochs: 5
batch_size: 32
mnist/
  build/       {mnist.py, requirements.txt, Dockerfile, Singularity.recipe}
  tasks/       {download.yaml, train.yaml}
  workspace/
    parameters/
      default.parameters.yaml
  mlcube.yaml
Run configurations¶
The MLCube definition file (mlcube.yaml) provides paths to task definition files that formally define the tasks'
input/output parameters. A run configuration assigns values to these task parameters. One reason to define and "implement"
parameters in different files is to be able to provide multiple configurations for the same task - for example,
a one-GPU training configuration and an 8-GPU training configuration. Since we have two tasks - download and train - we need
to define at least two run configurations. Run configurations are defined in the run subdirectory.
The run configuration for the download task looks like:
schema_type: mlcube_invoke   # Run (invoke) schema definition. Leave these two fields as is.
schema_version: 1.0.0
task_name: download          # Task name.
input_binding: {}            # No input parameters for this task.
output_binding:              # Output parameters, format is "parameter: value".
  data_dir: $WORKSPACE/data              # Path to serialize the downloaded MNIST data set.
  log_dir: $WORKSPACE/download_logs      # Path to log files.
The $WORKSPACE token is replaced with the actual path to the cube workspace. File system paths are not required to be relative to the
workspace directory; this makes it possible to provide absolute paths for cases when data sets are stored on shared
drives. The run configuration for the train task looks like (a sketch of how such bindings could be expanded follows the configuration):
schema_type: mlcube_invoke   # Run (invoke) schema definition. Leave these two fields as is.
schema_version: 1.0.0
task_name: train             # Task name.
input_binding:               # Input parameters (name: value).
  data_dir: $WORKSPACE/data
  parameters_file: $WORKSPACE/parameters/default.parameters.yaml
output_binding:              # Output parameters (name: value).
  log_dir: $WORKSPACE/train_logs
  model_dir: $WORKSPACE/model
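To make the binding mechanism concrete, here is a small hypothetical sketch (not MLCube's actual runner code) of how a runner could load such a run configuration and expand the $WORKSPACE token into absolute host paths:
# expand_bindings.py - hypothetical illustration only, not MLCube's actual implementation.
import os
import yaml


def bind_parameters(run_config_path: str, workspace_dir: str) -> dict:
    """Load a run configuration and replace $WORKSPACE with the real workspace path."""
    with open(run_config_path) as stream:
        config = yaml.safe_load(stream)

    workspace_dir = os.path.abspath(workspace_dir)
    bindings = {}
    for section in ("input_binding", "output_binding"):
        for name, value in (config.get(section) or {}).items():
            bindings[name] = value.replace("$WORKSPACE", workspace_dir)
    return bindings


if __name__ == "__main__":
    # For run/train.yaml this would produce something like:
    #   {'data_dir': '/.../workspace/data',
    #    'parameters_file': '/.../workspace/parameters/default.parameters.yaml',
    #    'log_dir': '/.../workspace/train_logs',
    #    'model_dir': '/.../workspace/model'}
    print(bind_parameters("run/train.yaml", "workspace"))
Conceptually, a runner would then make the resulting paths available to the container (for example, by mounting them) and pass them to mnist.py as the command line arguments shown earlier.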
mnist/
  build/       {mnist.py, requirements.txt, Dockerfile, Singularity.recipe}
  tasks/       {download.yaml, train.yaml}
  workspace/parameters/default.parameters.yaml
  run/
    download.yaml
    train.yaml
  mlcube.yaml
Platform configurations¶
Platform configurations define how MLCube cubes run. Docker, Singularity, SSH and cloud runners each have their own
configuration. For instance, a Docker platform configuration at minimum provides the image name and the docker executable
(docker / nvidia-docker). An SSH platform configuration could provide the IP address of a remote host, login credentials, etc.
Platform configurations are meant to be used by runners, and each runner has its own platform schema. The Runners
documentation section provides a detailed description of the reference runners together with their platform configuration schemas.
Since we want to support the Docker and Singularity runtimes, we provide docker.yaml and singularity.yaml files in
the platforms subdirectory, which is the default location for these types of files. The Docker platform configuration is
the following:
schema_version: 1.0.0
schema_type: mlcube_docker
image: mlcommons/mlcube:mnist # Docker image name
docker_runtime: docker # Docker executable: docker or nvidia-docker
The Singularity platform configuration is the following:
schema_version: 1.0.0
schema_type: mlcube_singularity
image: /opt/singularity/mlcommons_mlcube_mnist-0.01.simg # Path to or name of a Singularity image.
mnist/
  build/       {mnist.py, requirements.txt, Dockerfile, Singularity.recipe}
  tasks/       {download.yaml, train.yaml}
  workspace/parameters/default.parameters.yaml
  run/         {download.yaml, train.yaml}
  platforms/
    docker.yaml
    singularity.yaml
  mlcube.yaml
MNIST MLCube directory structure summary¶
mnist/                             # MLCube root directory.
  build/                           # Project source code, resource files, Docker/Singularity recipes.
    mnist.py                       # Python source code training a simple neural network on the MNIST data set.
    requirements.txt               # Python project dependencies.
    Dockerfile                     # Docker recipe.
    Singularity.recipe             # Singularity recipe.
  tasks/                           # Task definition files - define the functionality that the MLCube supports.
    download.yaml                  # Download the MNIST data set.
    train.yaml                     # Train the neural network.
  workspace/                       # Default location for data sets, logs, models, parameter files.
    parameters/                    # Model hyper-parameters (they can be stored at any location).
      default.parameters.yaml      # The parameter file used in this implementation.
  run/                             # Run configurations - bind task parameters to values.
    download.yaml                  # Concrete run specification for the download task.
    train.yaml                     # Concrete run specification for the train task.
  platforms/                       # Platform definition files - define how the MLCube runs.
    docker.yaml                    # Docker runtime definition.
    singularity.yaml               # Singularity runtime definition.
  mlcube.yaml                      # MLCube definition file.
Running MNIST MLCube¶
We need to set up the Python virtual environment. These are the steps outlined in the Introduction section, except that we do
not clone the GitHub repository with the example MLCube cubes.
# Create Python Virtual Environment
virtualenv -p python3 ./env && source ./env/bin/activate
# Install MLCube Docker and Singularity runners
pip install mlcube-docker mlcube-singularity
# Optionally, setup host environment by providing the correct `http_proxy` and `https_proxy` environmental variables.
# export http_proxy=...
# export https_proxy=..
Before running the MNIST cube below, it is probably a good idea to remove task outputs from previous runs that are located in the
workspace directory. All directories except parameters can be removed.
Docker Runner¶
Configure the MNIST cube (this is an optional step: the Docker runner checks whether the image exists and, if it does not, runs the configure
phase automatically):
mlcube_docker configure --mlcube=. --platform=platforms/docker.yaml
Run the two tasks - download (download data) and train (train a tiny neural network):
mlcube_docker run --mlcube=. --platform=platforms/docker.yaml --task=run/download.yaml
mlcube_docker run --mlcube=. --platform=platforms/docker.yaml --task=run/train.yaml
Singularity Runner¶
Update the path used to store the Singularity image. Open platforms/singularity.yaml and update the image value,
which is set by default to /opt/singularity/mlcommons_mlcube_mnist-0.01.simg (relative paths are supported; they are
relative to workspace).
Configure the MNIST cube:
mlcube_singularity configure --mlcube=. --platform=platforms/singularity.yaml
Run the two tasks - download (download data) and train (train a tiny neural network):
mlcube_singularity run --mlcube=. --platform=platforms/singularity.yaml --task=run/download.yaml
mlcube_singularity run --mlcube=. --platform=platforms/singularity.yaml --task=run/train.yaml