Improved training performance

MALA offers several options to make training feasible at large scales in reasonable amounts of time. The general training workflow is the same as outlined in the basic section; the advanced features can be activated by setting the appropriate parameters in the Parameters object.

Using a GPU

The simplest way to accelerate any NN training is to use a GPU. Training NNs on GPUs is a well-established industry practice and yields large speedups. MALA supports GPU training; you have to activate GPU usage via

parameters = mala.Parameters()
parameters.use_gpu = True

Afterwards, the entire training will be performed on the GPU, provided one is available.

In cooperation with Nvidia, advanced GPU performance optimizations have been implemented in MALA. Namely, you can enable the following options:

parameters.use_gpu = False # True: Use GPU
"""
Multiple workers allow for faster data processing, but require
additional CPU/RAM power. A good setup is e.g. using 4 CPUs attached
to one GPU and setting the num_workers to 4.
"""
parameters.running.num_workers = 0 # set to e.g. 4
"""
MALA supports a faster implementation of the TensorDataSet class
from the torch library. Turning it on will drastically improve
performance.
"""
parameters.data.use_fast_tensor_data_set = False # True: Faster data loading
"""
Likewise, using CUDA graphs improves performance by optimizing GPU
usage. Be careful: this option is only available from CUDA 11.0 onwards.
CUDA graphs will be most effective in cases that are latency-limited,
e.g. small models with shorter epoch times.
"""
parameters.running.use_graphs = False # True: Better GPU utilization
"""
Using mixed precision can also improve performance, but only if
the model is large enough.
"""
parameters.running.use_mixed_precision = False # True: Improved performance for large models

These options address different performance bottlenecks in the training and have been shown to speed up training by up to a factor of 5 when used together, while maintaining the same prediction accuracy.

Currently, these options are disabled by default, as they are still being tested extensively by the MALA team in production. Nonetheless, activating them is highly recommended!
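
Taken together, a configuration enabling all of these optimizations might look like the following sketch; the number of workers is only an example and should match the CPUs available per GPU.

import mala

parameters = mala.Parameters()
parameters.use_gpu = True

# Data loading workers; a common choice is one worker per CPU core
# attached to the GPU, e.g. 4.
parameters.running.num_workers = 4
# Faster drop-in replacement for torch's TensorDataset.
parameters.data.use_fast_tensor_data_set = True
# CUDA graphs; requires CUDA 11.0 or newer.
parameters.running.use_graphs = True
# Mixed precision; mainly beneficial for sufficiently large models.
parameters.running.use_mixed_precision = True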

Advanced training metrics

When monitoring an NN training, one often checks the validation loss, which MALA outputs directly. By default, this validation loss is the mean squared error between the predicted and actual LDOS. From a purely ML point of view this is fine; however, the LDOS error itself does not carry much direct physical meaning. Thus, MALA implements physical validation metrics that can be accessed before and after the training routine.

Specifically, when setting

parameters.running.after_training_metric = "band_energy"

the error in the band energy between the actual and predicted LDOS will be calculated and printed before and after network training (in meV/atom). This is a much more intuitive metric for judging network performance, since it is easily physically interpretable. If the error is, e.g., 5 meV/atom, one can expect the network to be reasonably accurate; values of, e.g., 100 meV/atom hint at bad model performance. Of course, the final metric for the accuracy should always be the results of the Tester class.

Please make sure to set the relevant LDOS parameters when using this property via

parameters.targets.ldos_gridsize = 11
parameters.targets.ldos_gridspacing_ev = 2.5
parameters.targets.ldos_gridoffset_ev = -5
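
As a minimal sketch, assuming the network and data handler have been set up as in the basic training workflow, the metric is then reported around the training call:

parameters.running.after_training_metric = "band_energy"

# network and data_handler are assumed to exist as in the basic workflow;
# the band energy error (in meV/atom) is printed before and after training.
trainer = mala.Trainer(parameters, network, data_handler)
trainer.train_network()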

Checkpointing a training run

NN training can take a long time, and on the HPC systems where it is usually performed, calculations are subject to time limits. Thus, it is often necessary to checkpoint a training run and resume it at a later point. MALA provides functionality for this, as shown in the example advanced/ex01_checkpoint_training.py. To use checkpointing, enable the feature in the Parameters object:

parameters.running.checkpoints_each_epoch = 5
parameters.running.checkpoint_name = "ex01_checkpoint"

Simply set an interval for checkpointing and a name for the checkpoint, and the training will automatically be checkpointed. Automatic resumption from a checkpoint can trivially be implemented via

if mala.Trainer.run_exists("ex01_checkpoint"):
    parameters, network, datahandler, trainer = \
        mala.Trainer.load_run("ex01_checkpoint")
else:
    parameters, network, datahandler, trainer = initial_setup()

Here, initial_setup() encapsulates the training run setup.
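
For reference, a minimal sketch of such an initial_setup() function is given below, assuming the standard MALA training workflow from the basic section; the snapshot file names, data_path, and hyperparameter values are placeholders.

def initial_setup():
    parameters = mala.Parameters()
    parameters.running.max_number_epochs = 100
    parameters.running.checkpoints_each_epoch = 5
    parameters.running.checkpoint_name = "ex01_checkpoint"

    # data_path is a placeholder for the directory holding the snapshots.
    data_handler = mala.DataHandler(parameters)
    data_handler.add_snapshot("Be_snapshot0.in.npy", data_path,
                              "Be_snapshot0.out.npy", data_path, "tr")
    data_handler.add_snapshot("Be_snapshot1.in.npy", data_path,
                              "Be_snapshot1.out.npy", data_path, "va")
    data_handler.prepare_data()

    parameters.network.layer_sizes = [data_handler.input_dimension,
                                      100,
                                      data_handler.output_dimension]
    network = mala.Network(parameters)
    trainer = mala.Trainer(parameters, network, data_handler)
    return parameters, network, data_handler, trainer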

Using lazy loading

Lazy loading was already briefly mentioned during the testing of a network. To recap, the idea of lazy loading is to load data into memory incrementally, so as to save RAM in cases where large amounts of data are involved. To use lazy loading, enable it via:

parameters.data.use_lazy_loading = True

MALA lazy loading operates snapshot-wise: if lazy loading is enabled, one snapshot at a time is loaded into memory, processed, and unloaded, before the next one is selected. Thus, lazy loading will adversely affect performance. One way to mitigate this is to use multiple CPUs to load and prepare data, i.e., while one CPU is busy processing data or offloading it to the GPU, another CPU can already load the next snapshot into memory. To use this so-called “prefetching” feature, enable the corresponding parameter via

parameters.data.use_lazy_loading_prefetch = True

Please note that in order to use this feature, you have to assign enough CPUs and memory to your calculation.
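
As a small sketch, a lazy-loading configuration combining these options could look as follows; whether additional data loading workers (see the GPU section above) help depends on the CPUs available to the job.

parameters.data.use_lazy_loading = True
parameters.data.use_lazy_loading_prefetch = True
# Illustrative value; prefetching relies on spare CPU capacity.
parameters.running.num_workers = 4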

Apart from performance, there is an accuracy drawback when employing lazy loading. It is well known that ML algorithms perform best when individual training data points are accessed in random order. This, however, is not naively possible when using lazy loading: since the data is never loaded into memory completely at any one point, it cannot easily be randomized. This can impact accuracy very negatively for complicated data sets, as briefly discussed in the MALA publication on temperature transferability of ML-DFT models.

To circumvent this problem, MALA provides functionality to shuffle data from multiple atomic snapshots into snapshot-like files, which can then be used with lazy loading, guaranteeing randomized access to individual data points. Currently, this method requires additional disk space, since the randomized data sets have to be saved; in-memory implementations are currently being developed. To use the data shuffling (also shown in example advanced/ex02_shuffle_data.py), you can use the DataShuffler class.

The syntax is straightforward: you create a DataShuffler object, which provides the same add_snapshot functionality as the DataHandler object, and shuffle the data once you have added all snapshots in question, i.e.,

parameters.data.shuffling_seed = 1234

data_shuffler = mala.DataShuffler(parameters)
data_shuffler.add_snapshot("Be_snapshot0.in.npy", data_path,
                           "Be_snapshot0.out.npy", data_path)
data_shuffler.add_snapshot("Be_snapshot1.in.npy", data_path,
                           "Be_snapshot1.out.npy", data_path)
data_shuffler.shuffle_snapshots(complete_save_path="../",
                                save_name="Be_shuffled*")

The seed parameters.data.shuffling_seed ensures reproducibility of the shuffled data sets. The shuffle_snapshots function handles paths in the same way as the DataConverter class. Further, via the number_of_shuffled_snapshots keyword, you can fine-tune the number of new snapshots being created, as shown below. By default, the same number of snapshots as had been provided will be created (if possible).
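
For example, continuing the snippet above, the number of shuffled files can be set explicitly; the value used here is purely illustrative and has to be compatible with the available grid points.

# Write the shuffled data into four snapshot-like files instead of the
# default of one output file per input snapshot.
data_shuffler.shuffle_snapshots(complete_save_path="../",
                                save_name="Be_shuffled*",
                                number_of_shuffled_snapshots=4)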

Logging metrics during training

Training progress in MALA can be visualized via tensorboard or wandb, as also shown in the file advanced/ex03_tensor_board. Simply select a logger prior to training via

parameters.running.logger = "tensorboard"
parameters.running.logging_dir = "mala_vis"

or

import wandb
wandb.init(
      project="mala_training",
      entity="your_wandb_entity"
)
parameters.running.logger = "wandb"
parameters.running.logging_dir = "mala_vis"

where logging_dir specifies some directory in which to save the MALA logging data. You can also select which metrics to record via

parameters.validation_metrics = ["ldos", "dos", "density", "total_energy"]

The full list of available metrics is:

  • “ldos”: MSE of the LDOS.
  • “band_energy”: Band energy.
  • “band_energy_actual_fe”: Band energy computed with the ground truth Fermi energy.
  • “total_energy”: Total energy.
  • “total_energy_actual_fe”: Total energy computed with the ground truth Fermi energy.
  • “fermi_energy”: Fermi energy.
  • “density”: Electron density.
  • “density_relative”: Electron density (Mean Absolute Percentage Error).
  • “dos”: Density of states.
  • “dos_relative”: Density of states (Mean Absolute Percentage Error).

To save time and resources, you can specify how often these validation metrics are computed via

parameters.running.validate_every_n_epochs = 10

If you want to monitor the degree to which the model overfits to the training data, you can use the option

parameters.running.validate_on_training_data = True

MALA will then evaluate the validation metrics on the training set as well as the validation set.

Afterwards, you can run the training without any other modifications. Once training is finished (or during training, in case you want to use tensorboard to monitor progress), you can launch tensorboard via

tensorboard --logdir path_to_log_directory

The full path for path_to_log_directory can be accessed via trainer.full_logging_path.
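
For instance, it can simply be printed after training and passed to the tensorboard command shown above:

# Directory containing the logging data for this training run.
print(trainer.full_logging_path)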

If you’re using wandb, you can monitor the training progress on the wandb website.

Training in parallel

If large models or large data sets are employed, training may be slow even if a GPU is used. In this case, multiple GPUs can be employed with MALA using the DistributedDataParallel (DDP) formalism of the torch library. To use DDP, make sure you have NCCL installed on your system.

To activate and use DDP in MALA, almost no modification of your training script is necessary. Simply activate DDP in your Parameters object. Make sure to also enable GPU, since parallel training is currently only supported on GPUs.

parameters = mala.Parameters()
parameters.use_gpu = True
parameters.use_ddp = True

MALA is now set up for parallel training. DDP works across multiple compute nodes on HPC infrastructure as well as on a single machine hosting multiple GPUs. While essentially no modification of the Python script itself is necessary, the way the script is called may have to be adjusted to ensure that DDP has all the information it needs for inter- and intra-node communication. This setup may differ across machines/clusters. During testing, the following setup was confirmed to work on an HPC cluster using the slurm scheduler.

#SBATCH --nodes=NUMBER_OF_NODES
#SBATCH --ntasks-per-node=NUMBER_OF_TASKS_PER_NODE
#SBATCH --gres=gpu:NUMBER_OF_TASKS_PER_NODE
# Add more arguments as needed
...

# Load more modules as needed
...

# This port can be arbitrarily chosen.
# Given here is the torchrun default
export MASTER_PORT=29500

# Find out the host node.
echo "NODELIST="${SLURM_NODELIST}
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
echo "MASTER_ADDR="$MASTER_ADDR

# Run using srun.
srun -N NUMBER_OF_NODES -u bash -c '
# Export additional per process variables
export RANK=$SLURM_PROCID
export LOCAL_RANK=$SLURM_LOCALID
export WORLD_SIZE=$SLURM_NTASKS

python3 -u  training.py
'

An overview of the environment variables to be set, as well as a general tutorial on DDP itself, can be found in the official PyTorch documentation.