Training an ML-DFT model
For a first glimpse into MALA, let’s assume you already have input data (i.e., bispectrum descriptors encoding the atomic structure on a real space grid) and output data (the LDOS representing the electronic structure on the real space grid) for a system. For now, we will be using the example data provided in the test data repository.
You will learn how to create your own data sets in the following section.
This guide follows the example files `basic/ex01_train_network.py` and `basic/ex02_test_network.py`.
Setting parameters
The central object of MALA workflows is the `mala.Parameters()` object.
All options necessary to control a MALA workflow are accessible through
this object. To this end, it contains several sub-objects for the various
tasks of a workflow, which we will get familiar with in this tutorial.
MALA provides reasonable defaults for many parameters of interest. For a full list, please refer to the API reference; for now, we select a few options to train a simple network with the example data, namely
```python
import mala

parameters = mala.Parameters()
parameters.data.input_rescaling_type = "feature-wise-standard"
parameters.data.output_rescaling_type = "normal"
parameters.network.layer_activations = ["ReLU"]
parameters.running.max_number_epochs = 100
parameters.running.mini_batch_size = 40
parameters.running.learning_rate = 0.00001
parameters.running.optimizer = "Adam"
parameters.verbosity = 1  # level of output; 1 is standard, 0 is low, 2 is debug
```
Here, we can see that the `Parameters` object contains multiple
sub-objects dealing with the individual aspects of the workflow. In the first
two lines, we specify which data scaling MALA should employ. Scaling data greatly
improves the performance of NN-based ML models. Options are
- `None`: No normalization is applied.
- `standard`: Standardization (scale to mean 0, standard deviation 1)
- `normal`: Min-max scaling (scale to be in the range 0…1)
- `feature-wise-standard`: Row standardization (scale to mean 0, standard deviation 1)
- `feature-wise-normal`: Row min-max scaling (scale to be in the range 0…1)
Here, we specify that MALA should standardize the input (=descriptors) by feature (i.e., each entry of the vector separately on the grid) and normalize the entire LDOS.
The third line tells MALA which activation function to use, and the remaining lines specify the training routine.
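To make the two scaling modes chosen above more concrete, here is a small, purely illustrative numpy sketch (not MALA code) of what they compute, for a toy data matrix with grid points as rows and features as columns:

```python
import numpy as np

# Illustrative toy data (shapes made up): 1000 grid points,
# 91 descriptor entries per point.
data = np.random.rand(1000, 91)

# "feature-wise-standard": scale each feature (column) to mean 0, std 1.
feature_wise_standard = (data - data.mean(axis=0)) / data.std(axis=0)

# "normal": min-max scale the entire array to the range 0...1.
normal = (data - data.min()) / (data.max() - data.min())
```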
For now, we will assume these values to be correct. Of course, for new data sets, optimal values have to be determined via hyperparameter optimization. Finally, it is useful to also save some information on how the LDOS and bispectrum descriptors were calculated into the parameters object; this helps at inference time, when this information is required. You will learn what these values mean in the data generation part of this guide; for now, we use values consistent with the example data.
```python
parameters.targets.target_type = "LDOS"
parameters.targets.ldos_gridsize = 11
parameters.targets.ldos_gridspacing_ev = 2.5
parameters.targets.ldos_gridoffset_ev = -5
parameters.descriptors.descriptor_type = "Bispectrum"
parameters.descriptors.bispectrum_twojmax = 10
parameters.descriptors.bispectrum_cutoff = 4.67637
```
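While the full meaning of these values is covered later, a hedged reading of the three LDOS grid parameters is that they define the energy grid on which the LDOS is sampled; under that reading, the grid implied above could be reconstructed as:

```python
import numpy as np

# Assumed interpretation: 11 energy points starting at -5 eV, spaced
# 2.5 eV apart, i.e., an LDOS energy grid ranging from -5 eV to 20 eV.
ldos_energies = -5 + 2.5 * np.arange(11)
print(ldos_energies)  # [-5.  -2.5  0.   2.5  ...  17.5  20. ]
```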
Adding training data
As with any ML library, MALA is a data-driven framework, so before we can
train a model, we need to add data. The central object for managing data in any
MALA workflow is the `DataHandler` class.
MALA manages data "per snapshot". One snapshot is one atomic configuration
for which volumetric input and output data have been calculated. Data has to
be added to the `DataHandler` object per snapshot, pointing to the locations
where the volumetric data files are saved on disk. This is done via
```python
# data_path points to the directory containing the example data files.
data_handler = mala.DataHandler(parameters)
data_handler.add_snapshot(
    "Be_snapshot0.in.npy", data_path, "Be_snapshot0.out.npy", data_path, "tr"
)
data_handler.add_snapshot(
    "Be_snapshot1.in.npy", data_path, "Be_snapshot1.out.npy", data_path, "va"
)
```
The "tr"
and "va"
flag signal that the respective snapshots are added as
training and validation data, respectively. Training data is data the model
is directly tuned on; validation data is data used to verify the model
performance during the run time and make sure that no overfitting occurs.
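For larger data sets, snapshots are typically added in a loop. A hypothetical sketch, assuming files numbered consecutively following the same naming pattern:

```python
# Hypothetical larger data set: the first 8 snapshots are used for
# training, the last 2 for validation.
for i in range(10):
    flag = "tr" if i < 8 else "va"
    data_handler.add_snapshot(
        f"Be_snapshot{i}.in.npy",
        data_path,
        f"Be_snapshot{i}.out.npy",
        data_path,
        flag,
    )
```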
After data has been added to the `DataHandler`, it has to actually be loaded
and scaled via
```python
data_handler.prepare_data()
```
The `DataHandler` object can now be used for machine learning.
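With the data prepared, the input and output dimensions inferred from the snapshots can be inspected; these same attributes are used below to size the network. A minimal sketch:

```python
# Dimensions inferred from the loaded snapshots.
print(data_handler.input_dimension)   # descriptor entries per grid point
print(data_handler.output_dimension)  # LDOS entries per grid point
```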
Building and training a model
MALA uses neural networks (NNs) as the backbone of its ML-DFT models. To
construct one, we have to specify the number of neurons per layer. This is also
done via the `Parameters` object. In principle, we can specify the layer sizes
whenever we want; however, it makes sense to do so after the data has been
loaded, because then it is easier to make sure that the dimensions of the
layers agree with the data. To build a NN, we specify
```python
parameters.network.layer_sizes = [
    data_handler.input_dimension,
    100,
    data_handler.output_dimension,
]
network = mala.Network(parameters)
```
Now, we can easily train this network with the parameters specified above by doing
```python
trainer = mala.Trainer(parameters, network, data_handler)
trainer.train_network()
```
Afterwards, we want to save this model for future use. MALA saves models
in a `*.zip` format. Within each model archive, information like the scaling
coefficients, the model weights themselves, etc. is stored in one place where MALA
can easily access it. Additionally, it makes sense to provide MALA with a
sample calculation output (from the simulations used to gather the training
data), so that critical parameters like simulation temperature, grid
coarseness, etc., are available at inference time. This is done via
```python
import os

additional_calculation_data = os.path.join(data_path, "Be_snapshot0.out")
trainer.save_run(
    "be_model", additional_calculation_data=additional_calculation_data
)
```
With this call, the additional information is set and the resulting model is saved. It is now ready to be used.
Testing a model
Before using a model in production, it is wise to test its performance. To that
end, MALA provides a `Tester` class, which allows users to load a model,
give it data unseen during training, and verify the model's performance
on that data.
This verification is done by selecting observables of interest (e.g., the band
energy, total energy, or number of electrons) and comparing ML-DFT predictions
with the ground truth. To instantiate a `Tester` object, call
```python
parameters, network, data_handler, tester = mala.Tester.load_run("be_model")
```
There are a few useful options we should set when testing a network.
Firstly, we need to specify which observables to test. Secondly, we have to
decide whether we want the resulting accuracy measures reported per individual
snapshot (`"list"`) or as an average across all snapshots (`"mae"`).
Finally, it is useful to enable lazy-loading. Lazy-loading is a feature that
incrementally loads data into memory. It is necessary when operating on large
amounts of data; its usage in the training routine is further discussed in
the advanced training section.
When testing a model, it is prudent to enable it, since a lot of data may
be involved. The accompanying syntax for these three options is
```python
tester.observables_to_test = ["band_energy", "number_of_electrons"]
tester.output_format = "list"
parameters.data.use_lazy_loading = True
```
Afterwards, new data can be added just as shown above, now with the snapshot
flag being `"te"` for testing data.
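A minimal sketch of this step, assuming a further example snapshot following the naming pattern above:

```python
# Hypothetical test snapshot, named following the example data's pattern.
data_handler.add_snapshot(
    "Be_snapshot2.in.npy", data_path, "Be_snapshot2.out.npy", data_path, "te"
)
# Reload and rescale the data. Reusing the scaling fitted during training
# (rather than refitting it) via reparametrize_scaler=False is an assumption
# about the prepare_data signature; consult the API reference for the exact call.
data_handler.prepare_data(reparametrize_scaler=False)
```

Once the test data is prepared, testing can be performed via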
```python
results = tester.test_all_snapshots()
```
This results in a dictionary, which can either be saved to a `.csv` file or
processed directly.
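For instance, a minimal post-processing sketch, assuming the returned dictionary maps observable names to per-snapshot accuracy values (consistent with `output_format = "list"` above):

```python
import pandas as pd

# Assumed structure: one key per observable, one value per test snapshot.
results_df = pd.DataFrame(results)
results_df.to_csv("test_results.csv", index=False)
print(results_df)
```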