Improved data conversion

As a general remark please be reminded that if you have not used LAMMPS for your first steps in MALA, and instead used the python-based descriptor calculation methods, we highly advise switching to LAMMPS for advanced/more involved examples (see installation instructions for LAMMPS).

Tuning descriptors

The data conversion shown in the basic MALA guide is straightforward from an interface point of view, however it is non-trivial to determine the correct hyperparameters for the bispectrum descriptors. Ideally, cutoff radius and dimensionality of the descriptors should be chosen such that atomic environments around each point in space are accurately represented.

Some physical intuition factors into this process, as detailed in the MALA publication on hyperparameter optimization. There, it is discussed that too large cutoff radii lead to descriptors that are too similar for different points in space, complicating the ML training, while too small radii lead to descriptors not containing enough physical information, and do not yield physically accurate models. Likewise, the dimensionality of the expansion of the atomic density needs to match the amount of information one wants to encode.

In the publication mentioned above, a method has been devised based on these principles, the so called average cosine similarity distance (ACSD) analysis. Based on cosine similarities between descriptor and target vectors for distinct points, optimal descriptor hyperparameters can be determined; for more detail, please refer to the publication.

Within MALA, this analysis is available as a special hyperparameter optimization routine, as showcased in the example advanced/ex04_acsd.py. The syntax for this analysis is similar to the regular hyperparameter optimization interface. First, set up an optimizer via

import mala

parameters = mala.Parameters()

# Specify the details of the ACSD analysis.
parameters.descriptors.acsd_points = 100
hyperoptimizer = mala.ACSDAnalyzer(parameters)

The acsd_points parameter determines how many points are compared during the ACSD analysis; the more points you select, the longer the analysis takes. If you set this value to 100, 100 points are compared with 100 different points each, yielding a point cloud of 10.000 points for analysis. For most purposes, this should be enough.

Afterwards, specify ranges for the bispectrum hyperparameters over which to search for the optimal set of values via

hyperoptimizer.add_hyperparameter("bispectrum_twojmax", [2, 4])
hyperoptimizer.add_hyperparameter("bispectrum_cutoff", [1.0, 2.0])

These two are the only hyperparameters needed for the bispectrum descriptors. Choose the lists according to the demands of your system; a good starting point is a coarse search over cutoff radii, and bispectrum_twojmax values of 2 to a maximum of 10.

Afterwards, you have to add data to the hyperparameter optimization. This is similar to the regular hyperparameter optimization workflow, with two import distinctions. Firstly, only add preprocessed data for the LDOS; for the descriptors, add a raw calculation output, from which the descriptors can be computed. Secondly, the add_snapshot function is provided directly via the hyperparameter optimizer. No DataHandler instance is needed. An example would be this:

hyperoptimizer.add_snapshot("espresso-out", os.path.join(data_path, "Be_snapshot1.out"),
                            "numpy", os.path.join(data_path, "Be_snapshot1.out.npy"),
                            target_units="1/(Ry*Bohr^3)")
hyperoptimizer.add_snapshot("espresso-out", os.path.join(data_path, "Be_snapshot2.out"),
                            "numpy", os.path.join(data_path, "Be_snapshot2.out.npy"),
                            target_units="1/(Ry*Bohr^3)")

Once this is done, you can start the optimization via

hyperoptimizer.perform_study(return_plotting=False)
hyperoptimizer.set_optimal_parameters()

If return_plotting is set to True, relevant plotting data for the analysis are returned. This is useful for exploratory searches.

Since the ACSD re-calculates the bispectrum descriptors for each combination of hyperparameters, it is useful to use parallel descriptor calculation. To do so, you can enable the MPI capabilites of MALA/LAMMPS. Once enabled, multiple CPUs can be used in parallel to calculate descriptors. Enabling MPI in MALA can easily be done via

parameters.use_mpi = True

If you use MPI, multiple CPUs need to be allocated to the MALA computation.

Parallel data conversion

Parallelization may also generally be used for data conversion via the DataConverter class. Just enable the MPI function in MALA via

parameters.use_mpi = True

prior to using the DataConverter class. Then, all processing will be done in parallel - both the descriptor calculation as well as the LDOS parsing.