Improved hyperparameter optimization

Advanced optimization algorithms

As discussed in the MALA publication on hyperparameter optimization, advanced hyperparameter optimization strategies have been evaluated for ML-DFT models with MALA, namely:

  • NASWOT (Neural architecture search without training): A training-free hyperparameter optimization technique. It works by correlating a network's ability to distinguish between data points at initialization with its performance after training; a conceptual sketch of this scoring idea is shown below the list.

  • OAT (Orthogonal array tuning): This technique requires network training, but constructs an optimal set of trials based on orthogonal arrays (a concept from optimal design theory), so that a maximum of information is extracted from a limited number of training runs; a small illustration of such an array follows below the list.
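
The scoring idea behind NASWOT can be sketched in a few lines. The following is a conceptual illustration in the spirit of the original NASWOT method (scoring an untrained network by how well its ReLU activation patterns separate a minibatch), not MALA's internal implementation; all names are purely illustrative.

import torch


def naswot_score(network, minibatch):
    """Conceptual NASWOT score: how well does an untrained network
    distinguish the samples in a minibatch?"""
    binary_codes = []

    def record_code(module, inputs, output):
        # Record which ReLU units fire for each sample (one binary code per sample).
        binary_codes.append((output > 0).flatten(1).float())

    hooks = [module.register_forward_hook(record_code)
             for module in network.modules()
             if isinstance(module, torch.nn.ReLU)]
    with torch.no_grad():
        network(minibatch)
    for hook in hooks:
        hook.remove()

    # Similarity kernel between the binary codes of all samples; its
    # log-determinant is large when the samples are well separated,
    # which correlates with accuracy after training.
    codes = torch.cat(binary_codes, dim=1)
    kernel = codes @ codes.T + (1 - codes) @ (1 - codes).T
    return torch.slogdet(kernel / codes.shape[1])[1].item()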
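
The orthogonal-array idea behind OAT can be illustrated with three hyperparameters of two choices each: an exhaustive grid would require 2^3 = 8 trainings, whereas the standard L4 orthogonal array covers every pairwise combination of choices equally often with only 4 trials. The hyperparameter names and choices below are made up for illustration and are not MALA's.

# L4(2^3) orthogonal array: 4 rows (trials), 3 columns (hyperparameters);
# every pair of columns contains each combination of levels exactly once.
L4 = [
    [0, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]

# Purely illustrative two-level hyperparameters.
choices = {
    "activation": ["ReLU", "Sigmoid"],
    "neurons_per_layer": [32, 64],
    "number_of_layers": [2, 3],
}

for trial, row in enumerate(L4):
    config = {name: options[level]
              for (name, options), level in zip(choices.items(), row)}
    print(f"Trial {trial}: {config}")
    # ... train a network with this configuration and record its accuracy ...

# OAT then compares the mean accuracy per level of each hyperparameter
# to pick the best level, without ever training all 8 combinations.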

Both methods can easily be enabled without changing the familiar hyperparameter optimization workflow, as shown in the file advanced/ex07_advanced_hyperparameter_optimization.

These optimization algorithms are activated via the Parameters object:

# Use NASWOT
parameters.hyperparameters.hyper_opt_method = "naswot"
# Use OAT
parameters.hyperparameters.hyper_opt_method = "oat"

Both techniques are largely compatible with other MALA capabilities, with a few exceptions:

  • NASWOT: Can only be used with hyperparameters related to the network architecture (layer sizes, activation functions, etc.); training-related hyperparameters will be ignored, and a warning to this effect will be printed. Only "categorical" hyperparameters are supported. NASWOT can be run in parallel by setting parameters.use_mpi=True (see the snippet after this list).

  • OAT: Currently cannot be run in parallel. Only "categorical" hyperparameters are supported.
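
For the parallel NASWOT case, enabling MPI is a one-line change in the optimization script; how the script is then launched depends on your MPI setup, and the mpirun command in the comment below is only one common option, not something prescribed by MALA.

# Enable MPI-parallel NASWOT evaluation in the hyperparameter optimization script:
parameters.use_mpi = True
# ...then launch the script through an MPI runner of your choice, e.g.
# (adapt to your cluster):
#     mpirun -n 4 python3 <your_hyperparameter_optimization_script>.py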

For more details on the mathematical background of these methods, please refer to the aforementioned publication.