data_shuffler

Mixes data between snapshots for improved lazy-loading training.

class DataShuffler(parameters: Parameters, target_calculator=None, descriptor_calculator=None)[source]

Bases: DataHandlerBase

Mixes data between snapshots for improved lazy-loading training.

This is a DISK operation - new, shuffled snapshots will be created on disk.
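To make the idea of mixing data between snapshots concrete, here is a minimal in-memory sketch using only the standard library. It is not the MALA implementation, which operates on files on disk. The function name `mix_snapshots`, the toy data, and the requirement that the pooled data split evenly are all assumptions made for illustration.

```python
import random


def mix_snapshots(snapshots, number_of_shuffled_snapshots=None, seed=0):
    """Pool the samples of all snapshots, permute them, and split them again.

    Hypothetical helper for illustration; each snapshot is a list of
    (input, output) pairs, and every pair is kept together while shuffling.
    """
    if number_of_shuffled_snapshots is None:
        # Mirror the documented default: keep the number of snapshots.
        number_of_shuffled_snapshots = len(snapshots)

    pooled = [pair for snapshot in snapshots for pair in snapshot]
    if len(pooled) % number_of_shuffled_snapshots != 0:
        # Simplifying assumption of this sketch: only even splits are allowed.
        raise ValueError("Pooled data cannot be split evenly.")

    # A seeded generator keeps the permutation reproducible.
    random.Random(seed).shuffle(pooled)

    size = len(pooled) // number_of_shuffled_snapshots
    return [
        pooled[i * size:(i + 1) * size]
        for i in range(number_of_shuffled_snapshots)
    ]


# Two toy snapshots with four samples each.
snapshot_a = [(f"a_in_{i}", f"a_out_{i}") for i in range(4)]
snapshot_b = [(f"b_in_{i}", f"b_out_{i}") for i in range(4)]
shuffled = mix_snapshots([snapshot_a, snapshot_b])
```

After the call, each new snapshot draws its samples from the pool of both originals, which is what removes the per-snapshot ordering when the data is loaded lazily, one snapshot at a time.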

Parameters:
temporary_shuffled_snapshots

A list containing snapshot objects of temporary, snapshot-like shuffled data files. By default, this list is empty. If the function “shuffle_snapshots_temporary” is used, it will be populated with temporary files saved to hard drive, which can be deleted after model training. Please note that the “snapshot_function”, “input_units”, “output_units” and “calculation_output” fields of the snapshots within this list

Type:

list

add_snapshot(input_file, input_directory, output_file, output_directory, snapshot_type=None)[source]

Add a snapshot to the data pipeline.

Parameters:
  • input_file (string) – File with saved numpy input array.

  • input_directory (string) – Directory containing input_file.

  • output_file (string) – File with saved numpy output array.

  • output_directory (string) – Directory containing output_file.

  • snapshot_type (string) – Either “numpy” or “openpmd” based on what kind of files you want to operate on.
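As a rough usage sketch, several snapshots can be registered in a loop. Only the `add_snapshot` signature is taken from this page. The helper `register_snapshots`, the file names, and the directory are hypothetical, and `RecordingShuffler` is a stand-in that exists only so the example runs without MALA installed; in practice a `DataShuffler` instance would be passed instead.

```python
class RecordingShuffler:
    """Stand-in for DataShuffler that only records add_snapshot calls."""

    def __init__(self):
        self.calls = []

    def add_snapshot(self, input_file, input_directory, output_file,
                     output_directory, snapshot_type=None):
        self.calls.append((input_file, input_directory, output_file,
                           output_directory, snapshot_type))


def register_snapshots(shuffler, snapshot_files, input_directory,
                       output_directory, snapshot_type="numpy"):
    """Add every (input_file, output_file) pair to the shuffler."""
    for input_file, output_file in snapshot_files:
        shuffler.add_snapshot(
            input_file, input_directory,
            output_file, output_directory,
            snapshot_type=snapshot_type,
        )
    return len(snapshot_files)


# Hypothetical file names and directory, for illustration only.
shuffler = RecordingShuffler()
count = register_snapshots(
    shuffler,
    [("snapshot0.in.npy", "snapshot0.out.npy"),
     ("snapshot1.in.npy", "snapshot1.out.npy")],
    "/path/to/data",
    "/path/to/data",
)
```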

delete_temporary_shuffled_snapshots()[source]

Delete temporary files created during shuffling of data.

If shuffling was done with the option “shuffle_to_temporary”, the shuffled data is saved to temporary files, which can safely be deleted with this function.
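One way to make sure the temporary files are always cleaned up is to wrap training in a try/finally block. The sketch below assumes the methods `shuffle_snapshots` and `delete_temporary_shuffled_snapshots` and the attribute `temporary_shuffled_snapshots` behave as described on this page. The helper `train_on_temporary_shuffle`, the `train` callback, and the save path are hypothetical, and `FakeShuffler` is a stand-in so the example runs without MALA.

```python
class FakeShuffler:
    """Stand-in for DataShuffler; it creates no files."""

    def __init__(self):
        self.temporary_shuffled_snapshots = []
        self.deleted = False

    def shuffle_snapshots(self, complete_save_path=None,
                          shuffle_to_temporary=False):
        if shuffle_to_temporary:
            # Pretend two temporary snapshots were written.
            self.temporary_shuffled_snapshots = ["tmp_0", "tmp_1"]

    def delete_temporary_shuffled_snapshots(self):
        self.deleted = True


def train_on_temporary_shuffle(shuffler, save_path, train):
    """Shuffle to temporary files, train, then always clean up."""
    shuffler.shuffle_snapshots(complete_save_path=save_path,
                               shuffle_to_temporary=True)
    try:
        return train(shuffler.temporary_shuffled_snapshots)
    finally:
        # Runs even if training raises, so no temporary files are left behind.
        shuffler.delete_temporary_shuffled_snapshots()


fake = FakeShuffler()
# The lambda stands in for a real training routine.
result = train_on_temporary_shuffle(fake, "/path/to/tmp", lambda snaps: len(snaps))
```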

shuffle_snapshots(complete_save_path=None, descriptor_save_path=None, target_save_path=None, save_name='mala_shuffled_snapshot*', number_of_shuffled_snapshots=None, shuffle_to_temporary=False)[source]

Shuffle the snapshots into new snapshots.

The new snapshots are saved to disk.

Parameters:
  • complete_save_path (string) – If not None: the directory in which all snapshots will be saved. Overrides descriptor_save_path and target_save_path if set.

  • descriptor_save_path (string) – Directory in which to save descriptor data.

  • target_save_path (string) – Directory in which to save target data.

  • save_name (string) – Name under which the shuffled snapshots are saved.

  • number_of_shuffled_snapshots (int) – If not None, this class will attempt to redistribute the data to this number of snapshots. If None, the same number of snapshots as provided will be used.

  • shuffle_to_temporary (bool) – If True, shuffled files will be written to temporary data files. Which paths are used is consistent with non-temporary usage of this class. The paths and names of these temporary files can then be found in the class attribute temporary_shuffled_snapshots.
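The way the path and count parameters interact can be summarized in a short sketch. The two helpers below are hypothetical and are not part of MALA; they only restate the rules given in the parameter list, namely that complete_save_path takes precedence over the two specific paths and that number_of_shuffled_snapshots falls back to the number of snapshots provided.

```python
def resolve_save_paths(complete_save_path=None, descriptor_save_path=None,
                       target_save_path=None):
    """Return (descriptor_path, target_path) following the documented rule."""
    if complete_save_path is not None:
        # A complete path takes precedence over both specific paths.
        return complete_save_path, complete_save_path
    return descriptor_save_path, target_save_path


def resolve_snapshot_count(number_of_provided_snapshots,
                           number_of_shuffled_snapshots=None):
    """Return how many shuffled snapshots would be requested."""
    if number_of_shuffled_snapshots is None:
        return number_of_provided_snapshots
    return number_of_shuffled_snapshots


# Hypothetical directories, for illustration only.
both = resolve_save_paths(complete_save_path="/data/shuffled",
                          descriptor_save_path="/data/descriptors")
separate = resolve_save_paths(descriptor_save_path="/data/descriptors",
                              target_save_path="/data/targets")
default_count = resolve_snapshot_count(4)
custom_count = resolve_snapshot_count(4, number_of_shuffled_snapshots=8)
```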