data_shuffler

Mixes data between snapshots for improved lazy-loading training.

class DataShuffler(parameters: Parameters, target_calculator=None, descriptor_calculator=None)

Bases: DataHandlerBase

Mixes data between snapshots for improved lazy-loading training.

This is a DISK operation - new, shuffled snapshots will be created on disk.
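The idea of mixing data between snapshots can be illustrated with a small in-memory sketch. This is not the library's implementation: the helper `mix_snapshots` and the toy data are hypothetical, nothing is read from or written to disk, and only the standard library is used. It shows the one property that matters for training, namely that each input stays paired with its output while points move between snapshots.

```python
import random


def mix_snapshots(snapshots, seed=0):
    """Redistribute (input, output) pairs across the same number of snapshots.

    `snapshots` is a list of snapshots; each snapshot is a list of
    (input, output) pairs. Hypothetical helper, not part of the library.
    """
    # Pool every pair so that inputs and outputs stay matched.
    pooled = [pair for snapshot in snapshots for pair in snapshot]

    # One permutation over the whole pool mixes points between snapshots.
    rng = random.Random(seed)
    rng.shuffle(pooled)

    # Cut the pool back into snapshots of the original sizes.
    mixed, start = [], 0
    for snapshot in snapshots:
        end = start + len(snapshot)
        mixed.append(pooled[start:end])
        start = end
    return mixed


# Two toy snapshots; the output of each point records where it came from.
snapshot_a = [(i, ("a", i)) for i in range(4)]
snapshot_b = [(i, ("b", i)) for i in range(4)]
mixed = mix_snapshots([snapshot_a, snapshot_b], seed=1)
```

After the call, `mixed` again holds two snapshots of four points each, but each may now contain points that originated in either `snapshot_a` or `snapshot_b`.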

Parameters:
add_snapshot(input_file, input_directory, output_file, output_directory, snapshot_type='numpy')

Add a snapshot to the data pipeline.

Parameters:
  • input_file (string) – File with the saved input array.

  • input_directory (string) – Directory containing input_file.

  • output_file (string) – File with the saved output array.

  • output_directory (string) – Directory containing output_file.

  • snapshot_type (string) – Either “numpy” or “openpmd”, depending on the kind of files to operate on.

shuffle_snapshots(complete_save_path=None, descriptor_save_path=None, target_save_path=None, save_name='mala_shuffled_snapshot*', number_of_shuffled_snapshots=None)

Shuffle the snapshots into new snapshots.

The new snapshots are saved to file.

Parameters:
  • complete_save_path (string) – If not None: the directory in which all snapshots will be saved. Overrides descriptor_save_path and target_save_path if set.

  • descriptor_save_path (string) – Directory in which to save descriptor data.

  • target_save_path (string) – Directory in which to save target data.

  • save_name (string) – Name under which the shuffled snapshots are saved.

  • number_of_shuffled_snapshots (int) – If not None, this class will attempt to redistribute the data to this number of snapshots. If None, the number of shuffled snapshots equals the number of snapshots provided.
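The interplay of save_name and number_of_shuffled_snapshots can be sketched as a small planning function. Everything here is an assumption made for illustration: the helper `plan_shuffled_snapshots` is hypothetical, the `*` in save_name is assumed to be a placeholder for the snapshot index, and the sketch simply rejects totals that do not divide evenly, which may differ from how the library handles that case.

```python
def plan_shuffled_snapshots(snapshot_sizes,
                            number_of_shuffled_snapshots=None,
                            save_name="mala_shuffled_snapshot*"):
    """Return a (name, size) tuple for every shuffled snapshot.

    Hypothetical helper: `snapshot_sizes` lists the number of data points
    in each snapshot that was added.
    """
    total = sum(snapshot_sizes)

    # None means: produce as many snapshots as were provided.
    count = number_of_shuffled_snapshots
    if count is None:
        count = len(snapshot_sizes)
    if count <= 0:
        raise ValueError("Need at least one shuffled snapshot.")

    # Assumption of this sketch: only even splits are accepted.
    if total % count != 0:
        raise ValueError(
            f"Cannot split {total} points evenly into {count} snapshots."
        )
    size = total // count

    # Assumption: '*' in save_name stands for the snapshot index.
    return [(save_name.replace("*", str(i)), size) for i in range(count)]


# Two snapshots of 1000 points each, redistributed into four.
plan = plan_shuffled_snapshots([1000, 1000], number_of_shuffled_snapshots=4)
```

Under these assumptions, `plan` contains four entries named `mala_shuffled_snapshot0` to `mala_shuffled_snapshot3`, each holding 500 points.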