lazy_load_dataset_single

DataSet for lazy-loading.

class LazyLoadDatasetSingle(*args: Any, **kwargs: Any)[source]

Bases: Dataset

DataSet class for lazy loading.

Only keeps in memory the snapshots that are currently being processed. Uses a “caching” approach that keeps the last used snapshot in memory until values from a new one are requested. Therefore, shuffling at the DataSampler / DataLoader level is discouraged to the point that it was disabled. Instead, we mix the snapshot load order here to have at least some degree of mixing.
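The caching behaviour described above can be illustrated with a minimal sketch. This is not MALA's implementation: the class `LastSnapshotCache`, the function `fake_loader`, and the snapshot sizes are hypothetical names chosen for illustration. It only shows why sequential access is cheap while interleaved (shuffled) access forces a reload on almost every item.

```python
from typing import Callable, List, Tuple


class LastSnapshotCache:
    """Hypothetical helper: keeps only the most recently used snapshot in memory."""

    def __init__(self, snapshot_sizes: List[int],
                 load_snapshot: Callable[[int], list]):
        self.snapshot_sizes = snapshot_sizes
        self.load_snapshot = load_snapshot
        self.currently_loaded = -1  # no snapshot loaded yet
        self.data = None
        self.loads = 0  # counts how often a snapshot had to be (re)loaded

    def _locate(self, idx: int) -> Tuple[int, int]:
        """Map a global index to (snapshot number, index within snapshot)."""
        if idx < 0:
            raise IndexError("negative indices are not supported")
        for snapshot_number, size in enumerate(self.snapshot_sizes):
            if idx < size:
                return snapshot_number, idx
            idx -= size
        raise IndexError("index out of range")

    def __getitem__(self, idx: int):
        snapshot_number, local_idx = self._locate(idx)
        if snapshot_number != self.currently_loaded:
            # Cache miss: replace the cached snapshot with the requested one.
            self.data = self.load_snapshot(snapshot_number)
            self.currently_loaded = snapshot_number
            self.loads += 1
        return self.data[local_idx]


def fake_loader(snapshot_number: int) -> list:
    """Stands in for reading one snapshot file from disk."""
    return [(snapshot_number, i) for i in range(4)]


# Sequential access: every snapshot is loaded exactly once.
cache = LastSnapshotCache([4, 4, 4], fake_loader)
for i in range(12):
    cache[i]
sequential_loads = cache.loads

# Interleaved access: consecutive items come from different snapshots,
# so every single access triggers a reload.
interleaved = LastSnapshotCache([4, 4, 4], fake_loader)
for i in [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]:
    interleaved[i]
interleaved_loads = interleaved.loads
```

In this toy setup the sequential pass needs 3 loads and the interleaved pass needs 12, which is the cost that disabling DataLoader-level shuffling avoids.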

Parameters:

input_dimension (int) – Dimension of an input vector.
output_dimension (int) – Dimension of an output vector.
input_data_scaler (mala.datahandling.data_scaler.DataScaler) – Used to scale the input data.
output_data_scaler (mala.datahandling.data_scaler.DataScaler) – Used to scale the output data.
descriptor_calculator (mala.descriptors.descriptor.Descriptor) – Used to do unit conversion on input data.
target_calculator (mala.targets.target.Target or derivative) – Used to do unit conversion on output data.
use_ddp (bool) – If True, it is assumed that DDP (distributed data parallel) is used.
input_requires_grad (bool) – If True, then the gradient is stored for the inputs.

allocated

True if dataset is allocated.

Type: bool

currently_loaded_file

Index of the currently loaded file.

Type: int

descriptor_calculator

Used to do unit conversion on input data.

Type: mala.descriptors.descriptor.Descriptor

input_data

Input data tensor.

Type: torch.Tensor

input_dtype

Input data type.

Type: numpy.dtype

input_shape

Input data dimensions.

Type: list

input_shm_name

Name of shared memory allocated for input data.

Type: str

loaded

True if data has been loaded to shared memory.

Type: bool

output_data

Output data tensor.

Type: torch.Tensor

output_dtype

Output data dtype.

Type: numpy.dtype

output_shape

Output data dimensions.

Type: list

output_shm_name

Name of shared memory allocated for output data.

Type: str

return_outputs_directly

Control whether outputs are actually transformed. Has to be False for training. In the testing case, numerical errors are smaller if set to True.

Type: bool

snapshot

Currently loaded snapshot object.

Type: mala.datahandling.snapshot.Snapshot

target_calculator

Used to do unit conversion on output data.

Type: mala.targets.target.Target or derivative

allocate_shared_mem()[source]

Allocate the shared memory buffer for use by the prefetching process.

The buffer is sized via numpy metadata.
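As an illustration of sizing a shared memory buffer from numpy metadata, the following sketch reads only the shape and dtype of a `.npy` file and allocates a matching buffer with Python's standard `multiprocessing.shared_memory` module. It is an assumption about the general technique, not a copy of what `allocate_shared_mem()` does internally; the file name `snapshot.npy` and the array contents are made up for the example.

```python
import os
import tempfile
from multiprocessing import shared_memory

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for a snapshot file on disk (hypothetical name and contents).
    path = os.path.join(tmp_dir, "snapshot.npy")
    np.save(path, np.arange(24, dtype=np.float32).reshape(2, 3, 4))

    # Memory-mapping parses the .npy header without reading the array data,
    # which is enough to learn the shape and dtype.
    header = np.load(path, mmap_mode="r")
    shape, dtype = header.shape, header.dtype
    del header  # release the memory map

    # Size the buffer from the metadata alone.
    nbytes = int(np.prod(shape)) * dtype.itemsize
    shm = shared_memory.SharedMemory(create=True, size=nbytes)

    # View the shared block as an array and fill it with the snapshot data.
    buffer = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    buffer[...] = np.load(path)
    total = float(buffer.sum())

    # Drop the array view before closing, then free the shared block.
    del buffer
    shm.close()
    shm.unlink()
```

Another process could attach to the same block by name via `shared_memory.SharedMemory(name=...)`, which is presumably the role of attributes such as `input_shm_name` and `output_shm_name`.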

deallocate_shared_mem()[source]

Deallocate the shared memory buffer used by the prefetching process.

delete_data()[source]

Free the shared memory buffers.

mix_datasets()[source]

Shuffle the data in this data set.

For this class, instead of mixing the datasets, we just shuffle the indices and leave the dataset order unchanged.

NOTE: Shuffled access to the shared memory appears to perform much worse than access in the FastTensorDataset. To regain performance, this method could be rewritten to shuffle the datasets as in the existing LazyLoadDataset. Another option might be to load the numpy file in permuted order to avoid the shuffled reads; however, this would require some care to avoid erroneously overwriting shared memory data in cases where a single dataset object is used back to back.
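The difference between shuffling indices and shuffling data can be sketched as follows. The class `IndexShuffledView` is hypothetical and uses only the Python standard library; it is not MALA code, and it ignores shared memory entirely. It shows that the stored data stays in its original order while the access order is permuted.

```python
import random
from typing import List, Sequence


class IndexShuffledView:
    """Hypothetical wrapper: permutes the access order, not the storage."""

    def __init__(self, storage: Sequence, seed: int = 0):
        self.storage = storage
        self.indices: List[int] = list(range(len(storage)))
        self._rng = random.Random(seed)

    def mix(self) -> None:
        """Shuffle the index list in place; the storage is left untouched."""
        self._rng.shuffle(self.indices)

    def __len__(self) -> int:
        return len(self.indices)

    def __getitem__(self, i: int):
        # Each lookup is redirected through the shuffled index list, so
        # reads from the underlying storage are no longer contiguous.
        return self.storage[self.indices[i]]


storage = [10, 11, 12, 13, 14, 15]
view = IndexShuffledView(storage, seed=42)
view.mix()
visited = [view[i] for i in range(len(view))]
```

Every element is still visited exactly once per pass, but in permuted order. The non-contiguous reads in `__getitem__` are the kind of access pattern that the note above associates with reduced performance.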