lazy_load_dataset_single
DataSet for lazy-loading.
- class LazyLoadDatasetSingle(*args: Any, **kwargs: Any)[source]
Bases:
Dataset
DataSet class for lazy loading.
Only loads into memory the snapshots that are currently being processed. Uses a “caching” approach of keeping the last used snapshot in memory until values from a new one are requested. Therefore, shuffling at the DataSampler / DataLoader level is discouraged to the point that it was disabled. Instead, we mix the snapshot load order here to have at least some mixing.
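The caching behaviour described above can be illustrated with a minimal sketch. This is not the MALA implementation: the class `SingleSnapshotCache`, its `loaders` argument, and the `load_count` counter are hypothetical names introduced only for this example. It shows why random access across snapshots is costly when only one snapshot is cached.

```python
import numpy as np


class SingleSnapshotCache:
    """Hypothetical helper: keeps only the most recently used snapshot in memory."""

    def __init__(self, loaders, snapshot_size):
        # loaders: one zero-argument callable per snapshot, each returning an array.
        self.loaders = loaders
        self.snapshot_size = snapshot_size
        self.currently_loaded = None  # index of the cached snapshot, if any
        self.data = None
        self.load_count = 0  # how often a snapshot had to be (re)loaded

    def __len__(self):
        return len(self.loaders) * self.snapshot_size

    def __getitem__(self, idx):
        file_index, local_index = divmod(idx, self.snapshot_size)
        if file_index != self.currently_loaded:
            # Cache miss: replace the cached snapshot with the requested one.
            self.data = self.loaders[file_index]()
            self.currently_loaded = file_index
            self.load_count += 1
        return self.data[local_index]


def make_loader(k, size=4):
    # Stand-in for reading snapshot k from disk (assumption for this sketch).
    return lambda: np.full(size, k, dtype=np.float32)


cache = SingleSnapshotCache([make_loader(k) for k in range(3)], snapshot_size=4)

# Sequential access: each snapshot is loaded exactly once.
sequential = [cache[i] for i in range(len(cache))]
sequential_loads = cache.load_count

# Shuffled access: the single cached snapshot is typically evicted repeatedly.
cache.load_count = 0
cache.currently_loaded = None
rng = np.random.default_rng(0)
shuffled = [cache[i] for i in rng.permutation(len(cache))]
shuffled_loads = cache.load_count
```

With sequential access the number of loads equals the number of snapshots, whereas sample-level shuffling can trigger a reload on almost every access.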
- Parameters:
input_dimension (int) – Dimension of an input vector.
output_dimension (int) – Dimension of an output vector.
input_data_scaler (mala.datahandling.data_scaler.DataScaler) – Used to scale the input data.
output_data_scaler (mala.datahandling.data_scaler.DataScaler) – Used to scale the output data.
descriptor_calculator (mala.descriptors.descriptor.Descriptor) – Used to do unit conversion on input data.
target_calculator (mala.targets.target.Target or derivative) – Used to do unit conversion on output data.
use_ddp (bool) – If True, it is assumed that DDP is used.
input_requires_grad (bool) – If True, then the gradient is stored for the inputs.
- allocated
True if dataset is allocated.
- Type:
bool
- currently_loaded_file
Index of the currently loaded file.
- Type:
int
- descriptor_calculator
Used to do unit conversion on input data.
- input_data
Input data tensor.
- Type:
torch.Tensor
- input_dtype
Input data type.
- Type:
numpy.dtype
- input_shape
Input data dimensions.
- Type:
list
- input_shm_name
Name of shared memory allocated for input data.
- Type:
str
- loaded
True if data has been loaded to shared memory.
- Type:
bool
- output_data
Output data tensor.
- Type:
torch.Tensor
- output_dtype
Output data dtype.
- Type:
numpy.dtype
- output_shape
Output data dimensions.
- Type:
list
- output_shm_name
Name of shared memory allocated for output data.
- Type:
str
- return_outputs_directly
Controls whether outputs are actually transformed. Has to be False for training. In the testing case, numerical errors are smaller if set to True.
- Type:
bool
- snapshot
Currently loaded snapshot object.
- target_calculator
Used to do unit conversion on output data.
- Type:
mala.targets.target.Target or derivative
Allocate the shared memory buffer for use by the prefetching process.
The buffer is sized via numpy metadata.
Deallocate the shared memory buffer used by the prefetching process.
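As a rough illustration of allocating and deallocating such a buffer, the sketch below uses Python's standard `multiprocessing.shared_memory` module. It is an assumption that this resembles what the class does internally; the file name `snapshot.npy` and all variable names are hypothetical. The buffer size is derived only from the shape and dtype of the saved array, which are read without loading the array contents.

```python
import os
import tempfile
from multiprocessing import shared_memory

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "snapshot.npy")  # hypothetical snapshot file
    np.save(path, np.arange(12, dtype=np.float32).reshape(4, 3))

    # Read only shape and dtype; memory mapping avoids loading the contents.
    meta = np.load(path, mmap_mode="r")
    shape, dtype = meta.shape, meta.dtype
    del meta

    # Allocate: size the buffer from the metadata alone.
    nbytes = int(np.prod(shape)) * dtype.itemsize
    shm = shared_memory.SharedMemory(create=True, size=nbytes)

    # Fill the buffer through a numpy view (stand-in for the prefetching step).
    view = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    view[:] = np.load(path)

    # A consumer attaches by name and copies the data out.
    attached = shared_memory.SharedMemory(name=shm.name)
    reader = np.ndarray(shape, dtype=dtype, buffer=attached.buf)
    copied = reader.copy()

    # Deallocate: drop all views first, then close the handles and unlink.
    del reader, view
    attached.close()
    shm.close()
    shm.unlink()
```

The shared memory name, shape, and dtype are all a consumer needs to rebuild the array view, which matches the kind of information held in attributes such as `input_shm_name`, `input_shape`, and `input_dtype`.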
- mix_datasets()[source]
Shuffle the data in this data set.
For this class, instead of mixing the datasets, we just shuffle the indices and leave the dataset order unchanged.
NOTE: Shuffled access to the shared memory appears to be much slower than with the FastTensorDataset. To regain performance, this method could be rewritten to shuffle the datasets as in the existing LazyLoadDataset. Another option might be to load the numpy file in permuted order to avoid the shuffled reads; however, this would require some care to avoid erroneously overwriting shared memory data in cases where a single dataset object is used back to back.
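The index-shuffling idea can be sketched as follows. This is a simplified stand-in under stated assumptions, not the actual method body: `data`, `indices`, `mix_indices`, and `get_item` are hypothetical names, and a plain numpy array takes the place of the shared memory buffer.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Stand-in for the snapshot data held in the buffer (assumption for this sketch).
data = np.arange(10, dtype=np.float32) * 10.0
indices = np.arange(len(data))


def mix_indices(indices, rng):
    # Shuffle the index array in place; the data itself is never moved.
    rng.shuffle(indices)


def get_item(i):
    # Item access goes through the shuffled indices.
    return data[indices[i]]


before = data.copy()
mix_indices(indices, rng)
epoch = np.array([get_item(i) for i in range(len(data))])
```

Every sample is still visited exactly once per pass, but the reads from the underlying buffer are no longer contiguous, which is consistent with the reduced performance mentioned in the note.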