data_handler
DataHandler class that loads and scales data.
- class DataHandler(parameters: Parameters, target_calculator=None, descriptor_calculator=None, input_data_scaler=None, output_data_scaler=None, clear_data=True)[source]
Bases:
DataHandlerBase
Loads and scales data. Can load from numpy or OpenPMD files.
Data that is not stored as numpy or OpenPMD files can be converted using the DataConverter class.
- Parameters:
parameters (mala.common.parameters.Parameters) – Parameters used to create the data handling object.
descriptor_calculator (mala.descriptors.descriptor.Descriptor) – Used to do unit conversion on input data. If None, then one will be created by this class.
target_calculator (mala.targets.target.Target) – Used to do unit conversion on output data. If None, then one will be created by this class.
input_data_scaler (mala.datahandling.data_scaler.DataScaler) – Used to scale the input data. If None, then one will be created by this class.
output_data_scaler (mala.datahandling.data_scaler.DataScaler) – Used to scale the output data. If None, then one will be created by this class.
clear_data (bool) – If True (default), the data list will be cleared upon creation of the object.
- input_data_scaler
Used to scale the input data.
- nr_test_data
Number of test data points.
- Type:
int
- nr_test_snapshots
Number of test snapshots.
- Type:
int
- nr_training_data
Number of training data points.
- Type:
int
- nr_training_snapshots
Number of training snapshots.
- Type:
int
- nr_validation_data
Number of validation data points.
- Type:
int
- nr_validation_snapshots
Number of validation snapshots.
- Type:
int
- output_data_scaler
Used to scale the output data.
- test_data_sets
List containing torch data sets for test data.
- Type:
list
- training_data_sets
List containing torch data sets for training data.
- Type:
list
- validation_data_sets
List containing torch data sets for validation data.
- Type:
list
- clear_data()[source]
Reset the entire data pipeline.
Useful when running multiple investigations in the same Python script.
- get_snapshot_calculation_output(snapshot_number)[source]
Get the path to the output file for a specific snapshot.
- Parameters:
snapshot_number (int) – Snapshot for which the calculation output should be returned.
- Returns:
calculation_output – Path to the calculation output for this snapshot.
- Return type:
string
- get_test_input_gradient(snapshot_number)[source]
Get the gradient of the test inputs for an entire snapshot.
This gradient is returned as a scaled tensor. The gradient is returned (rather than the inputs themselves) because slicing a variable means that pytorch no longer considers it a “leaf” variable and stops tracking and evaluating its gradient. Thus, it is easier to obtain the gradient first and then slice it.
- Parameters:
snapshot_number (int) – Number of the snapshot for which the gradient of the entire test inputs is returned.
- Returns:
gradient – Tensor holding the gradient.
- Return type:
torch.Tensor
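The leaf-variable behaviour described above can be reproduced with plain PyTorch. The following is a minimal sketch that does not use MALA itself; the tensor values and the quadratic "loss" are made up for illustration.

```python
import torch

# Leaf tensor standing in for the scaled inputs of one snapshot (made-up values).
inputs = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], requires_grad=True)

# Slicing creates a new, non-leaf tensor; autograd does not populate its .grad.
first_two_rows = inputs[0:2]

# Toy scalar "loss" that depends on every entry of the inputs.
loss = (inputs ** 2).sum()
loss.backward()

# The gradient is stored on the leaf tensor, so the gradient is sliced instead.
gradient_slice = inputs.grad[0:2]
```

Here `inputs.is_leaf` is `True` while `first_two_rows.is_leaf` is `False`, which is why the gradient is taken from the full tensor and sliced afterwards.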
- mix_datasets()[source]
For lazily-loaded data sets, the snapshot ordering is (re-)mixed.
This applies only to the training data set. For the validation and test set it does not matter.
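Conceptually, re-mixing amounts to drawing a new random order in which the training snapshots are visited. The sketch below illustrates that idea only; `mixed_snapshot_order` and its `seed` argument are hypothetical and not part of MALA.

```python
import random


def mixed_snapshot_order(number_of_snapshots, seed=None):
    """Return a shuffled visiting order for the given number of snapshots."""
    order = list(range(number_of_snapshots))
    # A dedicated Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(order)
    return order


# Hypothetical example: five lazily-loaded training snapshots.
order = mixed_snapshot_order(5, seed=0)
```

Every snapshot index still appears exactly once; only the order changes.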
- prepare_data(reparametrize_scaler=True)[source]
Prepare the data to be used in a training process.
This includes:
Checking snapshots for consistency
Parametrizing the DataScalers (if desired)
Building DataSet objects.
- Parameters:
reparametrize_scaler (bool) – If True (default), the DataScalers are parametrized based on the training data.
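The three steps above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not MALA's implementation: `MinMaxScaler`, `prepare`, the `(role, array)` snapshot layout, and the role labels `"tr"` and `"va"` are all hypothetical stand-ins.

```python
import numpy as np


class MinMaxScaler:
    """Hypothetical stand-in for a data scaler; not MALA's DataScaler."""

    def __init__(self):
        self.minimum = None
        self.maximum = None

    def fit(self, data):
        # Parametrize the scaler column by column (feature-wise).
        self.minimum = data.min(axis=0)
        self.maximum = data.max(axis=0)

    def transform(self, data):
        return (data - self.minimum) / (self.maximum - self.minimum)


def prepare(snapshots, scaler, reparametrize_scaler=True):
    """Check snapshots, optionally fit the scaler, and build scaled data sets."""
    # Step 1: all snapshots must share the same feature dimension.
    feature_counts = {array.shape[1] for _, array in snapshots}
    if len(feature_counts) != 1:
        raise ValueError("Snapshots have inconsistent feature dimensions.")

    # Step 2: parametrize the scaler on the training data only (if desired).
    if reparametrize_scaler:
        training = np.concatenate([a for role, a in snapshots if role == "tr"])
        scaler.fit(training)

    # Step 3: build one list of scaled arrays per role.
    data_sets = {}
    for role, array in snapshots:
        data_sets.setdefault(role, []).append(scaler.transform(array))
    return data_sets


snapshots = [
    ("tr", np.array([[0.0, 10.0], [4.0, 30.0]])),
    ("va", np.array([[2.0, 20.0]])),
]
data_sets = prepare(snapshots, MinMaxScaler())
```

Passing `reparametrize_scaler=False` with an already fitted scaler reuses its parameters, which mirrors the purpose of the flag described above.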
- prepare_for_testing()[source]
Prepare DataHandler for usage within Tester class.
Ensures that lazily-loaded data sets do not perform unnecessary I/O operations. Only needed in Tester class.
- raw_numpy_to_converted_scaled_tensor(numpy_array, data_type, units)[source]
Transform a raw numpy array into a scaled torch tensor.
This tensor will also be in the right units, i.e. a tensor that can simply be put into a MALA network.
- Parameters:
numpy_array (numpy.ndarray) – Array that is to be converted.
data_type (string) – Either “in” or “out”, depending on whether input or output data is processed.
units (string) – Units of the data that is processed.
- Returns:
converted_tensor – The fully converted and scaled tensor.
- Return type:
torch.Tensor
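The idea of converting units and then scaling can be sketched in NumPy. Everything here is an assumption for illustration: the conversion factor, the feature-wise standardization, and the helper `convert_and_scale` are hypothetical, and the final conversion to a torch tensor is left out.

```python
import numpy as np


def convert_and_scale(raw, unit_factor, mean, std):
    """Convert raw data to working units, then standardize it feature-wise."""
    converted = raw * unit_factor
    return (converted - mean) / std


# Made-up raw data with two features per data point.
raw = np.array([[1.0, 10.0], [3.0, 30.0]])
unit_factor = 2.0  # hypothetical unit conversion factor

# Scaling parameters are computed on data that is already in working units.
converted = raw * unit_factor
mean = converted.mean(axis=0)
std = converted.std(axis=0)

scaled = convert_and_scale(raw, unit_factor, mean, std)
```

The order matters in this sketch: the scaling parameters refer to converted data, so the unit conversion has to be applied before the scaling.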
- resize_snapshots_for_debugging(directory='./', naming_scheme_input='test_Al_debug_2k_nr*.in', naming_scheme_output='test_Al_debug_2k_nr*.out')[source]
Resize all snapshots in the list.
- Parameters:
directory (string) – Directory to which the resized snapshots should be saved.
naming_scheme_input (string) – Naming scheme for the resulting input numpy files.
naming_scheme_output (string) – Naming scheme for the resulting output numpy files.