data_handler

DataHandler class that loads and scales data.

class DataHandler(parameters: Parameters, target_calculator=None, descriptor_calculator=None, input_data_scaler=None, output_data_scaler=None, clear_data=True)[source]

Bases: DataHandlerBase

Loads and scales data. Can load from numpy or OpenPMD files.

Data that is not saved as a numpy or OpenPMD file can be converted using the DataConverter class.

Attributes:
  • input_data_scaler (mala.datahandling.data_scaler.DataScaler) – Used to scale the input data.

  • nr_test_data (int) – Number of test data points.

  • nr_test_snapshots (int) – Number of test snapshots.

  • nr_training_data (int) – Number of training data points.

  • nr_training_snapshots (int) – Number of training snapshots.

  • nr_validation_data (int) – Number of validation data points.

  • nr_validation_snapshots (int) – Number of validation snapshots.

  • output_data_scaler (mala.datahandling.data_scaler.DataScaler) – Used to scale the output data.

  • test_data_sets (list) – List containing torch data sets for test data.

  • training_data_sets (list) – List containing torch data sets for training data.

  • validation_data_sets (list) – List containing torch data sets for validation data.
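A minimal usage sketch (assumptions: the snapshot file names and data path are placeholders, add_snapshot is the method inherited from DataHandlerBase, and the "tr"/"va" flags mark training and validation snapshots):

    import mala

    parameters = mala.Parameters()
    data_path = "/path/to/snapshot/data"   # placeholder

    data_handler = mala.DataHandler(parameters)

    # Register numpy snapshots for training ("tr") and validation ("va").
    data_handler.add_snapshot("Be_snapshot0.in.npy", data_path,
                              "Be_snapshot0.out.npy", data_path, "tr")
    data_handler.add_snapshot("Be_snapshot1.in.npy", data_path,
                              "Be_snapshot1.out.npy", data_path, "va")

    # Check snapshots, parametrize the DataScalers and build the DataSets.
    data_handler.prepare_data()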

clear_data()[source]

Reset the entire data pipeline.

Useful when doing multiple investigations in the same Python file.
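A brief sketch of that pattern, reusing the handler from the sketch above:

    # ... first investigation: prepare_data(), training, evaluation ...

    # Reset the data pipeline before setting up the next investigation.
    data_handler.clear_data()

    # Now register a new set of snapshots and call prepare_data() again.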

get_snapshot_calculation_output(snapshot_number)[source]

Get the path to the output file for a specific snapshot.

Parameters:

snapshot_number (int) – Snapshot for which the calculation output should be returned.

Returns:

calculation_output – Path to the calculation output for this snapshot.

Return type:

string

get_test_input_gradient(snapshot_number)[source]

Get the gradient of the test inputs for an entire snapshot.

This gradient will be returned as a scaled tensor. The reason the gradient is returned (rather than returning the entire inputs themselves) is that by slicing a variable, PyTorch no longer considers it a “leaf” variable and will stop tracking and evaluating its gradient. Thus, it is easier to obtain the gradient and then slice it.

Parameters:

snapshot_number (int) – Number of the snapshot for which the gradient of the entire test inputs is requested.

Returns:

gradient – Tensor holding the gradient.

Return type:

torch.Tensor
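The PyTorch behaviour referenced here can be illustrated with a short, MALA-independent sketch:

    import torch

    x = torch.randn(4, 3, requires_grad=True)   # leaf tensor
    y = x[:2]                                    # slicing yields a non-leaf view
    print(y.is_leaf)                             # False

    y.sum().backward()
    print(x.grad.shape)                          # gradient is accumulated on the leaf x
    # y.grad is not populated for non-leaf tensors, so one takes the gradient
    # of the full tensor and slices it afterwards:
    grad_slice = x.grad[:2]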

mix_datasets()[source]

For lazily-loaded data sets, the snapshot ordering is (re-)mixed.

This applies only to the training data set; for the validation and test sets the ordering does not matter.

prepare_data(reparametrize_scaler=True)[source]

Prepare the data to be used in a training process.

This includes:

  • Checking snapshots for consistency

  • Parametrizing the DataScalers (if desired)

  • Building DataSet objects.

Parameters:

reparametrize_scaler (bool) – If True (default), the DataScalers are parametrized based on the training data.
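When the scalers were already parametrized elsewhere (e.g. during a previous training run), the re-parametrization can be skipped. A hedged sketch, where pre_trained_input_scaler and pre_trained_output_scaler stand for DataScaler objects obtained from that earlier run:

    data_handler = mala.DataHandler(
        parameters,
        input_data_scaler=pre_trained_input_scaler,
        output_data_scaler=pre_trained_output_scaler,
    )
    data_handler.add_snapshot("Be_snapshot2.in.npy", data_path,
                              "Be_snapshot2.out.npy", data_path, "te")
    data_handler.prepare_data(reparametrize_scaler=False)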

prepare_for_testing()[source]

Prepare the DataHandler for usage within the Tester class.

Ensures that lazily-loaded data sets do not perform unnecessary I/O operations. Only needed by the Tester class.

raw_numpy_to_converted_scaled_tensor(numpy_array, data_type, units)[source]

Transform a raw numpy array into a scaled torch tensor.

This tensor will also be in the correct units, i.e. it can be fed directly into a MALA network.

Parameters:
  • numpy_array (np.array) – Array that is to be converted.

  • data_type (string) – Either “in” or “out”, depending on whether input or output data is processed.

  • units (string) – Units of the data that is processed.

Returns:

converted_tensor – The fully converted and scaled tensor.

Return type:

torch.Tensor
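A hedged sketch of such a conversion (the file name is a placeholder and the units string is an assumption; use the units the data was actually saved in):

    import numpy as np

    raw_inputs = np.load("some_snapshot.in.npy")   # raw input (descriptor) data
    network_ready = data_handler.raw_numpy_to_converted_scaled_tensor(
        raw_inputs, "in", "None")                  # "in": input data; "None": assumed units string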

resize_snapshots_for_debugging(directory='./', naming_scheme_input='test_Al_debug_2k_nr*.in', naming_scheme_output='test_Al_debug_2k_nr*.out')[source]

Resize all snapshots in the list.

Parameters:
  • directory (string) – Directory to which the resized snapshots should be saved.

  • naming_scheme_input (string) – Naming scheme for the resulting input numpy files.

  • naming_scheme_output (string) – Naming scheme for the resulting output numpy files.
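A minimal call sketch with the default naming schemes (presumably the * in each scheme is replaced by the snapshot number when the resized files are written):

    data_handler.resize_snapshots_for_debugging(
        directory="./debug_snapshots",
        naming_scheme_input="test_Al_debug_2k_nr*.in",
        naming_scheme_output="test_Al_debug_2k_nr*.out",
    )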