bridgescaler package#
Submodules#
bridgescaler.backend module#
- class bridgescaler.backend.NumpyEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)#
Bases: JSONEncoder
Custom encoder for numpy data types.
- default(obj)#
Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).
For example, to support arbitrary iterators, you could implement default like this:
    def default(self, o):
        try:
            iterable = iter(o)
        except TypeError:
            pass
        else:
            return list(iterable)
        # Let the base class default method raise the TypeError
        return super().default(o)
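The same hook can be used to make array-like values serializable. The sketch below is not the NumpyEncoder implementation; it is a hypothetical encoder (ToListEncoder is an invented name) that relies only on the tolist() method, which NumPy arrays and NumPy scalars provide. It uses array.array from the standard library as a stand-in so that it runs without NumPy installed.

```python
import json
from array import array


class ToListEncoder(json.JSONEncoder):
    """Hypothetical encoder: converts any object exposing tolist() to plain Python."""

    def default(self, o):
        # NumPy arrays and scalars expose tolist(); array.array does too,
        # which keeps this sketch runnable with the standard library alone.
        if hasattr(o, "tolist"):
            return o.tolist()
        # Let the base class raise TypeError for anything else.
        return super().default(o)


encoded = json.dumps({"mean": array("d", [0.5, 1.5])}, cls=ToListEncoder)
decoded = json.loads(encoded)
```

Passing the encoder through the cls argument of json.dumps is the standard way to activate a custom default method.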
- bridgescaler.backend.apply_to_dict_leaves(d, operation)#
Recursively applies an operation to each leaf value in a nested dictionary.
- Parameters:
d (dict) – A nested dictionary where the operation will be applied to each leaf value.
operation (callable) – A function to apply to each leaf value.
- Returns:
A nested dictionary with the same structure as d, where each leaf is the result of operation(leaf).
- Return type:
dict
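As a rough illustration of the recursion described above, the following pure-Python sketch treats every non-dict value as a leaf. The function name apply_to_leaves is hypothetical, and the real apply_to_dict_leaves may handle edge cases differently.

```python
def apply_to_leaves(d, operation):
    """Return a new nested dict with operation applied to every non-dict value."""
    return {
        key: apply_to_leaves(value, operation) if isinstance(value, dict) else operation(value)
        for key, value in d.items()
    }


nested = {"a": {"b": 1, "c": 2}, "d": 3}
doubled = apply_to_leaves(nested, lambda v: v * 2)
```

The sketch builds a new dictionary rather than modifying the input in place, so nested is left untouched.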
- bridgescaler.backend.create_synthetic_data()#
- bridgescaler.backend.ensure_torch()#
Validates the torch installation and loads the module.
- bridgescaler.backend.load_scaler(scaler_file)#
Initialize scikit-learn or bridgescaler scaler from saved json file.
- Parameters:
scaler_file – path to json file.
- Returns:
scaler object.
- bridgescaler.backend.load_scaler_dict(scaler_dict_file)#
Loads and deserializes a nested dictionary of Bridgescaler scalers from a JSON file.
- Parameters:
scaler_dict_file (str or Path) – The file path to the JSON file containing the serialized scaler dictionary.
- Returns:
A nested dictionary of reconstructed scaler objects, with the same structure as the original dictionary passed to save_scaler_dict.
- Return type:
dict
- bridgescaler.backend.object_hook(dct: dict[Any, Any])#
- bridgescaler.backend.print_scaler(scaler)#
Output scikit-learn or bridgescaler scaler object to json string.
- Parameters:
scaler – scikit-learn-style scaler object
- Returns:
str representation of object in json format
- bridgescaler.backend.read_scaler(scaler_str)#
Initialize scikit-learn or bridgescaler scaler from json str.
- Parameters:
scaler_str – json str
- Returns:
scaler object.
- bridgescaler.backend.save_scaler(scaler, scaler_file)#
Save a scikit-learn or bridgescaler scaler object to json format.
- Parameters:
scaler – scikit-learn-style scaler object
scaler_file – path to json file where scaler information is stored.
- bridgescaler.backend.save_scaler_dict(scaler_dict, scaler_dict_file)#
Serializes and saves a nested dictionary of Bridgescaler scalers to a JSON file.
- Parameters:
scaler_dict (dict) – A nested dictionary of fitted Bridgescaler scaler objects to be saved.
scaler_dict_file (str or Path) – The file path where the scaler dictionary will be saved as a JSON file.
- bridgescaler.backend.scale_var_dict(var_dict, scalers, method, var_list=None, _key_path=())#
Recursively traverses a nested dict of tensor variables and applies a scaler method to each variable.
- Parameters:
var_dict (dict) – A nested dictionary whose leaves are torch.Tensor variables to be scaled.
scalers (object or dict) – A single scaler instance (for fit and fit_transform) or a nested dict of scalers matching the structure of var_dict (for transform and inverse_transform).
method (str) – The scaler method to apply. Must be one of fit, transform, inverse_transform, or fit_transform.
var_list (list of str, optional) – A list of leaf key names to apply the scaler method to. Keys not in var_list are skipped during fit, and left unchanged during transform, inverse_transform, and fit_transform. If None, all leaf keys are processed.
- Returns:
A nested dictionary with the same structure as var_dict, where each leaf is either a fitted scaler (for fit) or a transformed variable (for transform, inverse_transform, fit_transform). Keys named metadata and keys excluded by var_list are omitted for fit, and passed through unchanged for other methods.
- Return type:
dict
- Raises:
AssertionError – If var_dict is not a dict.
AssertionError – If method is not one of the valid methods.
AssertionError – If scalers is not a dict when using transform or inverse_transform.
AssertionError – If a key path in var_dict is missing in scalers.
AssertionError – If a scaler at a given key path does not have the requested method.
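The traversal rules described above (skip metadata and unlisted keys when fitting, pass them through when transforming) can be sketched in plain Python. This is an illustration only, not the scale_var_dict implementation: ToyMeanScaler, fit_leaves, and transform_leaves are hypothetical names, leaves are lists of floats instead of tensors, and fitting a deep copy of the scaler per leaf is an assumption.

```python
import copy


class ToyMeanScaler:
    """Hypothetical stand-in scaler that centers a list of floats."""

    def __init__(self):
        self.mean_ = None

    def fit(self, x):
        self.mean_ = sum(x) / len(x)
        return self

    def transform(self, x):
        return [v - self.mean_ for v in x]


def fit_leaves(var_dict, scaler, var_list=None):
    """Fit a copy of scaler on each leaf; omit 'metadata' and unlisted keys."""
    fitted = {}
    for key, value in var_dict.items():
        if key == "metadata":
            continue
        if isinstance(value, dict):
            fitted[key] = fit_leaves(value, scaler, var_list)
        elif var_list is None or key in var_list:
            fitted[key] = copy.deepcopy(scaler).fit(value)
    return fitted


def transform_leaves(var_dict, scalers, var_list=None):
    """Transform listed leaves; pass 'metadata' and unlisted keys through."""
    out = {}
    for key, value in var_dict.items():
        if key == "metadata":
            out[key] = value
        elif isinstance(value, dict):
            out[key] = transform_leaves(value, scalers[key], var_list)
        elif var_list is None or key in var_list:
            out[key] = scalers[key].transform(value)
        else:
            out[key] = value
    return out


toy_vars = {"era5": {"input": {"T": [1.0, 3.0]}, "metadata": {"step": 0}}}
toy_scalers = fit_leaves(toy_vars, ToyMeanScaler())
toy_scaled = transform_leaves(toy_vars, toy_scalers)
```

The fitted dictionary mirrors the variable dictionary minus the metadata branch, which is why transform needs a nested dict of scalers rather than a single instance.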
Example
>>> import torch
>>> from bridgescaler.distributed_tensor import DStandardScalerTensor
>>> from bridgescaler.backend import scale_var_dict
>>> T = torch.randn((20, 5, 4, 8))
>>> var_dict = {
...     "era5": {
...         "input": {"era5/prognostic/3d/T": T},
...         "target": {"era5/prognostic/3d/T": T},
...         "metadata": {"input_datetime": int, "target_datetime": int},
...     }
... }
>>> scalers = DStandardScalerTensor(channels_last=False)
>>> scaler_dict = scale_var_dict(var_dict, scalers, method="fit")
>>> transformed = scale_var_dict(var_dict, scaler_dict, method="transform")
>>> inverse_transformed = scale_var_dict(transformed, scaler_dict, method="inverse_transform")
>>> fitted_transformed = scale_var_dict(var_dict, scalers, method="fit_transform")
>>> # Only scale specific variables
>>> filtered = scale_var_dict(var_dict, scaler_dict, method="transform", var_list=["era5/prognostic/3d/T"])
bridgescaler.deep module#
- class bridgescaler.deep.DeepMinMaxScaler#
Bases: object
- fit(x)#
- fit_transform(x)#
- inverse_transform(x)#
- transform(x)#
- class bridgescaler.deep.DeepQuantileTransformer(n_quantiles=1000, stochastic=False)#
Bases: object
Performs a quantile transform on N-dimensional arrays where the variable dimension is the last one.
- n_quantiles#
number of quantiles to calculate and store
- stochastic#
When transforming to quantile space, whether to take the mean of the left and right interpolation values (False) or to pick a random point in between (True).
- fit(x)#
- fit_transform(x)#
- inverse_transform(x)#
- transform(x)#
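One way to read the stochastic option is sketched below for a single value and a sorted list of reference values. This is an assumption about the behavior, not the DeepQuantileTransformer code: to_quantile is a hypothetical helper, and the "left" and "right" values are taken to be the empirical quantiles given by the left and right insertion points, which differ only when the value ties with reference values.

```python
import random
from bisect import bisect_left, bisect_right


def to_quantile(value, sorted_ref, stochastic=False, rng=random):
    """Map value to [0, 1] using the left and right insertion points in sorted_ref."""
    n = len(sorted_ref)
    left = bisect_left(sorted_ref, value) / n
    right = bisect_right(sorted_ref, value) / n
    if stochastic:
        # Pick a random point between the two candidate quantiles.
        return rng.uniform(left, right)
    # Deterministic choice: midpoint of the two candidate quantiles.
    return (left + right) / 2


reference = [1, 2, 2, 2, 3]
q_mean = to_quantile(2, reference)
q_random = to_quantile(2, reference, stochastic=True, rng=random.Random(0))
```

With many tied values, the deterministic mode maps all of them to one quantile, while the stochastic mode spreads them across the interval they occupy.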
bridgescaler.distributed module#
- class bridgescaler.distributed.DBaseScaler(channels_last=True)#
Bases: object
Base distributed scaler class. Used only to store attributes and methods shared across all distributed scaler subclasses.
- add_variables(other)#
- static extract_array(x)#
- static extract_x_columns(x, channels_last=True)#
Extract the variable names to be transformed from x depending on whether x is a pandas DataFrame, an xarray DataArray, or a numpy array. All of these assume that the columns are in the last dimension. If x is an xarray DataArray, there should be a coordinate variable with the same name as the last dimension of the DataArray being transformed.
- Parameters:
x (Union[pandas.DataFrame, xarray.DataArray, numpy.ndarray]) – array of values to be transformed.
channels_last (bool) – If True, then assume the variable or channel dimension is the last dimension of the array. If False, then assume the variable or channel dimension is second.
- Returns:
xv (numpy.ndarray): Array of values to be transformed.
is_array (bool): Whether or not x was a np.ndarray.
- fit(x, weight=None)#
- fit_transform(x, channels_last=None, weight=None)#
- get_column_order(x_in_columns)#
Get the indices of the scaler columns that have the same name as the columns in the input x array. This enables users to pass a DataFrame or DataArray to transform or inverse_transform with fewer columns than the original scaler or columns in a different order and still have the input dataset be transformed properly.
- Parameters:
x_in_columns (Union[list, numpy.ndarray]) – list of input columns.
- Returns:
indices of the input columns from x in the scaler in order.
- Return type:
x_in_col_indices (np.ndarray)
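The column matching described for get_column_order can be pictured as a name-to-index lookup. The sketch below is a hypothetical pure-Python version (column_order is an invented name, and it returns a list rather than an np.ndarray); how the real method reports unknown columns is not stated in this page, so the KeyError is an assumption.

```python
def column_order(scaler_columns, x_in_columns):
    """Return the scaler-side index of each input column, in input order."""
    position = {name: i for i, name in enumerate(scaler_columns)}
    missing = [name for name in x_in_columns if name not in position]
    if missing:
        # Assumed behavior: refuse columns the scaler was never fit on.
        raise KeyError(f"columns not seen during fit: {missing}")
    return [position[name] for name in x_in_columns]


indices = column_order(["t", "u", "v"], ["v", "t"])
```

The returned indices can then select the matching per-column scaling parameters, which is what lets the input carry fewer columns or a different column order.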
- inverse_transform(x, channels_last=None)#
- is_fit()#
- static package_transformed_x(x_transformed, x)#
Repackage a transformed numpy array into the same datatype as the original x, including all metadata.
- Parameters:
x_transformed (numpy.ndarray) – array after being transformed or inverse transformed
x (Union[pandas.DataFrame, xarray.DataArray, numpy.ndarray])
Returns:
- process_x_for_transform(x, channels_last=None)#
- set_channel_dim(channels_last=None)#
- subset_columns(sel_columns)#
- transform(x, channels_last=None)#
- class bridgescaler.distributed.DMinMaxScaler(channels_last=True)#
Bases: DBaseScaler
Distributed MinMaxScaler enables calculating the min and max of variables in datasets in parallel, then combining the mins and maxes in a reduction step. The scaler supports numpy arrays, pandas dataframes, and xarray DataArrays and returns a transformed array in the same form as the original, with column or coordinate names preserved.
- fit(x, weight=None)#
- get_scales()#
- inverse_transform(x, channels_last=None)#
- transform(x, channels_last=None)#
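The reduction step mentioned in the class description can be sketched without the library: each chunk yields per-column mins and maxes, and two partial results combine by taking element-wise min and max. The helper names fit_min_max and combine are hypothetical, and rows are plain lists standing in for arrays.

```python
from functools import reduce


def fit_min_max(rows):
    """Return per-column (mins, maxes) for a list of equal-length rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def combine(a, b):
    """Merge two partial (mins, maxes) results element-wise."""
    mins = [min(x, y) for x, y in zip(a[0], b[0])]
    maxes = [max(x, y) for x, y in zip(a[1], b[1])]
    return mins, maxes


chunk_a = [[1.0, 10.0], [2.0, 30.0]]
chunk_b = [[-1.0, 20.0], [5.0, 25.0]]
mins, maxes = reduce(combine, [fit_min_max(chunk_a), fit_min_max(chunk_b)])

# Scale one chunk to [0, 1] using the combined statistics.
scaled = [[(v - lo) / (hi - lo) for v, lo, hi in zip(row, mins, maxes)] for row in chunk_a]
```

Because min and max are associative, the partial results can be merged in any order and give the same answer as fitting on the concatenated data.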
- class bridgescaler.distributed.DQuantileScaler(compression=250, distribution='uniform', min_val=1e-07, max_val=0.9999999, channels_last=True)#
Bases: DBaseScaler
Distributed Quantile Scaler that uses the crick TDigest Cython library to compute quantiles across multiple datasets in parallel. The library can perform fitting, transforms, and inverse transforms across variables in parallel using the multiprocessing library. Multidimensional arrays are stored in shared memory across processes to minimize inter-process communication.
DQuantileScaler supports
- compression#
Recommended number of centroids to use.
- distribution#
“uniform”, “normal”, or “logistic”.
- min_val#
Minimum value for quantile to prevent -inf results when distribution is normal or logistic.
- max_val#
Maximum value for quantile to prevent inf results when distribution is normal or logistic.
- channels_last#
Whether to assume the last dimension (True) or the second dimension (False) is the channel/variable dimension.
- attributes_to_td_objs()#
- fit(x, weight=None)#
- fit_transform(x, channels_last=None, weight=None, pool=None)#
- inverse_transform(x, channels_last=None, pool=None)#
- td_objs_to_attributes(td_objs)#
- transform(x, channels_last=None, pool=None)#
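The role of min_val and max_val can be illustrated with the standard library alone. Mathematically, quantiles of exactly 0 and 1 map to -inf and inf under the normal inverse CDF, so clipping the quantile first keeps the result finite. The function quantile_to_normal is a hypothetical helper, not part of bridgescaler, and it covers only the normal distribution option.

```python
from statistics import NormalDist


def quantile_to_normal(q, min_val=1e-07, max_val=0.9999999):
    """Clip a quantile into [min_val, max_val], then apply the normal inverse CDF."""
    clipped = min(max(q, min_val), max_val)
    return NormalDist().inv_cdf(clipped)


low = quantile_to_normal(0.0)
mid = quantile_to_normal(0.5)
high = quantile_to_normal(1.0)
```

The defaults used here are copied from the DQuantileScaler signature; tighter bounds would pull the extreme transformed values closer to zero.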
- class bridgescaler.distributed.DStandardScaler(channels_last=True)#
Bases: DBaseScaler
Distributed version of StandardScaler. You can calculate this map-reduce style by running it on individual data files, returning the fitted objects, and then summing them together to represent the full dataset. The scaler supports numpy arrays, pandas dataframes, and xarray DataArrays and returns a transformed array in the same form as the original, with column or coordinate names preserved.
- fit(x, weight=None)#
- get_scales()#
- inverse_transform(x, channels_last=None)#
- transform(x, channels_last=None)#
Transform the input data from its original form to standard scaled form. If your input data has a different dimension order than the data used to fit the scaler, use the channels_last keyword argument to specify whether the new data are channels_last (True) or channels_first (False).
- Parameters:
x – Input data.
channels_last – Override the default channels_last parameter of the scaler.
- Returns:
Transformed data in the same shape and type as x.
- Return type:
x_transformed
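The map-reduce idea in the class description, fitting on separate files and summing the fitted objects, rests on the fact that counts, means, and sums of squared deviations can be merged exactly. The sketch below shows that merge for a single variable using the standard pairwise update formula. PartialStats is a hypothetical class, not the DStandardScaler internals, and the use of the population standard deviation is an assumption.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass
class PartialStats:
    n: int
    mean: float
    m2: float  # sum of squared deviations from the mean

    @classmethod
    def from_values(cls, values):
        mean = sum(values) / len(values)
        return cls(len(values), mean, sum((v - mean) ** 2 for v in values))

    def __add__(self, other):
        # Pairwise merge of two partial results.
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta ** 2 * self.n * other.n / n
        return PartialStats(n, mean, m2)

    @property
    def std(self):
        return math.sqrt(self.m2 / self.n)


file_a = [1.0, 2.0, 3.0]
file_b = [10.0, 20.0]
combined = PartialStats.from_values(file_a) + PartialStats.from_values(file_b)
full_mean = statistics.fmean(file_a + file_b)
full_std = statistics.pstdev(file_a + file_b)
```

The merged mean and standard deviation match those computed on the concatenated data, which is what makes per-file fitting followed by a sum equivalent to a single pass over the full dataset.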
- bridgescaler.distributed.fit_variable(var_index, xv_shared=None, compression=None, channels_last=None)#
- bridgescaler.distributed.inv_transform_variable(td_obj, xv, distribution='normal')#
- bridgescaler.distributed.transform_variable(td_obj, xv, min_val=1e-06, max_val=0.9999999, distribution='normal')#
bridgescaler.distributed_tensor module#
bridgescaler.group module#
- class bridgescaler.group.GroupBaseScaler#
Bases: object
- extract_x_columns(x)#
Extract the variable names to be transformed from x depending on whether x is a pandas DataFrame, an xarray DataArray, or a numpy array. All of these assume that the columns are in the last dimension. If x is an xarray DataArray, there should be a coordinate variable with the same name as the last dimension of the DataArray being transformed.
- Parameters:
x (Union[pandas.DataFrame, xarray.DataArray, numpy.ndarray]) – array of values to be transformed.
- Returns:
Array of values to be transformed.
- Return type:
xv (numpy.ndarray)
- find_group(var_name)#
- fit(x, groups=None)#
- fit_transform(x, groups=None)#
- inverse_transform(x)#
- static package_transformed_x(x_transformed, x)#
Repackage a transformed numpy array into the same datatype as the original x, including all metadata.
- Parameters:
x_transformed (numpy.ndarray) – array after being transformed or inverse transformed
x (Union[pandas.DataFrame, xarray.DataArray, numpy.ndarray])
Returns:
- set_groups(x, groups)#
- transform(x)#
- class bridgescaler.group.GroupMinMaxScaler(feature_range=(0, 1))#
Bases: GroupBaseScaler
Group version of MinMaxScaler.
- class bridgescaler.group.GroupRobustScaler(quartile_range=(25.0, 75.0))#
Bases: GroupBaseScaler
Group version of RobustScaler.
- class bridgescaler.group.GroupStandardScaler#
Bases: GroupBaseScaler
Scaler that enables calculation and sharing of scaling parameters among multiple variables via variable groupings. This is useful for situations where variables are related, such as temperatures at different height levels.
Groups are specified as a list of column ids, which can be column names for pandas dataframes or column indices for numpy arrays.
For example:
`groups = [["a", "b"], ["c", "d"], "e"]`
“a” and “b” are a single group and all values of both will be included when calculating the mean and standard deviation for that group.
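To make the grouping concrete, the sketch below pools the values of every column in a group before computing one mean and standard deviation, then assigns that pair to each member column. It is a pure-Python illustration with a hypothetical fit_groups helper and a dict of lists in place of a DataFrame; the population standard deviation is an assumption.

```python
import math


def fit_groups(columns, groups):
    """Compute one (mean, std) per group and share it among the group's columns."""
    stats = {}
    for group in groups:
        # A bare column id forms a group of its own.
        members = list(group) if isinstance(group, (list, tuple)) else [group]
        pooled = [v for name in members for v in columns[name]]
        mean = sum(pooled) / len(pooled)
        std = math.sqrt(sum((v - mean) ** 2 for v in pooled) / len(pooled))
        for name in members:
            stats[name] = (mean, std)
    return stats


columns = {"a": [0.0, 2.0], "b": [4.0, 6.0], "e": [1.0, 3.0]}
stats = fit_groups(columns, [["a", "b"], "e"])
```

Columns "a" and "b" end up with identical parameters, so after scaling their relative offsets are preserved, which is the point of grouping related variables.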
Module contents#
- class bridgescaler.DMinMaxScaler(channels_last=True)#
Bases: DBaseScaler
Distributed MinMaxScaler enables calculating the min and max of variables in datasets in parallel, then combining the mins and maxes in a reduction step. The scaler supports numpy arrays, pandas dataframes, and xarray DataArrays and returns a transformed array in the same form as the original, with column or coordinate names preserved.
- fit(x, weight=None)#
- get_scales()#
- inverse_transform(x, channels_last=None)#
- transform(x, channels_last=None)#
- class bridgescaler.DQuantileScaler(compression=250, distribution='uniform', min_val=1e-07, max_val=0.9999999, channels_last=True)#
Bases: DBaseScaler
Distributed Quantile Scaler that uses the crick TDigest Cython library to compute quantiles across multiple datasets in parallel. The library can perform fitting, transforms, and inverse transforms across variables in parallel using the multiprocessing library. Multidimensional arrays are stored in shared memory across processes to minimize inter-process communication.
DQuantileScaler supports
- compression#
Recommended number of centroids to use.
- distribution#
“uniform”, “normal”, or “logistic”.
- min_val#
Minimum value for quantile to prevent -inf results when distribution is normal or logistic.
- max_val#
Maximum value for quantile to prevent inf results when distribution is normal or logistic.
- channels_last#
Whether to assume the last dimension (True) or the second dimension (False) is the channel/variable dimension.
- attributes_to_td_objs()#
- fit(x, weight=None)#
- fit_transform(x, channels_last=None, weight=None, pool=None)#
- inverse_transform(x, channels_last=None, pool=None)#
- td_objs_to_attributes(td_objs)#
- transform(x, channels_last=None, pool=None)#
- class bridgescaler.DStandardScaler(channels_last=True)#
Bases: DBaseScaler
Distributed version of StandardScaler. You can calculate this map-reduce style by running it on individual data files, returning the fitted objects, and then summing them together to represent the full dataset. The scaler supports numpy arrays, pandas dataframes, and xarray DataArrays and returns a transformed array in the same form as the original, with column or coordinate names preserved.
- fit(x, weight=None)#
- get_scales()#
- inverse_transform(x, channels_last=None)#
- transform(x, channels_last=None)#
Transform the input data from its original form to standard scaled form. If your input data has a different dimension order than the data used to fit the scaler, use the channels_last keyword argument to specify whether the new data are channels_last (True) or channels_first (False).
- Parameters:
x – Input data.
channels_last – Override the default channels_last parameter of the scaler.
- Returns:
Transformed data in the same shape and type as x.
- Return type:
x_transformed
- class bridgescaler.DeepMinMaxScaler#
Bases: object
- fit(x)#
- fit_transform(x)#
- inverse_transform(x)#
- transform(x)#
- class bridgescaler.DeepQuantileTransformer(n_quantiles=1000, stochastic=False)#
Bases: object
Performs a quantile transform on N-dimensional arrays where the variable dimension is the last one.
- n_quantiles#
number of quantiles to calculate and store
- stochastic#
When transforming to quantile space, whether to take the mean of the left and right interpolation values (False) or to pick a random point in between (True).
- fit(x)#
- fit_transform(x)#
- inverse_transform(x)#
- transform(x)#
- class bridgescaler.DeepStandardScaler#
Bases: object
Calculate standard scaler scores on a dataset with any number of dimensions, as long as the last dimension is the variable dimension.
- fit(x)#
- fit_transform(x)#
- inverse_transform(x)#
- transform(x)#
- class bridgescaler.GroupMinMaxScaler(feature_range=(0, 1))#
Bases: GroupBaseScaler
Group version of MinMaxScaler.
- class bridgescaler.GroupRobustScaler(quartile_range=(25.0, 75.0))#
Bases: GroupBaseScaler
Group version of RobustScaler.
- class bridgescaler.GroupStandardScaler#
Bases: GroupBaseScaler
Scaler that enables calculation and sharing of scaling parameters among multiple variables via variable groupings. This is useful for situations where variables are related, such as temperatures at different height levels.
Groups are specified as a list of column ids, which can be column names for pandas dataframes or column indices for numpy arrays.
For example:
`groups = [["a", "b"], ["c", "d"], "e"]`
“a” and “b” are a single group and all values of both will be included when calculating the mean and standard deviation for that group.
- bridgescaler.load_scaler(scaler_file)#
Initialize scikit-learn or bridgescaler scaler from saved json file.
- Parameters:
scaler_file – path to json file.
- Returns:
scaler object.
- bridgescaler.load_scaler_dict(scaler_dict_file)#
Loads and deserializes a nested dictionary of Bridgescaler scalers from a JSON file.
- Parameters:
scaler_dict_file (str or Path) – The file path to the JSON file containing the serialized scaler dictionary.
- Returns:
A nested dictionary of reconstructed scaler objects, with the same structure as the original dictionary passed to save_scaler_dict.
- Return type:
dict
- bridgescaler.print_scaler(scaler)#
Output scikit-learn or bridgescaler scaler object to json string.
- Parameters:
scaler – scikit-learn-style scaler object
- Returns:
str representation of object in json format
- bridgescaler.read_scaler(scaler_str)#
Initialize scikit-learn or bridgescaler scaler from json str.
- Parameters:
scaler_str – json str
- Returns:
scaler object.
- bridgescaler.save_scaler(scaler, scaler_file)#
Save a scikit-learn or bridgescaler scaler object to json format.
- Parameters:
scaler – scikit-learn-style scaler object
scaler_file – path to json file where scaler information is stored.
- bridgescaler.save_scaler_dict(scaler_dict, scaler_dict_file)#
Serializes and saves a nested dictionary of Bridgescaler scalers to a JSON file.
- Parameters:
scaler_dict (dict) – A nested dictionary of fitted Bridgescaler scaler objects to be saved.
scaler_dict_file (str or Path) – The file path where the scaler dictionary will be saved as a JSON file.
- bridgescaler.scale_var_dict(var_dict, scalers, method, var_list=None, _key_path=())#
Recursively traverses a nested dict of tensor variables and applies a scaler method to each variable.
- Parameters:
var_dict (dict) – A nested dictionary whose leaves are torch.Tensor variables to be scaled.
scalers (object or dict) – A single scaler instance (for fit and fit_transform) or a nested dict of scalers matching the structure of var_dict (for transform and inverse_transform).
method (str) – The scaler method to apply. Must be one of fit, transform, inverse_transform, or fit_transform.
var_list (list of str, optional) – A list of leaf key names to apply the scaler method to. Keys not in var_list are skipped during fit, and left unchanged during transform, inverse_transform, and fit_transform. If None, all leaf keys are processed.
- Returns:
A nested dictionary with the same structure as var_dict, where each leaf is either a fitted scaler (for fit) or a transformed variable (for transform, inverse_transform, fit_transform). Keys named metadata and keys excluded by var_list are omitted for fit, and passed through unchanged for other methods.
- Return type:
dict
- Raises:
AssertionError – If var_dict is not a dict.
AssertionError – If method is not one of the valid methods.
AssertionError – If scalers is not a dict when using transform or inverse_transform.
AssertionError – If a key path in var_dict is missing in scalers.
AssertionError – If a scaler at a given key path does not have the requested method.
Example
>>> import torch
>>> from bridgescaler.distributed_tensor import DStandardScalerTensor
>>> from bridgescaler.backend import scale_var_dict
>>> T = torch.randn((20, 5, 4, 8))
>>> var_dict = {
...     "era5": {
...         "input": {"era5/prognostic/3d/T": T},
...         "target": {"era5/prognostic/3d/T": T},
...         "metadata": {"input_datetime": int, "target_datetime": int},
...     }
... }
>>> scalers = DStandardScalerTensor(channels_last=False)
>>> scaler_dict = scale_var_dict(var_dict, scalers, method="fit")
>>> transformed = scale_var_dict(var_dict, scaler_dict, method="transform")
>>> inverse_transformed = scale_var_dict(transformed, scaler_dict, method="inverse_transform")
>>> fitted_transformed = scale_var_dict(var_dict, scalers, method="fit_transform")
>>> # Only scale specific variables
>>> filtered = scale_var_dict(var_dict, scaler_dict, method="transform", var_list=["era5/prognostic/3d/T"])