bridgescaler

bridgescaler#

Submodules#

Classes#

`GroupStandardScaler`	Scaler that enables calculation and sharing of scaling parameters among multiple variables via variable groupings.
`GroupRobustScaler`	Group version of RobustScaler.
`GroupMinMaxScaler`	Group version of MinMaxScaler.
`DeepStandardScaler`	Calculate standard scaler scores on an arbitrarily dimensional dataset as long as the last dimension is the variable dimension.
`DeepMinMaxScaler`
`DeepQuantileTransformer`	Performs a quantile transform on N-dimensional arrays where the variable dimension is the last one.
`DStandardScaler`	Distributed version of StandardScaler. You can calculate this map-reduce style by running it on individual data files, returning the fitted objects, and then summing them together to represent the full dataset.
`DMinMaxScaler`	Distributed MinMaxScaler enables calculation of the min and max of variables in datasets in parallel, then combining the mins and maxes as a reduction step.
`DQuantileScaler`	Distributed Quantile Scaler that uses the crick TDigest Cython library to compute quantiles across multiple datasets in parallel.

Functions#

`save_scaler`(scaler, scaler_file)	Save a scikit-learn or bridgescaler scaler object to json format.
`load_scaler`(scaler_file)	Initialize scikit-learn or bridgescaler scaler from saved json file.
`print_scaler`(scaler)	Output scikit-learn or bridgescaler scaler object to json string.
`read_scaler`(scaler_str)	Initialize scikit-learn or bridgescaler scaler from json str.
`save_scaler_dict`(scaler_dict, scaler_dict_file)	Serializes and saves a nested dictionary of Bridgescaler scalers to a JSON file.
`load_scaler_dict`(scaler_dict_file)	Loads and deserializes a nested dictionary of Bridgescaler scalers from a JSON file.
`scale_var_dict`(var_dict, scalers, method[, var_list, ...])	Recursively traverses a nested dict of tensor variables and applies a scaler method to each variable.

Package Contents#

bridgescaler.save_scaler(scaler, scaler_file)#

Save a scikit-learn or bridgescaler scaler object to json format.

Parameters:

scaler – scikit-learn-style scaler object
scaler_file – path to json file where scaler information is stored.

bridgescaler.load_scaler(scaler_file)#

Initialize scikit-learn or bridgescaler scaler from saved json file.

Parameters: scaler_file – path to json file.
Returns: scaler object.

bridgescaler.print_scaler(scaler)#

Output scikit-learn or bridgescaler scaler object to json string.

Parameters: scaler – scikit-learn-style scaler object.
Returns: str representation of object in json format.

bridgescaler.read_scaler(scaler_str)#

Initialize scikit-learn or bridgescaler scaler from json str.

Parameters: scaler_str – json str.
Returns: scaler object.
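
As a rough illustration of the round trip that print_scaler and read_scaler describe, the sketch below writes a fitted scaler's parameters to a JSON string and rebuilds an equivalent object from it. It uses only the standard library. `TinyStandardScaler`, `to_json_str`, and `from_json_str` are hypothetical names invented for this example; the JSON layout shown is an assumption and is not bridgescaler's actual format.

```python
import json
import statistics


class TinyStandardScaler:
    """Hypothetical stand-in for a fitted scaler (not a bridgescaler class)."""

    def __init__(self):
        self.mean_ = None
        self.sd_ = None

    def fit(self, values):
        self.mean_ = statistics.fmean(values)
        self.sd_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.sd_ for v in values]


def to_json_str(scaler):
    # Store the class name next to the fitted parameters so the reader
    # knows which kind of object to rebuild.
    return json.dumps({"type": type(scaler).__name__, **vars(scaler)})


def from_json_str(scaler_str):
    payload = json.loads(scaler_str)
    if payload.pop("type") != "TinyStandardScaler":
        raise ValueError("unsupported scaler type")
    scaler = TinyStandardScaler()
    vars(scaler).update(payload)  # restore the fitted attributes
    return scaler


data = [1.0, 2.0, 3.0, 4.0]
original = TinyStandardScaler().fit(data)
restored = from_json_str(to_json_str(original))
```

The restored object carries the same fitted attributes, so it transforms new data exactly as the original would without refitting.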

bridgescaler.save_scaler_dict(scaler_dict, scaler_dict_file)#

Serializes and saves a nested dictionary of Bridgescaler scalers to a JSON file.

Parameters:

scaler_dict (dict) – A nested dictionary of fitted Bridgescaler scaler objects to be saved.
scaler_dict_file (str or Path) – The file path where the scaler dictionary will be saved as a JSON file.

bridgescaler.load_scaler_dict(scaler_dict_file)#

Loads and deserializes a nested dictionary of Bridgescaler scalers from a JSON file.

Parameters:

scaler_dict_file (str or Path) – The file path to the JSON file containing the serialized scaler dictionary.

Returns:

A nested dictionary of reconstructed scaler objects, with the same structure as the original dictionary passed to save_scaler_dict.

Return type:

dict

bridgescaler.scale_var_dict(var_dict, scalers, method, var_list=None, _key_path=())#

Recursively traverses a nested dict of tensor variables and applies a scaler method to each variable.

Parameters:

var_dict (dict) – A nested dictionary whose leaves are the torch.Tensor variables to be scaled.
scalers (object or dict) – A single scaler instance (for fit and fit_transform) or a nested dict of scalers matching the structure of var_dict (for transform and inverse_transform).
method (str) – The scaler method to apply. Must be one of fit, transform, inverse_transform, or fit_transform.
var_list (list of str, optional) – A list of leaf key names to apply the scaler method to. Keys not in var_list are skipped during fit, and left unchanged during transform, inverse_transform, and fit_transform. If None, all leaf keys are processed.

Returns:

A nested dictionary with the same structure as var_dict, where each leaf is either a fitted scaler (for fit) or a transformed variable (for transform, inverse_transform, fit_transform). Keys named metadata and keys excluded by var_list are omitted for fit, and passed through unchanged for other methods.

Return type:

dict

Raises:

AssertionError – If var_dict is not a dict.
AssertionError – If method is not one of the valid methods.
AssertionError – If scalers is not a dict when using transform or inverse_transform.
AssertionError – If a key path in var_dict is missing in scalers.
AssertionError – If a scaler at a given key path does not have the requested method.

Example

>>> import torch
>>> from bridgescaler.distributed_tensor import DStandardScalerTensor
>>> from bridgescaler.backend import scale_var_dict
>>> T = torch.randn((20, 5, 4, 8))
>>> var_dict = {
...     "era5": {
...         "input": {"era5/prognostic/3d/T": T},
...         "target": {"era5/prognostic/3d/T": T},
...         "metadata": {"input_datetime": int, "target_datetime": int},
...     }
... }
>>> # A single scaler instance is passed for fit and fit_transform
>>> scalers = DStandardScalerTensor(channels_last=False)
>>> scaler_dict = scale_var_dict(var_dict, scalers, method="fit")
>>> # A nested dict of fitted scalers is passed for transform and inverse_transform
>>> transformed = scale_var_dict(var_dict, scaler_dict, method="transform")
>>> inverse_transformed = scale_var_dict(transformed, scaler_dict, method="inverse_transform")
>>> fitted_transformed = scale_var_dict(var_dict, scalers, method="fit_transform")
>>> # Only scale specific variables
>>> filtered = scale_var_dict(var_dict, scaler_dict, method="transform", var_list=["era5/prognostic/3d/T"])
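
The pass-through rules described above (metadata keys and leaves excluded by var_list are returned unchanged by the transforming methods) can be pictured with a small pure-Python sketch. The `walk` helper below is hypothetical and is not the library's implementation; it uses plain lists in place of tensors and a plain function in place of a scaler method.

```python
def walk(var_dict, fn, var_list=None):
    """Apply fn to each selected leaf of a nested dict; pass the rest through."""
    out = {}
    for key, value in var_dict.items():
        if key == "metadata":
            out[key] = value  # metadata is never scaled
        elif isinstance(value, dict):
            out[key] = walk(value, fn, var_list)  # recurse into sub-dicts
        elif var_list is None or key in var_list:
            out[key] = fn(value)  # selected leaf: apply the method
        else:
            out[key] = value  # excluded leaf: left unchanged
    return out


nested = {
    "era5": {
        "input": {"T": [1.0, 2.0], "Q": [3.0, 4.0]},
        "metadata": {"input_datetime": 0},
    }
}

# Halve only the "T" leaves; "Q" and the metadata pass through untouched.
halved = walk(nested, lambda xs: [x / 2 for x in xs], var_list=["T"])
```

The output keeps the same nesting as the input, which is what lets a dictionary of fitted scalers be matched against the data by key path.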

class bridgescaler.GroupStandardScaler#

Bases: GroupBaseScaler

Scaler that enables calculation and sharing of scaling parameters among multiple variables via variable groupings. This is useful for situations where variables are related, such as temperatures at different height levels.

Groups are specified as a list of column ids, which can be column names for pandas dataframes or column indices for numpy arrays.

For example: `groups = [["a", "b"], ["c", "d"], "e"]`. Here "a" and "b" form a single group, so all values of both columns are included when calculating the mean and standard deviation for that group. Likewise "c" and "d" share one set of parameters, while "e" is scaled on its own.
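
To make the grouping idea concrete, here is a minimal standard-library sketch that pools the values of every column in a group and shares one mean and standard deviation among them. The helper names `fit_groups` and `transform_columns` are hypothetical, and the use of the population standard deviation is an assumption made for this example rather than a statement about GroupStandardScaler's internals.

```python
import statistics

columns = {
    "a": [1.0, 2.0, 3.0],
    "b": [5.0, 6.0, 7.0],
    "c": [10.0, 20.0],
    "d": [30.0, 40.0],
    "e": [0.0, 4.0],
}
groups = [["a", "b"], ["c", "d"], "e"]


def fit_groups(columns, groups):
    """Return {column: (center, scale)} with parameters shared per group."""
    stats = {}
    for group in groups:
        members = [group] if isinstance(group, str) else list(group)
        # Pool every value from every column in the group.
        pooled = [v for name in members for v in columns[name]]
        center = statistics.fmean(pooled)
        scale = statistics.pstdev(pooled)  # assumption: population std
        for name in members:
            stats[name] = (center, scale)
    return stats


def transform_columns(columns, stats):
    return {
        name: [(v - stats[name][0]) / stats[name][1] for v in values]
        for name, values in columns.items()
    }


group_stats = fit_groups(columns, groups)
scaled = transform_columns(columns, group_stats)
```

Because "a" and "b" share parameters, the offset between the two columns survives scaling, which is the point of grouping related variables such as temperatures at different height levels.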

center_ = None#

scale_ = None#

_fit(x, groups=None)#

_transform_column(x_column, group_index)#

_inverse_transform_column(x_column, group_index)#

class bridgescaler.GroupRobustScaler(quartile_range=(25.0, 75.0))#

Bases: GroupBaseScaler

Group version of RobustScaler.

quartile_range = (25.0, 75.0)#

center_ = None#

scale_ = None#

_fit(x, groups)#

_transform_column(x_column, group_index)#

_inverse_transform_column(x_column, group_index)#

class bridgescaler.GroupMinMaxScaler(feature_range=(0, 1))#

Bases: GroupBaseScaler

Group version of MinMaxScaler.

feature_range = (0, 1)#

mins_ = None#

maxes_ = None#

_fit(x, groups)#

_transform_column(x_column, group_index)#

_inverse_transform_column(x_column, group_index)#

class bridgescaler.DeepStandardScaler#

Bases: object

Calculate standard scaler scores on an arbitrarily dimensional dataset as long as the last dimension is the variable dimension.

mean_ = None#

sd_ = None#

fit(x)#

transform(x)#

fit_transform(x)#

inverse_transform(x)#

class bridgescaler.DeepMinMaxScaler#

Bases: object

max_ = None#

min_ = None#

fit(x)#

transform(x)#

fit_transform(x)#

inverse_transform(x)#

class bridgescaler.DeepQuantileTransformer(n_quantiles=1000, stochastic=False)#

Bases: object

Performs a quantile transform on N-dimensional arrays where the variable dimension is the last one.

n_quantiles#: number of quantiles to calculate and store

stochastic#: When transforming to quantile space, whether to take the mean of the left and right interpolation values (False) or to pick a random point in between (True).
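
One way to read the stochastic option is sketched below, under the assumption that the left and right interpolation values differ only when the input coincides with a run of tied quantiles. The function names and the tie handling are hypothetical and written for illustration; they are not taken from the DeepQuantileTransformer source.

```python
import bisect
import random


def quantile_bounds(x, quantiles, references):
    """Return the (left, right) reference levels for x on ascending quantiles."""
    lo = bisect.bisect_left(quantiles, x)
    hi = bisect.bisect_right(quantiles, x)
    if lo < hi:
        # x equals one or more stored quantiles: the tie spans a range.
        return references[lo], references[hi - 1]
    if lo == 0:
        return references[0], references[0]
    if lo == len(quantiles):
        return references[-1], references[-1]
    # x lies strictly between two quantiles: ordinary linear interpolation.
    x0, x1 = quantiles[lo - 1], quantiles[lo]
    frac = (x - x0) / (x1 - x0)
    p = references[lo - 1] + frac * (references[lo] - references[lo - 1])
    return p, p


def to_quantile_space(x, quantiles, references, stochastic=False, rng=random):
    left, right = quantile_bounds(x, quantiles, references)
    if stochastic:
        return rng.uniform(left, right)  # random point between the bounds
    return 0.5 * (left + right)  # mean of the two bounds


quantiles = [0.0, 1.0, 1.0, 1.0, 2.0]  # the value 1.0 is heavily tied
references = [0.0, 0.25, 0.5, 0.75, 1.0]

mean_value = to_quantile_space(1.0, quantiles, references)
random_value = to_quantile_space(
    1.0, quantiles, references, stochastic=True, rng=random.Random(0)
)
between_value = to_quantile_space(0.5, quantiles, references)
```

With heavily tied data, the deterministic choice maps every tied input to a single level, while the stochastic choice spreads them across the tied range.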

n_quantiles = 1000#

stochastic = False#

quantiles_ = None#

references_ = None#

fitted_ = False#

x_column_names_ = None#

fit(x)#

transform(x)#

fit_transform(x)#

inverse_transform(x)#

_transform_col(x_col, col_index)#

_inverse_transform_col(x_col, col_index)#

class bridgescaler.DStandardScaler(channels_last=True)#

Bases: DBaseScaler

Distributed version of StandardScaler. You can calculate this map-reduce style by running it on individual data files, returning the fitted objects, and then summing them together to represent the full dataset. The scaler supports numpy arrays, pandas dataframes, and xarray DataArrays, and returns a transformed array in the same form as the original with column or coordinate names preserved.
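
The map-reduce pattern relies on the fact that counts, means, and variances from separate chunks can be merged exactly. The sketch below shows one standard way to do that, the pairwise update usually attributed to Chan et al., using a hypothetical `RunningStats` class with an `__add__` method. It illustrates the technique only; the field names and the population-variance convention are assumptions, not DStandardScaler's internals.

```python
import statistics
from dataclasses import dataclass


@dataclass
class RunningStats:
    n: int = 0
    mean: float = 0.0
    var: float = 0.0  # population variance of the values seen so far

    @classmethod
    def from_values(cls, values):
        # "Map" step: summarize one chunk independently of the others.
        return cls(len(values), statistics.fmean(values), statistics.pvariance(values))

    def __add__(self, other):
        # "Reduce" step: merge two summaries without revisiting the data.
        n = self.n + other.n
        if n == 0:
            return RunningStats()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        # Combine the sums of squared deviations of both chunks.
        m2 = self.var * self.n + other.var * other.n + delta**2 * self.n * other.n / n
        return RunningStats(n, mean, m2 / n)


file_a = [1.0, 2.0, 3.0, 4.0]
file_b = [10.0, 20.0]

combined = RunningStats.from_values(file_a) + RunningStats.from_values(file_b)
direct = RunningStats.from_values(file_a + file_b)
```

Here `combined` agrees with `direct` up to floating-point rounding, even though the two chunks were summarized separately.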

mean_x_ = None#

n_ = 0#

var_x_ = None#

fit(x, weight=None)#

transform(x, channels_last=None)#

Transform the input data from its original form to standard scaled form. If your input data has a different dimension order than the data used to fit the scaler, use the channels_last keyword argument to specify whether the new data are channels_last (True) or channels_first (False).

Parameters:

x – Input data.
channels_last – Override the default channels_last parameter of the scaler.

Returns:

x_transformed – Transformed data in the same shape and type as x.

inverse_transform(x, channels_last=None)#

get_scales()#

__add__(other)#

class bridgescaler.DMinMaxScaler(channels_last=True)#

Bases: DBaseScaler

Distributed MinMaxScaler enables calculation of the min and max of variables in datasets in parallel, then combines the mins and maxes as a reduction step. The scaler supports numpy arrays, pandas dataframes, and xarray DataArrays, and returns a transformed array in the same form as the original with column or coordinate names preserved.
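
The reduction step for minima and maxima is simple enough to sketch in a few lines. The helpers `fit_min_max` and `merge` below are hypothetical names for this illustration and operate on plain lists for a single variable; they are not part of the DMinMaxScaler API.

```python
from functools import reduce


def fit_min_max(values):
    # "Map" step: each chunk reports its own extremes.
    return (min(values), max(values))


def merge(a, b):
    # "Reduce" step: the global extremes are the extremes of the extremes.
    return (min(a[0], b[0]), max(a[1], b[1]))


chunks = [[3.0, 7.0, 5.0], [-1.0, 4.0], [6.0, 9.0]]
lo, hi = reduce(merge, map(fit_min_max, chunks))

# Scale the first chunk to [0, 1] using the global range.
scaled = [(v - lo) / (hi - lo) for v in chunks[0]]
```

Because the merge is associative and commutative, the chunks can be fitted in any order or on different workers.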

max_x_ = None#

min_x_ = None#

fit(x, weight=None)#

transform(x, channels_last=None)#

inverse_transform(x, channels_last=None)#

get_scales()#

__add__(other)#

class bridgescaler.DQuantileScaler(compression=250, distribution='uniform', min_val=1e-07, max_val=0.9999999, channels_last=True)#

Bases: DBaseScaler

Distributed Quantile Scaler that uses the crick TDigest Cython library to compute quantiles across multiple datasets in parallel. The library can perform fitting, transforms, and inverse transforms across variables in parallel using the multiprocessing library. Multidimensional arrays are stored in shared memory across processes to minimize inter-process communication.

DQuantileScaler supports

compression#: Recommended number of centroids to use.

distribution#: “uniform”, “normal”, or “logistic”.

min_val#: Minimum value for quantile to prevent -inf results when distribution is normal or logistic.

max_val#: Maximum value for quantile to prevent inf results when distribution is normal or logistic.

channels_last#: Whether to assume the last dimension (True) or the second dimension (False) is the channel/variable dimension.
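
The role of min_val and max_val can be seen from how an inverse normal CDF behaves: it diverges as the quantile approaches 0 or 1. The sketch below clips the quantile into [min_val, max_val] before applying the inverse CDF from the standard library. The function `to_normal` is hypothetical, and applying the bounds as a clip before the inverse CDF is an assumption based on the attribute descriptions above, not a copy of DQuantileScaler's code.

```python
from statistics import NormalDist

MIN_VAL = 1e-07
MAX_VAL = 0.9999999


def to_normal(q, min_val=MIN_VAL, max_val=MAX_VAL):
    """Map a quantile in [0, 1] to a finite standard-normal score."""
    clipped = min(max(q, min_val), max_val)  # keep q away from 0 and 1
    return NormalDist().inv_cdf(clipped)


z_low = to_normal(0.0)   # finite, instead of diverging toward -inf
z_mid = to_normal(0.5)   # the median maps to the center of the distribution
z_high = to_normal(1.0)  # finite, instead of diverging toward +inf
```

With the default bounds, the extreme quantiles land at roughly plus or minus 5.2 standard deviations rather than at infinity.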

compression = 250#

distribution = 'uniform'#

min_val = 1e-07#

max_val = 0.9999999#

centroids_ = None#

size_ = None#

min_ = None#

max_ = None#

td_objs_to_attributes(td_objs)#

attributes_to_td_objs()#

fit(x, weight=None)#

transform(x, channels_last=None, pool=None)#

fit_transform(x, channels_last=None, weight=None, pool=None)#

inverse_transform(x, channels_last=None, pool=None)#

__add__(other)#