bridgescaler.distributed_tensor#

Attributes#

CENTROID_DTYPE

Classes#

`DBaseScalerTensor`	Base distributed scaler class for torch.Tensor. Used only to store attributes and methods shared across all distributed scaler subclasses.
`DStandardScalerTensor`	Distributed version of StandardScaler. You can calculate this map-reduce style by running it on individual data files, returning the fitted objects, and then summing them together to represent the full dataset.
`DMinMaxScalerTensor`	Distributed MinMaxScaler enables calculation of min and max of variables in datasets in parallel, then combining the mins and maxes as a reduction step.
`DQuantileScalerTensor`	Distributed Quantile Scaler for tensors that uses the crick TDigest Cython library to compute quantiles across multiple datasets in parallel.

Functions#

`fit_variable_tensor`(var_index, xv[, compression, ...])
`transform_variable_tensor`(cent_mean, cent_weight, ...)
`inv_transform_variable_tensor`(cent_mean, cent_weight, ...)
`tdigest_cdf_tensor`(xv, cent_mean, cent_weight, t_min, ...)
`tdigest_quantile_tensor`(qv, cent_mean, cent_weight, ...)

Module Contents#

bridgescaler.distributed_tensor.CENTROID_DTYPE#

class bridgescaler.distributed_tensor.DBaseScalerTensor(channels_last=True)#

Base distributed scaler class for torch.Tensor. Used only to store attributes and methods shared across all distributed scaler subclasses.

x_columns_ = None#

_fit = False#

channels_last = True#

is_fit()#

extract_x_columns(x, channels_last=True)#

Extract the variable names from input x.

The variable names are expected to be stored in the variable_names attribute of the torch.Tensor. If the attribute is missing, a warning is issued to notify the user that alignment validation will be limited.

Parameters:

x (torch.Tensor) – The input tensor containing data and optionally the variable_names attribute.
channels_last (bool) – If True, then assume the variable or channel dimension is the last dimension of the array. If False, then assume the variable or channel dimension is second.

Returns:

Variable names if available; otherwise, integer indices generated based on the length of the variable/channel dimension.

Return type:

x_columns (list[str] | list[int])

Raises:

TypeError – If x is not a torch.Tensor or if variable_names is not a list.
ValueError – If variable_names contains duplicate entries.
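
The lookup rules described above can be sketched in plain Python. This is only an illustration of the documented behavior, not the library's code: `extract_columns_sketch` is a hypothetical helper that takes the value of the `variable_names` attribute (or `None` when it is missing) and the tensor shape as a tuple, rather than a real torch.Tensor.

```python
import warnings


def extract_columns_sketch(variable_names, shape, channels_last=True):
    """Hypothetical stand-in for the documented lookup (not the library code).

    variable_names: value of the tensor's variable_names attribute, or None.
    shape: the tensor's shape as a tuple.
    """
    # The channel dimension is last if channels_last, otherwise second.
    channel_dim = -1 if channels_last else 1
    n_channels = shape[channel_dim]
    if variable_names is None:
        # Without names, alignment can only be checked by position.
        warnings.warn("variable_names is missing; using integer indices.")
        return list(range(n_channels))
    if not isinstance(variable_names, list):
        raise TypeError("variable_names must be a list")
    if len(set(variable_names)) != len(variable_names):
        raise ValueError("variable_names contains duplicate entries")
    return variable_names


# Named variables on a channels-last tensor of shape (batch, y, x, channel).
names = extract_columns_sketch(["t2m", "u10", "v10"], (8, 32, 32, 3))

# No names on a channels-first tensor: fall back to indices 0..C-1.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fallback = extract_columns_sketch(None, (8, 3, 32, 32), channels_last=False)
```

The variable names `t2m`, `u10`, and `v10` are made up for the example.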

static extract_array(x)#

get_column_order(x_in_columns)#

Get the indices of the scaler columns that have the same name as the variables (columns) in the input x tensor. This enables users to pass a torch.Tensor to transform or inverse_transform with fewer variables than the original scaler or variables in a different order and still have the input dataset be transformed properly.

Parameters:

x_in_columns (list) – List of input variable names.

Returns:

Integer indices of the input variables from x in the scaler, in order.

Return type:

x_in_col_indices (list)
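
As a rough illustration of the matching described above, the following sketch maps input variable names to their positions in the fitted scaler. `column_order_sketch` is a hypothetical helper, and raising `KeyError` for an unknown variable is an assumption of this example; the library may handle that case differently.

```python
def column_order_sketch(scaler_columns, x_in_columns):
    """Hypothetical name-to-index lookup (not the library code)."""
    # Position of every variable the scaler was fit on.
    position = {name: i for i, name in enumerate(scaler_columns)}
    missing = [name for name in x_in_columns if name not in position]
    if missing:
        # Assumption: unknown variables are treated as an error.
        raise KeyError(f"variables not in scaler: {missing}")
    # Indices are returned in the order of the input, not the scaler.
    return [position[name] for name in x_in_columns]


scaler_columns = ["temperature", "pressure", "humidity"]

# The input has fewer variables, in a different order.
order = column_order_sketch(scaler_columns, ["humidity", "temperature"])
```

Here `order` selects the scaler statistics for `humidity` and `temperature`, in that order, so a subset or reordered input is still scaled with the matching statistics.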

static package_transformed_x(x_transformed, x)#

Repackage a transformed torch.Tensor into the same datatype as the original x, including all metadata.

Parameters:

x_transformed (torch.Tensor) – Tensor after being transformed or inverse transformed.
x (torch.Tensor) – Original data.

Returns:

The transformed tensor, repackaged in the same datatype as x with its metadata.

set_channel_dim(channels_last=None)#

process_x_for_transform(x, channels_last=None)#

fit(x, weight=None)#

transform(x, channels_last=None)#

fit_transform(x, channels_last=None, weight=None)#

inverse_transform(x, channels_last=None)#

__add__(other)#

subset_columns(sel_columns)#

add_variables(other)#

static reshape_to_channels_first(stat, target)#

Reshapes stat to align with the channel dimension (index 1).

static reshape_to_channels_last(stat, target)#

Reshapes stat to align with the last dimension.
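
One way to picture these two helpers is as choosing a broadcast-compatible shape for a per-channel statistic. The sketch below only computes the shapes and assumes that `stat` holds one value per channel and that `target` has a leading batch dimension; both function names are hypothetical and the library's actual reshaping may differ.

```python
def channels_first_shape(n_channels, target_ndim):
    """Shape (1, C, 1, ..., 1): channels sit at index 1."""
    return (1, n_channels) + (1,) * (target_ndim - 2)


def channels_last_shape(n_channels, target_ndim):
    """Shape (1, ..., 1, C): channels sit in the last dimension."""
    return (1,) * (target_ndim - 1) + (n_channels,)


# A statistic with 3 channels aligned against a 4-dimensional target.
first = channels_first_shape(3, 4)
last = channels_last_shape(3, 4)
```

With these shapes, a statistic such as a per-channel mean broadcasts against the full target in element-wise arithmetic.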

class bridgescaler.distributed_tensor.DStandardScalerTensor(channels_last=True)#

Bases: DBaseScalerTensor

Distributed version of StandardScaler. You can calculate this map-reduce style by running it on individual data files, returning the fitted objects, and then summing them together to represent the full dataset. Scaler supports torch.Tensor and returns a transformed tensor.
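
The map-reduce idea can be illustrated with plain Python. The sketch below summarizes each file as a count, mean, and variance, then merges two summaries with the standard pairwise update for pooled mean and variance. The helper names and the use of population variance are assumptions of this example, not a description of the library's internals.

```python
from statistics import fmean, pvariance


def summarize(values):
    """Map step: per-file count, mean, and population variance."""
    return len(values), fmean(values), pvariance(values)


def combine(a, b):
    """Reduce step: merge two summaries without revisiting the data."""
    n_a, mean_a, var_a = a
    n_b, mean_b, var_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # Sum of squared deviations about the merged mean.
    m2 = var_a * n_a + var_b * n_b + delta**2 * n_a * n_b / n
    return n, mean, m2 / n


file_1 = [1.0, 2.0, 3.0, 4.0]
file_2 = [10.0, 12.0]

# Fit each file independently, then merge the fitted summaries.
merged = combine(summarize(file_1), summarize(file_2))

# Reference: statistics computed over the concatenated data.
full = summarize(file_1 + file_2)
```

`merged` and `full` agree up to floating-point rounding, which is the property that lets fitted scalers be summed to represent the full dataset.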

mean_x_ = None#

n_ = 0#

var_x_ = None#

fit(x, weight=None)#

transform(x, channels_last=None)#

Transform the input data from its original form to standard scaled form. If your input data has a different dimension order than the data used to fit the scaler, use the channels_last keyword argument to specify whether the new data are channels_last (True) or channels_first (False).

Parameters:

x (torch.Tensor) – Input data.
channels_last (bool, optional) – Override the default channels_last parameter of the scaler.

Returns:

Transformed data in the same shape and type as x.

Return type:

x_transformed (torch.Tensor)

inverse_transform(x, channels_last=None)#

get_scales(x_col_order=slice(None))#

__add__(other)#

class bridgescaler.distributed_tensor.DMinMaxScalerTensor(channels_last=True)#

Bases: DBaseScalerTensor

Distributed MinMaxScaler enables calculation of min and max of variables in datasets in parallel, then combining the mins and maxes as a reduction step. Scaler supports torch.Tensor and will return a transformed tensor in the same form as the original with variable/column names preserved.
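
A minimal sketch of the reduction step, using lists of rows in place of tensors. The helper names are hypothetical, and the scaling formula shown is the usual min-max mapping to [0, 1], assumed here for illustration.

```python
def fit_min_max(rows):
    """Map step: per-column min and max for one chunk of data."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def combine_min_max(a, b):
    """Reduce step: element-wise min of mins and max of maxes."""
    (min_a, max_a), (min_b, max_b) = a, b
    mins = [min(p, q) for p, q in zip(min_a, min_b)]
    maxs = [max(p, q) for p, q in zip(max_a, max_b)]
    return mins, maxs


def scale(rows, mins, maxs):
    """Apply (x - min) / (max - min) column by column."""
    return [
        [(v - lo) / (hi - lo) for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


chunk_1 = [[1.0, 50.0], [3.0, 20.0]]
chunk_2 = [[-2.0, 40.0], [0.0, 60.0]]

# Fit each chunk in parallel, then reduce to global extremes.
mins, maxs = combine_min_max(fit_min_max(chunk_1), fit_min_max(chunk_2))
scaled = scale(chunk_1, mins, maxs)
```

Because min and max are associative, the chunks can be merged in any order and give the same global range.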

max_x_ = None#

min_x_ = None#

fit(x, weight=None)#

transform(x, channels_last=None)#

inverse_transform(x, channels_last=None)#

get_scales(x_col_order=slice(None))#

__add__(other)#

bridgescaler.distributed_tensor.fit_variable_tensor(var_index, xv, compression=None, channels_last=None)#

bridgescaler.distributed_tensor.transform_variable_tensor(cent_mean, cent_weight, t_min, t_max, xv, min_val=1e-06, max_val=0.9999999, distribution='normal')#

bridgescaler.distributed_tensor.inv_transform_variable_tensor(cent_mean, cent_weight, t_min, t_max, xv, distribution='normal')#

bridgescaler.distributed_tensor.tdigest_cdf_tensor(xv, cent_mean, cent_weight, t_min, t_max)#

bridgescaler.distributed_tensor.tdigest_quantile_tensor(qv, cent_mean, cent_weight, t_min, t_max)#

class bridgescaler.distributed_tensor.DQuantileScalerTensor(compression=250, distribution='uniform', min_val=1e-07, max_val=0.9999999, channels_last=True)#

Bases: DBaseScalerTensor

Distributed Quantile Scaler for tensors that uses the crick TDigest Cython library to compute quantiles across multiple datasets in parallel. The library can perform fitting, transforms, and inverse transforms.

DQuantileScaler supports

compression#

Recommended number of centroids to use.

distribution#

“uniform”, “normal”, or “logistic”.

min_val#

Minimum value for quantile to prevent -inf results when distribution is normal or logistic.

max_val#

Maximum value for quantile to prevent inf results when distribution is normal or logistic.

channels_last#

Whether to assume the last dimension or the second dimension is the channel/variable dimension.
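
The role of `min_val` and `max_val` can be seen with the standard library's normal distribution. The sketch below assumes the "normal" option maps a quantile in [0, 1] through the inverse normal CDF; `to_normal_sketch` is a hypothetical helper, not the library's transform.

```python
from statistics import NormalDist


def to_normal_sketch(quantile, min_val=1e-07, max_val=0.9999999):
    """Clip a quantile into [min_val, max_val], then map it to a z-score."""
    # Quantiles of exactly 0 or 1 have no finite inverse normal CDF,
    # so they are clipped before the mapping.
    clipped = min(max(quantile, min_val), max_val)
    return NormalDist().inv_cdf(clipped)


low = to_normal_sketch(0.0)   # smallest observed value
mid = to_normal_sketch(0.5)   # median
high = to_normal_sketch(1.0)  # largest observed value
```

With the defaults shown, the extremes map to finite values a little beyond five standard deviations from zero instead of -inf and inf.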

compression = 250#

distribution = 'uniform'#

min_val = 1e-07#

max_val = 0.9999999#

centroids_ = None#

size_ = None#

min_ = None#

max_ = None#

centroids_mean_tensor = None#

centroids_weight_tensor = None#

min_tensor = None#

max_tensor = None#

td_objs_to_attributes(td_objs)#

attributes_to_td_objs()#

tensorize_attributes()#

fit(x, weight=None)#

transform(x, channels_last=None)#

fit_transform(x, channels_last=None, weight=None)#

inverse_transform(x, channels_last=None)#

__add__(other)#