Data
A collection of functions for loading, preprocessing, and preparing tabular data for machine learning tasks, including handling missing values, encoding features, and creating data loaders.
Functions
def dataname_to_numpy(dataset_name, dataset_path) -> Tuple[ArrayDict, ArrayDict, ArrayDict, Dict[str, Any]]
Loads tabular data from numpy files, including numerical features (N), categorical features (C), labels (y), and dataset metadata.
Parameters:
dataset_name (str) - Name of the dataset.
dataset_path (str) - Path to the dataset directory.
Returns:
Tuple containing:
- Numerical features (N) as a dictionary with keys ‘train’, ‘val’, ‘test’ (or None if unavailable).
- Categorical features (C) as a dictionary with keys ‘train’, ‘val’, ‘test’ (or None if unavailable).
- Labels (y) as a dictionary with keys ‘train’, ‘val’, ‘test’.
- Dataset metadata from ‘info.json’.
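The loading step described above can be illustrated with a minimal sketch. The file naming scheme (`N_train.npy`, `C_val.npy`, `y_test.npy`), the `<dataset_path>/<dataset_name>` folder layout, the `task_type` metadata key, and the helper names `load_split_arrays` and `load_dataset_sketch` are all assumptions made for illustration; only ‘info.json’ and the split keys come from the description.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional, Tuple

import numpy as np

ArrayDict = Dict[str, np.ndarray]
SPLITS = ("train", "val", "test")


def load_split_arrays(folder: Path, prefix: str) -> Optional[ArrayDict]:
    """Load `<prefix>_<split>.npy` for every split, or return None if absent."""
    # Assumed naming scheme: N_train.npy, C_val.npy, y_test.npy, ...
    if not (folder / f"{prefix}_train.npy").exists():
        return None
    return {
        split: np.load(folder / f"{prefix}_{split}.npy", allow_pickle=True)
        for split in SPLITS
    }


def load_dataset_sketch(
    dataset_name: str, dataset_path: str
) -> Tuple[Optional[ArrayDict], Optional[ArrayDict], ArrayDict, Dict[str, Any]]:
    """Hypothetical stand-in for the loader: returns (N, C, y, info)."""
    folder = Path(dataset_path) / dataset_name  # assumed directory layout
    N = load_split_arrays(folder, "N")
    C = load_split_arrays(folder, "C")
    y = load_split_arrays(folder, "y")
    if y is None:
        raise FileNotFoundError(f"no label files found in {folder}")
    info = json.loads((folder / "info.json").read_text())
    return N, C, y, info


# Build a tiny throwaway dataset (numerical features only) and load it back.
with tempfile.TemporaryDirectory() as root:
    folder = Path(root) / "toy"
    folder.mkdir()
    for split, n_rows in zip(SPLITS, (4, 2, 2)):
        np.save(folder / f"N_{split}.npy", np.zeros((n_rows, 3)))
        np.save(folder / f"y_{split}.npy", np.zeros(n_rows, dtype=np.int64))
    (folder / "info.json").write_text(json.dumps({"task_type": "binclass"}))
    N, C, y, info = load_dataset_sketch("toy", root)
```

Because the toy dataset has no categorical files, `C` comes back as None, which mirrors the "or None if unavailable" behavior described above.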
def get_dataset(dataset_name, dataset_path) -> Tuple[Tuple[ArrayDict, ArrayDict, ArrayDict], Tuple[ArrayDict, ArrayDict, ArrayDict], Dict[str, Any]]
Loads the dataset and separates it into a training/validation part and a test part.
Parameters:
dataset_name (str) - Name of the dataset.
dataset_path (str) - Path to the dataset directory.
Returns:
Tuple containing:
- Training/validation data (numerical, categorical, labels).
- Test data (numerical, categorical, labels).
- Dataset metadata.
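One way to picture the returned structure is the sketch below, which separates already-loaded split dictionaries into a train/val group and a test group. The helper names `select_splits` and `split_train_val_test` are hypothetical, and the real function may organize the split differently.

```python
from typing import Dict, Optional, Tuple

import numpy as np

ArrayDict = Dict[str, np.ndarray]


def select_splits(
    data: Optional[ArrayDict], keys: Tuple[str, ...]
) -> Optional[ArrayDict]:
    """Keep only the requested splits; pass None through for absent feature types."""
    return None if data is None else {key: data[key] for key in keys}


def split_train_val_test(N, C, y):
    """Return ((N, C, y) for train/val, (N, C, y) for test)."""
    train_val = tuple(select_splits(d, ("train", "val")) for d in (N, C, y))
    test = tuple(select_splits(d, ("test",)) for d in (N, C, y))
    return train_val, test


N = {"train": np.ones((4, 2)), "val": np.ones((2, 2)), "test": np.ones((3, 2))}
C = None  # this toy dataset has no categorical features
y = {"train": np.zeros(4), "val": np.zeros(2), "test": np.zeros(3)}

(N_trainval, C_trainval, y_trainval), (N_test, C_test, y_test) = split_train_val_test(N, C, y)
```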
def data_nan_process(N_data, C_data, num_nan_policy, cat_nan_policy, num_new_value=None, imputer=None, cat_new_value=None) -> Tuple[ArrayDict, ArrayDict, Optional[np.ndarray], Optional[SimpleImputer], Optional[str]]
Processes missing values in numerical and categorical features.
Parameters:
N_data (ArrayDict) - Numerical features (may contain NaNs).
C_data (ArrayDict) - Categorical features (may contain NaNs).
num_nan_policy (str) - Strategy for numerical NaNs (‘mean’ or ‘median’).
cat_nan_policy (str) - Strategy for categorical NaNs (‘new’ or ‘most_frequent’).
num_new_value (Optional[np.ndarray]) - Precomputed values to fill numerical NaNs.
imputer (Optional[SimpleImputer]) - Pre-fitted imputer for categorical features.
cat_new_value (Optional[str]) - Value to fill categorical NaNs (for ‘new’ policy).
Returns:
Tuple containing:
- Processed numerical features.
- Processed categorical features.
- Values used to fill numerical NaNs.
- Fitted imputer for categorical features (if used).
- Value used to fill categorical NaNs (if used).
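The following sketch illustrates the general idea behind the ‘mean’/‘median’ numerical policies and the ‘new’ categorical policy: fill statistics are computed on the training split only and then reused for every split. It is written with plain NumPy rather than the library's own code; the helper names and the placeholder token `"___null___"` are assumptions.

```python
from typing import Dict, Optional, Tuple

import numpy as np

ArrayDict = Dict[str, np.ndarray]


def impute_numerical(
    N_data: ArrayDict, policy: str = "mean", fill_values: Optional[np.ndarray] = None
) -> Tuple[ArrayDict, np.ndarray]:
    """Fill NaNs column-wise with statistics computed on the training split."""
    if fill_values is None:
        if policy == "mean":
            fill_values = np.nanmean(N_data["train"], axis=0)
        elif policy == "median":
            fill_values = np.nanmedian(N_data["train"], axis=0)
        else:
            raise ValueError(f"unknown numerical NaN policy: {policy}")
    filled = {}
    for split, values in N_data.items():
        values = values.astype(float, copy=True)
        rows, cols = np.where(np.isnan(values))
        values[rows, cols] = fill_values[cols]  # per-column fill value
        filled[split] = values
    return filled, fill_values


def impute_categorical_new(C_data: ArrayDict, new_value: str = "___null___") -> ArrayDict:
    """'new' policy: treat missing entries as a category of their own."""
    filled = {}
    for split, values in C_data.items():
        values = values.astype(object, copy=True)
        # A value is missing if it is None or NaN (NaN is the only value != itself).
        missing = np.array(
            [v is None or v != v for v in values.ravel()], dtype=bool
        ).reshape(values.shape)
        values[missing] = new_value
        filled[split] = values
    return filled


N = {"train": np.array([[1.0, np.nan], [3.0, 4.0]]), "test": np.array([[np.nan, 6.0]])}
C = {
    "train": np.array([["a"], [None]], dtype=object),
    "test": np.array([[np.nan]], dtype=object),
}

N_filled, fill_values = impute_numerical(N, policy="mean")
C_filled = impute_categorical_new(C)
```

Passing the returned `fill_values` back in when processing a later split keeps test-time imputation consistent with training, which is presumably why the function accepts precomputed values.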
def num_enc_process(N_data, num_policy, n_bins=2, y_train=None, is_regression=False, encoder=None) -> Tuple[ArrayDict, Optional[Union[PiecewiseLinearEncoding, UnaryEncoding, BinsEncoding, JohnsonEncoding]]]
Encodes numerical features using various strategies (e.g., piecewise linear, unary, bins).
Parameters:
N_data (ArrayDict) - Numerical features to encode.
num_policy (str) - Encoding strategy (e.g., ‘Q_PLE’ for quantile-based piecewise linear encoding).
n_bins (int, optional, Default is 2) - Number of bins for discretization.
y_train (Optional[np.ndarray]) - Training labels (for target-based encoding).
is_regression (bool, optional, Default is False) - Whether the task is regression.
encoder (Optional) - Pre-fitted encoder (if None, a new one is fitted).
Returns:
Tuple containing:
- Encoded numerical features.
- Fitted encoder.
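To make the idea of quantile-based piecewise linear encoding concrete, here is a simplified NumPy sketch for a single feature column. It is not the library's `PiecewiseLinearEncoding` class: the function names are hypothetical, and out-of-range values are clipped to [0, 1], which is a simplifying assumption.

```python
import numpy as np


def fit_quantile_edges(train_column: np.ndarray, n_bins: int) -> np.ndarray:
    """Compute bin edges from training-set quantiles, dropping duplicate edges."""
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)
    edges = np.unique(np.quantile(train_column, quantiles))
    if len(edges) < 2:
        raise ValueError("constant feature: cannot build bins")
    return edges


def piecewise_linear_encode(column: np.ndarray, edges: np.ndarray) -> np.ndarray:
    """Encode each value as its fractional position inside every bin.

    Bins entirely below the value become 1, bins entirely above become 0,
    and the bin containing the value gets a fraction in between.
    """
    left, right = edges[:-1], edges[1:]
    ratios = (column[:, None] - left[None, :]) / (right - left)[None, :]
    return np.clip(ratios, 0.0, 1.0)


train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
edges = fit_quantile_edges(train, n_bins=2)  # edges at the 0%, 50%, 100% quantiles
encoded = piecewise_linear_encode(np.array([1.0, 3.0, 5.0]), edges)
```

With edges at 0, 2, and 4, the value 1.0 sits halfway through the first bin and 3.0 halfway through the second, so each input becomes a two-element vector, one entry per bin.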
def data_enc_process(N_data, C_data, cat_policy, y_train=None, ord_encoder=None, mode_values=None, cat_encoder=None) -> Tuple[ArrayDict, ArrayDict, Optional[OrdinalEncoder], Optional[List[int]], Optional[OneHotEncoder]]
Encodes categorical features using various strategies (e.g., one-hot, target encoding) and handles unknown categories.
Parameters:
N_data (ArrayDict) - Numerical data (or None).
C_data (ArrayDict) - Categorical data (or None).
cat_policy (str) - Encoding strategy:
- indices: Return ordinal indices without further encoding.
- ordinal: Use ordinal encoding.
- ohe/tabr_ohe: One-hot encoding (with tabr_ohe for TabR compatibility).
- binary: Binary encoding (from category_encoders).
- hash: Hashing encoding (from category_encoders).
- loo: Leave-one-out encoding (supervised, from category_encoders).
- target: Target encoding (supervised, from category_encoders).
- catboost: CatBoost encoding (supervised, from category_encoders).
y_train (Optional[np.ndarray]) - Training labels (for supervised encodings).
ord_encoder (Optional[OrdinalEncoder]) - Pre-fitted ordinal encoder.
mode_values (Optional[List[int]]) - Mode values for replacing unknown categories in validation/test sets.
cat_encoder (Optional) - Pre-fitted categorical encoder (e.g., OneHotEncoder).
Returns:
Tuple containing:
- Processed numerical data (merged with encoded categoricals if applicable).
- Remaining categorical data, or None if the categoricals were merged into the numerical data.
- Fitted ordinal encoder.
- Mode values for unknown categories.
- Fitted categorical encoder.
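The interplay between the ordinal encoder, the mode values, and a subsequent one-hot step can be sketched for a single categorical column as follows. This is a plain NumPy illustration, not the library's code: the helper names are hypothetical, and marking unknown categories with -1 before replacing them with the training mode is an assumption about how such a pipeline is typically built.

```python
from typing import Dict

import numpy as np


def fit_ordinal_mapping(train_column: np.ndarray) -> Dict[str, int]:
    """Map each category seen in training to an integer index (sorted order)."""
    categories = sorted(set(train_column.tolist()))
    return {category: index for index, category in enumerate(categories)}


def ordinal_encode(
    column: np.ndarray, mapping: Dict[str, int], unknown_code: int = -1
) -> np.ndarray:
    """Encode a column; categories never seen in training get `unknown_code`."""
    return np.array(
        [mapping.get(value, unknown_code) for value in column.tolist()],
        dtype=np.int64,
    )


train = np.array(["red", "blue", "red", "green"])
test = np.array(["green", "purple"])  # "purple" never appears in training

mapping = fit_ordinal_mapping(train)
train_codes = ordinal_encode(train, mapping)

# Most frequent training index, used as the replacement for unknown categories.
mode_code = int(np.bincount(train_codes).argmax())

test_codes = ordinal_encode(test, mapping)
test_codes[test_codes == -1] = mode_code

# One-hot encode the cleaned indices (one column per training category).
test_onehot = np.eye(len(mapping))[test_codes]
```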
def data_norm_process(N_data, normalization, seed, normalizer=None) -> Tuple[ArrayDict, Optional[TransformerMixin]]
Applies normalization to numerical features.
Parameters:
N_data (ArrayDict) - Numerical data (or None).
normalization (str) - Normalization strategy:
- standard: StandardScaler (mean=0, std=1).
- minmax: MinMaxScaler (scales to [0, 1]).
- quantile: QuantileTransformer (normalizes to Gaussian distribution).
- maxabs: MaxAbsScaler (scales by maximum absolute value).
- power: PowerTransformer (Yeo-Johnson transformation).
- robust: RobustScaler (resistant to outliers).
- none: No normalization.
seed (int) - Random seed for reproducibility (used in QuantileTransformer).
normalizer (Optional[TransformerMixin]) - Pre-fitted normalizer.
Returns:
Tuple containing:
- Normalized numerical data.
- Fitted normalizer.
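The mapping from strategy names to scikit-learn transformers can be sketched as below. The scikit-learn classes are real, but the function names `make_normalizer` and `normalize_sketch`, the specific constructor arguments (for example `n_quantiles`), and the fit-on-train convention are assumptions rather than the library's confirmed settings.

```python
from typing import Dict, Optional, Tuple

import numpy as np
from sklearn.base import TransformerMixin
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    PowerTransformer,
    QuantileTransformer,
    RobustScaler,
    StandardScaler,
)

ArrayDict = Dict[str, np.ndarray]


def make_normalizer(name: str, seed: int, n_train: int) -> TransformerMixin:
    """Build an unfitted transformer for the given strategy name."""
    if name == "standard":
        return StandardScaler()
    if name == "minmax":
        return MinMaxScaler()
    if name == "quantile":
        # Gaussian output; n_quantiles may not exceed the number of training rows.
        return QuantileTransformer(
            output_distribution="normal",
            n_quantiles=min(n_train, 1000),
            random_state=seed,
        )
    if name == "maxabs":
        return MaxAbsScaler()
    if name == "power":
        return PowerTransformer(method="yeo-johnson")
    if name == "robust":
        return RobustScaler()
    raise ValueError(f"unknown normalization: {name}")


def normalize_sketch(
    N_data: Optional[ArrayDict],
    normalization: str,
    seed: int,
    normalizer: Optional[TransformerMixin] = None,
) -> Tuple[Optional[ArrayDict], Optional[TransformerMixin]]:
    """Fit on the training split (unless a fitted normalizer is given), transform all splits."""
    if N_data is None or normalization == "none":
        return N_data, None
    if normalizer is None:
        normalizer = make_normalizer(normalization, seed, len(N_data["train"]))
        normalizer.fit(N_data["train"])
    return {k: normalizer.transform(v) for k, v in N_data.items()}, normalizer


rng = np.random.default_rng(0)
N = {"train": rng.normal(5.0, 2.0, size=(50, 3)), "val": rng.normal(5.0, 2.0, size=(10, 3))}
normalized, fitted = normalize_sketch(N, "standard", seed=0)

# Reuse the fitted normalizer on held-out data instead of refitting.
test_normalized, same = normalize_sketch(
    {"test": rng.normal(5.0, 2.0, size=(5, 3))}, "standard", seed=0, normalizer=fitted
)
```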
def data_label_process(y_data, is_regression, info=None, encoder=None) -> Tuple[ArrayDict, Dict[str, Any], Optional[LabelEncoder]]
Processes labels for regression or classification tasks.
Parameters:
y_data (ArrayDict) - Label data.
is_regression (bool) - Whether the task is regression.
info (Optional[Dict[str, Any]]) - Precomputed label statistics (mean, std for regression; classes for classification).
encoder (Optional[LabelEncoder]) - Pre-fitted label encoder (for classification).
Returns:
Tuple containing:
- Processed labels (standardized for regression; encoded for classification).
- Metadata (mean/std for regression; classes for classification).
- Fitted label encoder (for classification).
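A rough NumPy sketch of the two label paths is shown below: standardization with training statistics for regression, and integer encoding for classification. The function name `process_labels_sketch` and the exact metadata keys (`mean`, `std`, `classes`) are assumptions, and the classification branch uses `np.unique`/`np.searchsorted` in place of a `LabelEncoder`.

```python
from typing import Any, Dict, Optional, Tuple

import numpy as np

ArrayDict = Dict[str, np.ndarray]


def process_labels_sketch(
    y_data: ArrayDict, is_regression: bool, info: Optional[Dict[str, Any]] = None
) -> Tuple[ArrayDict, Dict[str, Any]]:
    """Standardize regression targets or integer-encode class labels."""
    if is_regression:
        if info is None:
            # Statistics come from the training split only.
            info = {
                "mean": float(y_data["train"].mean()),
                "std": float(y_data["train"].std()),
            }
        scaled = {k: (v - info["mean"]) / info["std"] for k, v in y_data.items()}
        return scaled, info

    classes = np.unique(y_data["train"]) if info is None else np.asarray(info["classes"])
    # searchsorted maps each label to its index in the sorted class list;
    # this assumes every label also occurs in the training split.
    encoded = {k: np.searchsorted(classes, v) for k, v in y_data.items()}
    return encoded, {"classes": classes.tolist()}


y_reg, reg_info = process_labels_sketch(
    {"train": np.array([1.0, 2.0, 3.0]), "val": np.array([2.0])}, is_regression=True
)
y_cls, cls_info = process_labels_sketch(
    {"train": np.array(["cat", "dog", "cat"]), "val": np.array(["dog"])},
    is_regression=False,
)
```

Returning the statistics alongside the labels makes it possible to map standardized regression predictions back to the original scale later.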
def data_loader_process(is_regression, X, Y, y_info, device, batch_size, is_train, is_float=False) -> Tuple[ArrayDict, ArrayDict, ArrayDict, DataLoader, DataLoader, Callable] or Tuple[ArrayDict, ArrayDict, ArrayDict, DataLoader, Callable]
Prepares PyTorch DataLoaders for training/validation or test data, with proper type casting and device placement.
Parameters:
is_regression (bool) - Whether the task is regression (vs. classification).
X (Tuple[ArrayDict, ArrayDict]) - Tuple of numerical and categorical data (each as ArrayDict).
Y (ArrayDict) - Label data.
y_info (Dict[str, Any]) - Metadata about labels (e.g., mean/std for regression).
device (torch.device) - Target device (CPU/GPU) for data.
batch_size (int) - Batch size for the DataLoader.
is_train (bool) - If True, creates training and validation loaders; if False, creates a test loader.
is_float (bool, optional, Default is False) - If True, casts data to float32; otherwise uses float64.
Returns:
If is_train=True, a tuple containing:
- Processed numerical data (on device).
- Processed categorical data (on device).
- Processed labels (on device).
- Training DataLoader.
- Validation DataLoader.
- Loss function (MSE for regression, cross-entropy for classification).
If is_train=False, a tuple containing:
- Processed numerical data (on device).
- Processed categorical data (on device).
- Processed labels (on device).
- Test DataLoader.
- Loss function.
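The type casting, device placement, and loss selection described above can be sketched for a single split with numerical features only. The helper `make_loader_sketch` is hypothetical; casting regression labels to the feature dtype and shuffling only the training split are assumptions, not confirmed library behavior.

```python
import numpy as np
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


def make_loader_sketch(
    N_split: np.ndarray,
    y_split: np.ndarray,
    is_regression: bool,
    device: torch.device,
    batch_size: int,
    shuffle: bool,
    is_float: bool = False,
):
    """Move one split to `device`, cast dtypes, and wrap it in a DataLoader."""
    x = torch.as_tensor(N_split, device=device)
    x = x.float() if is_float else x.double()  # float32 vs. float64 features

    y = torch.as_tensor(y_split, device=device)
    # Assumption: regression targets share the feature dtype; class labels are int64.
    y = y.to(x.dtype) if is_regression else y.long()

    loader = DataLoader(TensorDataset(x, y), batch_size=batch_size, shuffle=shuffle)
    loss_fn = F.mse_loss if is_regression else F.cross_entropy
    return x, y, loader, loss_fn


device = torch.device("cpu")
N_train = np.arange(15, dtype=np.float64).reshape(5, 3)
y_train = np.array([0, 1, 0, 1, 1])

x, y, train_loader, loss_fn = make_loader_sketch(
    N_train, y_train, is_regression=False, device=device,
    batch_size=2, shuffle=True, is_float=True,
)

# Pull one batch and evaluate the loss on dummy (all-zero) logits for two classes.
xb, yb = next(iter(train_loader))
loss = loss_fn(torch.zeros(len(yb), 2), yb)
```

With all-zero logits both classes are equally likely, so the cross-entropy equals the natural log of 2 regardless of which rows the shuffled batch contains.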
def to_tensors(data: ArrayDict) -> Dict[str, torch.Tensor]
Converts numpy arrays in an ArrayDict to PyTorch tensors.
Parameters:
data (ArrayDict) - Dictionary with keys like ‘train’, ‘val’, ‘test’ and numpy array values.
Returns:
Dict[str, torch.Tensor] - Dictionary with the same keys, where numpy arrays are converted to PyTorch tensors.
def get_categories(X_cat: Optional[Dict[str, torch.Tensor]]) -> Optional[List[int]]
Computes the number of unique categories for each categorical feature.
Parameters:
X_cat (Optional[Dict[str, torch.Tensor]]) - Categorical data (keys: ‘train’, etc.; values: tensors of shape (n_samples, n_features)).
Returns:
Optional[List[int]] - List where each element is the number of unique categories for the corresponding feature. Returns None if X_cat is None.
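The two helpers above (`to_tensors` and `get_categories`) can be sketched together as follows. The `_sketch` function names are hypothetical, and counting unique values on the ‘train’ split only is an assumption; the description does not say which split is used.

```python
from typing import Dict, List, Optional

import numpy as np
import torch


def to_tensors_sketch(data: Dict[str, np.ndarray]) -> Dict[str, torch.Tensor]:
    """Convert every split's numpy array into a torch tensor."""
    return {split: torch.as_tensor(values) for split, values in data.items()}


def get_categories_sketch(
    X_cat: Optional[Dict[str, torch.Tensor]]
) -> Optional[List[int]]:
    """Count distinct values per categorical column (assumed: on the train split)."""
    if X_cat is None:
        return None
    train = X_cat["train"]
    return [len(torch.unique(train[:, j])) for j in range(train.shape[1])]


C = {
    "train": np.array([[0, 1], [1, 1], [2, 0]]),  # column 0 has 3 values, column 1 has 2
    "val": np.array([[0, 0]]),
}
tensors = to_tensors_sketch(C)
categories = get_categories_sketch(tensors)
```

Per-feature cardinalities like these are commonly used to size embedding tables for categorical inputs.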
class Dataset
A dataclass for storing tabular dataset information.
Fields:
N (Optional[ArrayDict]) - Numerical features (or None if not available).
C (Optional[ArrayDict]) - Categorical features (or None if not available).
y (ArrayDict) - Labels for all splits.
info (Dict[str, Any]) - Dataset metadata.
Properties:
is_binclass (bool) - Whether the task is binary classification.
is_multiclass (bool) - Whether the task is multiclass classification.
is_regression (bool) - Whether the task is regression.
n_num_features (int) - Number of numerical features.
n_cat_features (int) - Number of categorical features.
n_features (int) - Total number of features.
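A dataclass with these fields and properties might look roughly like the sketch below. The class name `DatasetSketch`, the `task_type` metadata key, its values (`binclass`, `multiclass`, `regression`), and reading feature counts from the ‘train’ split are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

import numpy as np

ArrayDict = Dict[str, np.ndarray]


@dataclass
class DatasetSketch:
    N: Optional[ArrayDict]  # numerical features per split, or None
    C: Optional[ArrayDict]  # categorical features per split, or None
    y: ArrayDict            # labels per split
    info: Dict[str, Any]    # dataset metadata

    @property
    def is_binclass(self) -> bool:
        return self.info["task_type"] == "binclass"  # assumed key and value

    @property
    def is_multiclass(self) -> bool:
        return self.info["task_type"] == "multiclass"

    @property
    def is_regression(self) -> bool:
        return self.info["task_type"] == "regression"

    @property
    def n_num_features(self) -> int:
        return 0 if self.N is None else self.N["train"].shape[1]

    @property
    def n_cat_features(self) -> int:
        return 0 if self.C is None else self.C["train"].shape[1]

    @property
    def n_features(self) -> int:
        return self.n_num_features + self.n_cat_features


ds = DatasetSketch(
    N={"train": np.zeros((4, 3))},
    C=None,
    y={"train": np.zeros(4)},
    info={"task_type": "regression"},
)
```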