Data

A collection of functions for loading, preprocessing, and preparing tabular data for machine learning tasks, including handling missing values, encoding features, and creating data loaders.

Functions

def dataname_to_numpy(dataset_name, dataset_path) -> Tuple[ArrayDict, ArrayDict, ArrayDict, Dict[str, Any]]

Loads tabular data from numpy files, including numerical features (N), categorical features (C), labels (y), and dataset metadata.

Parameters:

  • dataset_name (str) - Name of the dataset.

  • dataset_path (str) - Path to the dataset directory.

Returns:

  • Tuple containing:

    • Numerical features (N) as a dictionary with keys ‘train’, ‘val’, ‘test’ (or None if unavailable).

    • Categorical features (C) as a dictionary with keys ‘train’, ‘val’, ‘test’ (or None if unavailable).

    • Labels (y) as a dictionary with keys ‘train’, ‘val’, ‘test’.

    • Dataset metadata from ‘info.json’.
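As a rough illustration of the directory layout this function reads, the sketch below loads per-split arrays and the metadata file. The file names (N_train.npy, C_val.npy, y_test.npy, and so on) and the helpers load_split_arrays and load_dataset_dir are assumptions made for this example; only info.json is named in the description above.

```python
import json
from pathlib import Path

import numpy as np

PARTS = ("train", "val", "test")


def load_split_arrays(dataset_dir, prefix):
    """Load <prefix>_<part>.npy for every split, or return None if absent.

    The file-naming scheme is an assumption made for this sketch.
    """
    dataset_dir = Path(dataset_dir)
    if not (dataset_dir / f"{prefix}_train.npy").exists():
        return None
    return {
        part: np.load(dataset_dir / f"{prefix}_{part}.npy", allow_pickle=True)
        for part in PARTS
    }


def load_dataset_dir(dataset_dir):
    """Return (N, C, y, info) in the shape described above."""
    dataset_dir = Path(dataset_dir)
    N = load_split_arrays(dataset_dir, "N")  # numerical features, may be None
    C = load_split_arrays(dataset_dir, "C")  # categorical features, may be None
    y = load_split_arrays(dataset_dir, "y")  # labels
    info = json.loads((dataset_dir / "info.json").read_text())
    return N, C, y, info
```

With this layout, a dataset that has no categorical features simply omits the C_*.npy files and the second element of the result is None.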

def get_dataset(dataset_name, dataset_path) -> Tuple[Tuple[ArrayDict, ArrayDict, ArrayDict], Tuple[ArrayDict, ArrayDict, ArrayDict], Dict[str, Any]]

Splits loaded data into training/validation and test sets.

Parameters:

  • dataset_name (str) - Name of the dataset.

  • dataset_path (str) - Path to the dataset directory.

Returns:

  • Tuple containing:

    • Training/validation data (numerical, categorical, labels).

    • Test data (numerical, categorical, labels).

    • Dataset metadata.
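The split itself can be pictured as partitioning each per-split dictionary by key. The helper below is a hypothetical sketch of that idea, not the library's code; it assumes the dictionaries use the keys ‘train’, ‘val’, and ‘test’ as described above.

```python
def split_train_val_test(arrays):
    """Split one ArrayDict-like mapping into a train/val part and a test part.

    Returns (None, None) when the feature group is unavailable.
    """
    if arrays is None:
        return None, None
    train_val = {key: arrays[key] for key in ("train", "val")}
    test = {"test": arrays["test"]}
    return train_val, test
```

Applying this to N, C, and y in turn gives the two triples in the return value.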

def data_nan_process(N_data, C_data, num_nan_policy, cat_nan_policy, num_new_value=None, imputer=None, cat_new_value=None) -> Tuple[ArrayDict, ArrayDict, Optional[np.ndarray], Optional[SimpleImputer], Optional[str]]

Processes missing values in numerical and categorical features.

Parameters:

  • N_data (ArrayDict) - Numerical features (may contain NaNs).

  • C_data (ArrayDict) - Categorical features (may contain NaNs).

  • num_nan_policy (str) - Strategy for numerical NaNs (‘mean’ or ‘median’).

  • cat_nan_policy (str) - Strategy for categorical NaNs (‘new’ or ‘most_frequent’).

  • num_new_value (Optional[np.ndarray]) - Precomputed values to fill numerical NaNs.

  • imputer (Optional[SimpleImputer]) - Pre-fitted imputer for categorical features.

  • cat_new_value (Optional[str]) - Value to fill categorical NaNs (for ‘new’ policy).

Returns:

  • Tuple containing:

    • Processed numerical features.

    • Processed categorical features.

    • Values used to fill numerical NaNs.

    • Fitted imputer for categorical features (if used).

    • Value used to fill categorical NaNs (if used).
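The numerical side of this step can be sketched with NumPy alone: fill values are computed on the training split and then reused for every other split. The function name fill_numeric_nans and its signature are hypothetical and only mirror the ‘mean’ and ‘median’ policies listed above.

```python
import numpy as np


def fill_numeric_nans(N_data, policy="mean", fill_values=None):
    """Fill NaNs column-wise; fit the fill values on 'train' if none are given."""
    if fill_values is None:
        if policy == "mean":
            fill_values = np.nanmean(N_data["train"], axis=0)
        elif policy == "median":
            fill_values = np.nanmedian(N_data["train"], axis=0)
        else:
            raise ValueError(f"unknown policy: {policy}")
    filled = {}
    for part, x in N_data.items():
        x = x.copy()  # leave the caller's arrays untouched
        rows, cols = np.where(np.isnan(x))
        x[rows, cols] = fill_values[cols]  # per-column replacement
        filled[part] = x
    return filled, fill_values
```

Passing the returned fill_values back in when processing the test split keeps test-time imputation consistent with training, which is the role num_new_value plays in the signature above.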

def num_enc_process(N_data, num_policy, n_bins=2, y_train=None, is_regression=False, encoder=None) -> Tuple[ArrayDict, Optional[Union[PiecewiseLinearEncoding, UnaryEncoding, BinsEncoding, JohnsonEncoding]]]

Encodes numerical features using various strategies (e.g., piecewise linear, unary, bins).

Parameters:

  • N_data (ArrayDict) - Numerical features to encode.

  • num_policy (str) - Encoding strategy (e.g., ‘Q_PLE’ for quantile-based piecewise linear encoding).

  • n_bins (int, optional, Default is 2) - Number of bins for discretization.

  • y_train (Optional[np.ndarray]) - Training labels (for target-based encoding).

  • is_regression (bool, optional, Default is False) - Whether the task is regression.

  • encoder (Optional) - Pre-fitted encoder (if None, a new one is fitted).

Returns:

  • Tuple containing:

    • Encoded numerical features.

    • Fitted encoder.
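To give a feel for quantile-based piecewise linear encoding, here is a simplified single-feature sketch in NumPy. It is not the library's PiecewiseLinearEncoding class: the function quantile_ple is hypothetical, and clipping every bin to [0, 1] is a simplifying assumption.

```python
import numpy as np


def quantile_ple(x_train, x, n_bins=2):
    """Encode one numerical feature as n_bins piecewise-linear components.

    Bin edges are the empirical quantiles of the training values. Each
    component is 0 below its bin, 1 above it, and linear inside it.
    """
    edges = np.quantile(x_train, np.linspace(0.0, 1.0, n_bins + 1))
    left, right = edges[:-1], edges[1:]
    # Guard against zero-width bins caused by repeated quantile values.
    width = np.where(right > left, right - left, 1.0)
    return np.clip((x[:, None] - left) / width, 0.0, 1.0)
```

For training values 0, 1, 2, 3, 4 and n_bins=2 the edges are 0, 2, 4, so the value 1.0 encodes to (0.5, 0.0) and the value 3.0 to (1.0, 0.5).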

def data_enc_process(N_data, C_data, cat_policy, y_train=None, ord_encoder=None, mode_values=None, cat_encoder=None) -> Tuple[ArrayDict, ArrayDict, Optional[OrdinalEncoder], Optional[List[int]], Optional[OneHotEncoder]]

Encodes categorical features using various strategies (e.g., one-hot, target encoding) and handles unknown categories.

Parameters:

  • N_data (ArrayDict) - Numerical data (or None).

  • C_data (ArrayDict) - Categorical data (or None).

  • cat_policy (str) - Encoding strategy:

    • indices: Return ordinal indices without further encoding.

    • ordinal: Use ordinal encoding.

    • ohe/tabr_ohe: One-hot encoding (with tabr_ohe for TabR compatibility).

    • binary: Binary encoding (from category_encoders).

    • hash: Hashing encoding (from category_encoders).

    • loo: Leave-one-out encoding (supervised, from category_encoders).

    • target: Target encoding (supervised, from category_encoders).

    • catboost: CatBoost encoding (supervised, from category_encoders).

  • y_train (Optional[np.ndarray]) - Training labels (for supervised encodings).

  • ord_encoder (Optional[OrdinalEncoder]) - Pre-fitted ordinal encoder.

  • mode_values (Optional[List[int]]) - Mode values for replacing unknown categories in validation/test sets.

  • cat_encoder (Optional) - Pre-fitted categorical encoder (e.g., OneHotEncoder).

Returns:

  • Tuple containing:

    • Processed numerical data (merged with encoded categoricals if applicable).

    • Categorical data (None if the categoricals were merged into the numerical data).

    • Fitted ordinal encoder.

    • Mode values for unknown categories.

    • Fitted categorical encoder.
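The handling of unknown categories can be illustrated without scikit-learn or category_encoders. The sketch below fits an ordinal mapping on one training column and sends any category unseen during training to the index of the most frequent training category. The helper names are hypothetical, and the fallback to the training mode is an assumption based on the mode_values parameter described above.

```python
import numpy as np


def fit_ordinal(train_col):
    """Return (category -> index mapping, index of the most frequent category)."""
    cats, counts = np.unique(train_col, return_counts=True)
    mapping = {cat: idx for idx, cat in enumerate(cats.tolist())}
    mode_index = int(np.argmax(counts))
    return mapping, mode_index


def transform_ordinal(col, mapping, mode_index):
    """Map categories to indices, replacing unseen ones with the mode index."""
    return np.array(
        [mapping.get(cat, mode_index) for cat in col.tolist()], dtype=np.int64
    )
```

Reusing the fitted mapping and mode index for the validation and test splits corresponds to passing ord_encoder and mode_values back into the function.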

def data_norm_process(N_data, normalization, seed, normalizer=None) -> Tuple[ArrayDict, Optional[TransformerMixin]]

Applies normalization to numerical features.

Parameters:

  • N_data (ArrayDict) - Numerical data (or None).

  • normalization (str) - Normalization strategy:

    • standard: StandardScaler (mean=0, std=1).

    • minmax: MinMaxScaler (scales to [0, 1]).

    • quantile: QuantileTransformer (normalizes to Gaussian distribution).

    • maxabs: MaxAbsScaler (scales by maximum absolute value).

    • power: PowerTransformer (Yeo-Johnson transformation).

    • robust: RobustScaler (resistant to outliers).

    • none: No normalization.

  • seed (int) - Random seed for reproducibility (used in QuantileTransformer).

  • normalizer (Optional[TransformerMixin]) - Pre-fitted normalizer.

Returns:

  • Tuple containing:

    • Normalized numerical data.

    • Fitted normalizer.
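The fit-on-train, apply-everywhere pattern behind the normalizer argument can be sketched for the ‘standard’ policy in plain NumPy. The function standardize_splits is a hypothetical stand-in for StandardScaler, not the library's implementation.

```python
import numpy as np


def standardize_splits(N_data, stats=None):
    """Standardize every split with statistics fitted on 'train'.

    `stats` is a (mean, std) pair; pass it back in to reuse a previous fit.
    """
    if stats is None:
        mean = N_data["train"].mean(axis=0)
        std = N_data["train"].std(axis=0)
        std = np.where(std > 0, std, 1.0)  # constant columns stay finite
        stats = (mean, std)
    mean, std = stats
    normalized = {part: (x - mean) / std for part, x in N_data.items()}
    return normalized, stats
```

Fitting only on the training split avoids leaking validation or test statistics into preprocessing.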

def data_label_process(y_data, is_regression, info=None, encoder=None) -> Tuple[ArrayDict, Dict[str, Any], Optional[LabelEncoder]]

Processes labels for regression or classification tasks.

Parameters:

  • y_data (ArrayDict) - Label data.

  • is_regression (bool) - Whether the task is regression.

  • info (Optional[Dict[str, Any]]) - Precomputed label statistics (mean, std for regression; classes for classification).

  • encoder (Optional[LabelEncoder]) - Pre-fitted label encoder (for classification).

Returns:

  • Tuple containing:

    • Processed labels (standardized for regression; encoded for classification).

    • Metadata (mean/std for regression; classes for classification).

    • Fitted label encoder (for classification).
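A minimal sketch of the two branches, assuming that the metadata holds the keys ‘mean’ and ‘std’ for regression and ‘classes’ for classification. Both the key names and the function process_labels are assumptions for this example, and a plain dictionary lookup stands in for LabelEncoder.

```python
import numpy as np


def process_labels(y_data, is_regression, info=None):
    """Standardize regression targets or integer-encode class labels."""
    if is_regression:
        if info is None:
            train = y_data["train"]
            info = {"mean": float(train.mean()), "std": float(train.std())}
        processed = {
            part: (y - info["mean"]) / info["std"] for part, y in y_data.items()
        }
    else:
        if info is None:
            info = {"classes": np.unique(y_data["train"]).tolist()}
        index = {cls: i for i, cls in enumerate(info["classes"])}
        processed = {
            part: np.array([index[label] for label in y.tolist()], dtype=np.int64)
            for part, y in y_data.items()
        }
    return processed, info
```

For regression, predictions made in the standardized space can be mapped back with prediction * info["std"] + info["mean"].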

def data_loader_process(is_regression, X, Y, y_info, device, batch_size, is_train, is_float=False) -> Tuple[ArrayDict, ArrayDict, ArrayDict, DataLoader, DataLoader, Callable] or Tuple[ArrayDict, ArrayDict, ArrayDict, DataLoader, Callable]

Prepares PyTorch DataLoaders for training/validation or test data, with proper type casting and device placement.

Parameters:

  • is_regression (bool) - Whether the task is regression (vs. classification).

  • X (Tuple[ArrayDict, ArrayDict]) - Tuple of numerical and categorical data (each as ArrayDict).

  • Y (ArrayDict) - Label data.

  • y_info (Dict[str, Any]) - Metadata about labels (e.g., mean/std for regression).

  • device (torch.device) - Target device (CPU/GPU) for data.

  • batch_size (int) - Batch size for the DataLoader.

  • is_train (bool) - If True, creates training and validation loaders; if False, creates a test loader.

  • is_float (bool, optional, Default is False) - If True, casts data to float32; otherwise uses float64.

Returns:

  • If is_train=True, a tuple containing:

    • Processed numerical data (on device).

    • Processed categorical data (on device).

    • Processed labels (on device).

    • Training DataLoader.

    • Validation DataLoader.

    • Loss function (MSE for regression, cross-entropy for classification).

  • If is_train=False, a tuple containing:

    • Processed numerical data (on device).

    • Processed categorical data (on device).

    • Processed labels (on device).

    • Test DataLoader.

    • Loss function.

def to_tensors(data: ArrayDict) -> Dict[str, torch.Tensor]

Converts numpy arrays in an ArrayDict to PyTorch tensors.

Parameters:

  • data (ArrayDict) - Dictionary with keys like ‘train’, ‘val’, ‘test’ and numpy array values.

Returns:

  • Dict[str, torch.Tensor] - Dictionary with the same keys, where numpy arrays are converted to PyTorch tensors.

def get_categories(X_cat: Optional[Dict[str, torch.Tensor]]) -> Optional[List[int]]

Computes the number of unique categories for each categorical feature.

Parameters:

  • X_cat (Optional[Dict[str, torch.Tensor]]) - Categorical data (keys: ‘train’, etc.; values: tensors of shape (n_samples, n_features)).

Returns:

  • Optional[List[int]] - List where each element is the number of unique categories for the corresponding feature. Returns None if X_cat is None.
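The counting logic is small enough to sketch directly. The version below is a hypothetical illustration that counts on the ‘train’ split only (an assumption). It relies solely on 2-D slicing and .tolist(), which PyTorch tensors and NumPy arrays both support, so the usage line uses a NumPy array as a stand-in.

```python
import numpy as np


def count_categories(X_cat):
    """Return the number of distinct values per categorical column, or None."""
    if X_cat is None:
        return None
    train = X_cat["train"]
    return [len(set(train[:, i].tolist())) for i in range(train.shape[1])]


# Two categorical features: the first takes 3 distinct values, the second 2.
cardinalities = count_categories({"train": np.array([[0, 1], [1, 1], [2, 0]])})
```

Such a list is typically what an embedding layer needs to size one lookup table per categorical feature.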

class Dataset

A dataclass for storing tabular dataset information.

Fields:

  • N (Optional[ArrayDict]) - Numerical features (or None if not available).

  • C (Optional[ArrayDict]) - Categorical features (or None if not available).

  • y (ArrayDict) - Labels for all splits.

  • info (Dict[str, Any]) - Dataset metadata.

Properties:

  • is_binclass (bool) - Whether the task is binary classification.

  • is_multiclass (bool) - Whether the task is multiclass classification.

  • is_regression (bool) - Whether the task is regression.

  • n_num_features (int) - Number of numerical features.

  • n_cat_features (int) - Number of categorical features.

  • n_features (int) - Total number of features.
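For orientation, a dataclass with these fields and properties could look like the sketch below. The class name DatasetSketch, the info key ‘task_type’, and its values ‘binclass’, ‘multiclass’, and ‘regression’ are assumptions made for this example and are not confirmed by the description above.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

import numpy as np

ArrayDict = Dict[str, np.ndarray]


@dataclass
class DatasetSketch:
    N: Optional[ArrayDict]
    C: Optional[ArrayDict]
    y: ArrayDict
    info: Dict[str, Any]

    @property
    def is_binclass(self) -> bool:
        return self.info["task_type"] == "binclass"  # assumed key and value

    @property
    def is_multiclass(self) -> bool:
        return self.info["task_type"] == "multiclass"  # assumed key and value

    @property
    def is_regression(self) -> bool:
        return self.info["task_type"] == "regression"  # assumed key and value

    @property
    def n_num_features(self) -> int:
        return 0 if self.N is None else self.N["train"].shape[1]

    @property
    def n_cat_features(self) -> int:
        return 0 if self.C is None else self.C["train"].shape[1]

    @property
    def n_features(self) -> int:
        return self.n_num_features + self.n_cat_features
```

Deriving the feature counts from the training arrays keeps them correct even when one of the two feature groups is missing.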