**Data** ======== A collection of functions for loading, preprocessing, and preparing tabular data for machine learning tasks, including handling missing values, encoding features, and creating data loaders. Functions ~~~~~~~~~ .. code-block:: python def dataname_to_numpy(dataset_name, dataset_path) -> Tuple[ArrayDict, ArrayDict, ArrayDict, Dict[str, Any]] Loads tabular data from numpy files, including numerical features (N), categorical features (C), labels (y), and dataset metadata. **Parameters:** * **dataset_name** *(str)* - Name of the dataset. * **dataset_path** *(str)* - Path to the dataset directory. **Returns:** * **Tuple** containing: - Numerical features (N) as a dictionary with keys 'train', 'val', 'test' (or None if unavailable). - Categorical features (C) as a dictionary with keys 'train', 'val', 'test' (or None if unavailable). - Labels (y) as a dictionary with keys 'train', 'val', 'test'. - Dataset metadata from 'info.json'. .. code-block:: python def get_dataset(dataset_name, dataset_path) -> Tuple[Tuple[ArrayDict, ArrayDict, ArrayDict], Tuple[ArrayDict, ArrayDict, ArrayDict], Dict[str, Any]] Splits loaded data into training/validation and test sets. **Parameters:** * **dataset_name** *(str)* - Name of the dataset. * **dataset_path** *(str)* - Path to the dataset directory. **Returns:** * **Tuple** containing: - Training/validation data (numerical, categorical, labels). - Test data (numerical, categorical, labels). - Dataset metadata. .. code-block:: python def data_nan_process(N_data, C_data, num_nan_policy, cat_nan_policy, num_new_value=None, imputer=None, cat_new_value=None) -> Tuple[ArrayDict, ArrayDict, Optional[np.ndarray], Optional[SimpleImputer], Optional[str]] Processes missing values in numerical and categorical features. **Parameters:** * **N_data** *(ArrayDict)* - Numerical features (may contain NaNs). * **C_data** *(ArrayDict)* - Categorical features (may contain NaNs). * **num_nan_policy** *(str)* - Strategy for numerical NaNs ('mean' or 'median'). * **cat_nan_policy** *(str)* - Strategy for categorical NaNs ('new' or 'most_frequent'). * **num_new_value** *(Optional[np.ndarray])* - Precomputed values to fill numerical NaNs. * **imputer** *(Optional[SimpleImputer])* - Pre-fit imputer for categorical features. * **cat_new_value** *(Optional[str])* - Value to fill categorical NaNs (for 'new' policy). **Returns:** * **Tuple** containing: - Processed numerical features. - Processed categorical features. - Values used to fill numerical NaNs. - Fitted imputer for categorical features (if used). - Value used to fill categorical NaNs (if used). .. code-block:: python def num_enc_process(N_data, num_policy, n_bins=2, y_train=None, is_regression=False, encoder=None) -> Tuple[ArrayDict, Optional[Union[PiecewiseLinearEncoding, UnaryEncoding, BinsEncoding, JohnsonEncoding]]] Encodes numerical features using various strategies (e.g., piecewise linear, unary, bins). **Parameters:** * **N_data** *(ArrayDict)* - Numerical features to encode. * **num_policy** *(str)* - Encoding strategy (e.g., 'Q_PLE' for quantile-based piecewise linear encoding). * **n_bins** *(int, optional, Default is 2)* - Number of bins for discretization. * **y_train** *(Optional[np.ndarray])* - Training labels (for target-based encoding). * **is_regression** *(bool, optional, Default is False)* - Whether the task is regression. * **encoder** *(Optional)* - Pre-fit encoder (if None, fits a new one). **Returns:** * **Tuple** containing: - Encoded numerical features. - Fitted encoder. .. code-block:: python def data_enc_process(N_data, C_data, cat_policy, y_train=None, ord_encoder=None, mode_values=None, cat_encoder=None) -> Tuple[ArrayDict, ArrayDict, Optional[OrdinalEncoder], Optional[List[int]], Optional[OneHotEncoder]] Encodes categorical features using various strategies (e.g., one-hot, target encoding) and handles unknown categories. **Parameters:** * **N_data** *(ArrayDict)* - Numerical data (or None). * **C_data** *(ArrayDict)* - Categorical data (or None). * **cat_policy** *(str)* - Encoding strategy: - `indices`: Return ordinal indices without further encoding. - `ordinal`: Use ordinal encoding. - `ohe`/`tabr_ohe`: One-hot encoding (with `tabr_ohe` for TabR compatibility). - `binary`: Binary encoding (from `category_encoders`). - `hash`: Hashing encoding (from `category_encoders`). - `loo`: Leave-one-out encoding (supervised, from `category_encoders`). - `target`: Target encoding (supervised, from `category_encoders`). - `catboost`: CatBoost encoding (supervised, from `category_encoders`). * **y_train** *(Optional[np.ndarray])* - Training labels (for supervised encodings). * **ord_encoder** *(Optional[OrdinalEncoder])* - Pre-fitted ordinal encoder. * **mode_values** *(Optional[List[int]])* - Mode values for replacing unknown categories in validation/test sets. * **cat_encoder** *(Optional)* - Pre-fitted categorical encoder (e.g., `OneHotEncoder`). **Returns:** * **Tuple** containing: - Processed numerical data (merged with encoded categoricals if applicable). - Unused (returns None if categoricals are merged into numerical data). - Fitted ordinal encoder. - Mode values for unknown categories. - Fitted categorical encoder. .. code-block:: python def data_norm_process(N_data, normalization, seed, normalizer=None) -> Tuple[ArrayDict, Optional[TransformerMixin]] Applies normalization to numerical features. **Parameters:** * **N_data** *(ArrayDict)* - Numerical data (or None). * **normalization** *(str)* - Normalization strategy: - `standard`: StandardScaler (mean=0, std=1). - `minmax`: MinMaxScaler (scales to [0, 1]). - `quantile`: QuantileTransformer (normalizes to Gaussian distribution). - `maxabs`: MaxAbsScaler (scales by maximum absolute value). - `power`: PowerTransformer (Yeo-Johnson transformation). - `robust`: RobustScaler (resistant to outliers). - `none`: No normalization. * **seed** *(int)* - Random seed for reproducibility (used in `QuantileTransformer`). * **normalizer** *(Optional[TransformerMixin])* - Pre-fitted normalizer. **Returns:** * **Tuple** containing: - Normalized numerical data. - Fitted normalizer. .. code-block:: python def data_label_process(y_data, is_regression, info=None, encoder=None) -> Tuple[ArrayDict, Dict[str, Any], Optional[LabelEncoder]] Processes labels for regression or classification tasks. **Parameters:** * **y_data** *(ArrayDict)* - Label data. * **is_regression** *(bool)* - Whether the task is regression. * **info** *(Optional[Dict[str, Any]])* - Precomputed label statistics (mean, std for regression; classes for classification). * **encoder** *(Optional[LabelEncoder])* - Pre-fitted label encoder (for classification). **Returns:** * **Tuple** containing: - Processed labels (standardized for regression; encoded for classification). - Metadata (mean/std for regression; classes for classification). - Fitted label encoder (for classification). .. code-block:: python def data_loader_process(is_regression, X, Y, y_info, device, batch_size, is_train, is_float=False) -> Tuple[ArrayDict, ArrayDict, ArrayDict, DataLoader, DataLoader, Callable] or Tuple[ArrayDict, ArrayDict, ArrayDict, DataLoader, Callable] Prepares PyTorch DataLoaders for training/validation or test data, with proper type casting and device placement. **Parameters:** * **is_regression** *(bool)* - Whether the task is regression (vs. classification). * **X** *(Tuple[ArrayDict, ArrayDict])* - Tuple of numerical and categorical data (each as `ArrayDict`). * **Y** *(ArrayDict)* - Label data. * **y_info** *(Dict[str, Any])* - Metadata about labels (e.g., mean/std for regression). * **device** *(torch.device)* - Target device (CPU/GPU) for data. * **batch_size** *(int)* - Batch size for the DataLoader. * **is_train** *(bool)* - If True, creates training and validation loaders; if False, creates a test loader. * **is_float** *(bool, optional, Default is False)* - If True, casts data to `float32`; otherwise uses `float64`. **Returns:** * If `is_train=True`: - Tuple containing: - Processed numerical data (on device). - Processed categorical data (on device). - Processed labels (on device). - Training DataLoader. - Validation DataLoader. - Loss function (MSE for regression, cross-entropy for classification). * If `is_train=False`: - Tuple containing: - Processed numerical data (on device). - Processed categorical data (on device). - Processed labels (on device). - Test DataLoader. - Loss function. .. code-block:: python def to_tensors(data: ArrayDict) -> Dict[str, torch.Tensor] Converts numpy arrays in an `ArrayDict` to PyTorch tensors. **Parameters:** * **data** *(ArrayDict)* - Dictionary with keys like `'train'`, `'val'`, `'test'` and numpy array values. **Returns:** * **Dict[str, torch.Tensor]** - Dictionary with the same keys, where numpy arrays are converted to PyTorch tensors. .. code-block:: python def get_categories(X_cat: Optional[Dict[str, torch.Tensor]]) -> Optional[List[int]] Computes the number of unique categories for each categorical feature. **Parameters:** * **X_cat** *(Optional[Dict[str, torch.Tensor]])* - Categorical data (keys: `'train'`, etc.; values: tensors of shape `(n_samples, n_features)`). **Returns:** * **Optional[List[int]]** - List where each element is the number of unique categories for the corresponding feature. Returns `None` if `X_cat` is `None`. .. code-block:: python class Dataset A dataclass for storing tabular dataset information. **Fields:** * **N** *(Optional[ArrayDict])* - Numerical features (or None if not available). * **C** *(Optional[ArrayDict])* - Categorical features (or None if not available). * **y** *(ArrayDict)* - Labels for all splits. * **info** *(Dict[str, Any])* - Dataset metadata. **Properties:** * **is_binclass** *(bool)* - Whether the task is binary classification. * **is_multiclass** *(bool)* - Whether the task is multiclass classification. * **is_regression** *(bool)* - Whether the task is regression. * **n_num_features** *(int)* - Number of numerical features. * **n_cat_features** *(int)* - Number of categorical features. * **n_features** *(int)* - Total number of features.