Data
=====

.. automodule:: TALENT.model.lib.data
   :members:
   :undoc-members:
   :show-inheritance:

Data Structures
---------------

.. class:: Dataset
   :noindex:

   A container for a tabular dataset whose numerical features, categorical features, and targets are each stored per partition ('train', 'val', 'test').
   
   **Attributes:**
   
   * **N** (*Optional[ArrayDict]*) -- Numerical features dictionary with 'train', 'val', 'test' keys
   * **C** (*Optional[ArrayDict]*) -- Categorical features dictionary with 'train', 'val', 'test' keys  
   * **y** (*ArrayDict*) -- Target labels dictionary with 'train', 'val', 'test' keys
   * **info** (*Dict[str, Any]*) -- Dataset metadata including task type and feature counts

   **Properties:**

   .. property:: is_binclass
      :noindex:
      
      Check if the dataset is for binary classification.
      
      **Returns:**
      
      * **bool** -- True if task_type is 'binclass'
   
   .. property:: is_multiclass
      :noindex:
      
      Check if the dataset is for multi-class classification.
      
      **Returns:**
      
      * **bool** -- True if task_type is 'multiclass'
   
   .. property:: is_regression
      :noindex:
      
      Check if the dataset is for regression.
      
      **Returns:**
      
      * **bool** -- True if task_type is 'regression'
   
   .. property:: n_num_features
      :noindex:
      
      Get the number of numerical features.
      
      **Returns:**
      
      * **int** -- Number of numerical features
   
   .. property:: n_cat_features
      :noindex:
      
      Get the number of categorical features.
      
      **Returns:**
      
      * **int** -- Number of categorical features
   
   .. property:: n_features
      :noindex:
      
      Get the total number of features.
      
      **Returns:**
      
      * **int** -- Total number of features (numerical + categorical)
   
   .. method:: size(part)
      :noindex:
      
      Get the size of a specific dataset partition.
      
      **Parameters:**
      
      * **part** (*str*) -- Dataset partition ('train', 'val', or 'test')
      
      **Returns:**
      
      * **int** -- Number of samples in the specified partition
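
   The attributes and properties above can be pictured with a small stand-in class. This is a minimal sketch under stated assumptions, not the library's implementation: ``DatasetSketch`` is a hypothetical name, the feature counts are derived from array shapes (the real class may read them from ``info``), and ``info['task_type']`` is assumed to hold the task-type string.

   .. code-block:: python

      from dataclasses import dataclass
      from typing import Any, Dict, Optional

      import numpy as np

      ArrayDict = Dict[str, np.ndarray]


      @dataclass
      class DatasetSketch:
          """Hypothetical stand-in for Dataset (illustration only)."""

          N: Optional[ArrayDict]
          C: Optional[ArrayDict]
          y: ArrayDict
          info: Dict[str, Any]

          @property
          def is_regression(self) -> bool:
              # Assumption: info['task_type'] stores the task-type string.
              return self.info["task_type"] == "regression"

          @property
          def n_num_features(self) -> int:
              # Assumption: counted from the array shape, not read from info.
              return 0 if self.N is None else self.N["train"].shape[1]

          @property
          def n_cat_features(self) -> int:
              return 0 if self.C is None else self.C["train"].shape[1]

          @property
          def n_features(self) -> int:
              return self.n_num_features + self.n_cat_features

          def size(self, part: str) -> int:
              # Number of samples in the requested partition.
              return len(self.y[part])


      # A toy regression dataset with three numerical features and no categorical ones.
      parts = {"train": 6, "val": 2, "test": 2}
      ds = DatasetSketch(
          N={p: np.zeros((n, 3)) for p, n in parts.items()},
          C=None,
          y={p: np.zeros(n) for p, n in parts.items()},
          info={"task_type": "regression"},
      )
      print(ds.n_features, ds.size("train"), ds.is_regression)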

Data Loading
------------

.. function:: dataname_to_numpy(dataset_name, dataset_path)
   :noindex:

   Load dataset from numpy files stored in the specified directory structure.
   
   **Parameters:**
   
   * **dataset_name** (*str*) -- Name of the dataset directory
   * **dataset_path** (*str*) -- Base path to the dataset directory
   
   **Returns:**
   
   * **Tuple** -- (N, C, y, info) where:

     * N: Numerical features dictionary or None
     * C: Categorical features dictionary or None
     * y: Target labels dictionary
     * info: Dataset metadata from info.json
   
   **Expected Directory Structure:**
   
   .. code-block:: text
      
      dataset_path/dataset_name/
      ├── N_train.npy (optional)
      ├── N_val.npy (optional)
      ├── N_test.npy (optional)
      ├── C_train.npy (optional)
      ├── C_val.npy (optional)
      ├── C_test.npy (optional)
      ├── y_train.npy
      ├── y_val.npy
      ├── y_test.npy
      └── info.json
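
   Reading this layout might look roughly like the sketch below. ``load_arrays_sketch`` is a hypothetical helper, not the library function; probing for ``N_train.npy`` / ``C_train.npy`` to detect the optional arrays and passing ``allow_pickle=True`` (so that object-typed categorical arrays can be read) are assumptions.

   .. code-block:: python

      import json
      import tempfile
      from pathlib import Path

      import numpy as np

      PARTS = ("train", "val", "test")


      def load_arrays_sketch(dataset_name, dataset_path):
          """Hypothetical loader for the directory layout shown above."""
          root = Path(dataset_path) / dataset_name

          def load(prefix):
              return {
                  p: np.load(root / f"{prefix}_{p}.npy", allow_pickle=True)
                  for p in PARTS
              }

          # N and C are optional, so probe for the training file first.
          N = load("N") if (root / "N_train.npy").exists() else None
          C = load("C") if (root / "C_train.npy").exists() else None
          info = json.loads((root / "info.json").read_text())
          return N, C, load("y"), info


      # Write a tiny numerical-only dataset to a temporary directory and read it back.
      with tempfile.TemporaryDirectory() as tmp:
          root = Path(tmp) / "toy"
          root.mkdir()
          for p, n in zip(PARTS, (4, 2, 2)):
              np.save(root / f"N_{p}.npy", np.ones((n, 3)))
              np.save(root / f"y_{p}.npy", np.zeros(n))
          (root / "info.json").write_text(json.dumps({"task_type": "regression"}))
          N, C, y, info = load_arrays_sketch("toy", tmp)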

.. function:: get_dataset(dataset_name, dataset_path)
   :noindex:

   Load and split dataset into training/validation and test sets.
   
   **Parameters:**
   
   * **dataset_name** (*str*) -- Name of the dataset directory
   * **dataset_path** (*str*) -- Base path to the dataset directory
   
   **Returns:**
   
   * **Tuple** -- (train_val_data, test_data, info) where:

     * train_val_data: Tuple of (N_trainval, C_trainval, y_trainval)
     * test_data: Tuple of (N_test, C_test, y_test)
     * info: Dataset metadata
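
   One way to picture the return value is the sketch below. It assumes that the train/validation group keeps separate 'train' and 'val' keys and that the test group keeps only 'test'; ``split_parts_sketch`` is a hypothetical helper and the real function may organize the pieces differently.

   .. code-block:: python

      def split_parts_sketch(N, C, y):
          """Hypothetical helper: separate train/val partitions from test."""

          def pick(data, keys):
              # Missing feature groups stay None in both outputs.
              return None if data is None else {k: data[k] for k in keys}

          train_val = tuple(pick(d, ("train", "val")) for d in (N, C, y))
          test = tuple(pick(d, ("test",)) for d in (N, C, y))
          return train_val, test


      # Plain lists stand in for numpy arrays to keep the example small.
      y = {"train": [0, 1, 1], "val": [1], "test": [0]}
      train_val, test = split_parts_sketch(None, None, y)
      N_trainval, C_trainval, y_trainval = train_val
      N_test, C_test, y_test = test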

.. function:: load_json(path)
   :noindex:

   Load and parse a JSON file.
   
   **Parameters:**
   
   * **path** (*str*) -- Path to the JSON file
   
   **Returns:**
   
   * **dict** -- Parsed JSON content

Data Preprocessing
------------------

.. function:: data_nan_process(N_data, C_data, num_nan_policy, cat_nan_policy, num_new_value=None, imputer=None, cat_new_value=None)
   :noindex:

   Handle missing values (NaN) in numerical and categorical features.
   
   **Parameters:**
   
   * **N_data** (*ArrayDict*) -- Numerical features dictionary
   * **C_data** (*ArrayDict*) -- Categorical features dictionary
   * **num_nan_policy** (*str*) -- Policy for numerical NaN values ('mean', 'median')
   * **cat_nan_policy** (*str*) -- Policy for categorical NaN values ('new', 'most_frequent')
   * **num_new_value** (*Optional[np.ndarray]*) -- Pre-computed values for numerical NaN replacement
   * **imputer** (*Optional[SimpleImputer]*) -- Fitted imputer for categorical features
   * **cat_new_value** (*Optional[str]*) -- Value to replace categorical NaN values
   
   **Returns:**
   
   * **Tuple** -- (N, C, num_new_value, imputer, cat_new_value) where:

     * N: Processed numerical features
     * C: Processed categorical features
     * num_new_value: Values used for numerical NaN replacement
     * imputer: Fitted imputer for categorical features
     * cat_new_value: Value used for categorical NaN replacement
   
   **Numerical NaN Policies:**
   
   * **mean**: Replace NaN with mean of the feature
   * **median**: Replace NaN with median of the feature
   
   **Categorical NaN Policies:**
   
   * **new**: Replace NaN with special token '___null___'
   * **most_frequent**: Replace NaN with most frequent value using SimpleImputer
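
   The 'mean' and 'new' policies can be sketched with plain numpy as follows. This is an illustration under assumptions, not the library code: the helper names are hypothetical, and the fill values are assumed to be computed on the training partition and then reused for 'val' and 'test', which is what the returned ``num_new_value`` suggests.

   .. code-block:: python

      import numpy as np

      NULL_TOKEN = "___null___"


      def fill_numerical_mean(N, num_new_value=None):
          """Hypothetical 'mean' policy: per-column training means replace NaN."""
          if num_new_value is None:
              num_new_value = np.nanmean(N["train"], axis=0)
          filled = {p: np.where(np.isnan(x), num_new_value, x) for p, x in N.items()}
          return filled, num_new_value


      def fill_categorical_new(C):
          """Hypothetical 'new' policy: NaN becomes a dedicated token."""
          filled = {}
          for p, x in C.items():
              out = x.astype(object)  # astype returns a copy
              # NaN is the only value that is not equal to itself.
              mask = np.array([[v != v for v in row] for row in out], dtype=bool)
              out[mask] = NULL_TOKEN
              filled[p] = out
          return filled


      N = {
          "train": np.array([[1.0, np.nan], [3.0, 4.0]]),
          "val": np.array([[np.nan, 2.0]]),
          "test": np.array([[5.0, np.nan]]),
      }
      C = {
          "train": np.array([["a"], [np.nan]], dtype=object),
          "val": np.array([["b"]], dtype=object),
          "test": np.array([[np.nan]], dtype=object),
      }
      N_filled, num_new_value = fill_numerical_mean(N)
      C_filled = fill_categorical_new(C)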

.. function:: num_enc_process(N_data, num_policy, n_bins=2, y_train=None, is_regression=False, encoder=None)
   :noindex:

   Apply numerical feature encoding/transformation policies.
   
   **Parameters:**
   
   * **N_data** (*ArrayDict*) -- Numerical features dictionary
   * **num_policy** (*str*) -- Numerical encoding policy
   * **n_bins** (*int, optional*) -- Number of bins for discretization. Defaults to 2.
   * **y_train** (*Optional[np.ndarray]*) -- Training labels for supervised encoding
   * **is_regression** (*bool, optional*) -- Whether task is regression. Defaults to False.
   * **encoder** (*Optional[PiecewiseLinearEncoding]*) -- Pre-fitted encoder
   
   **Returns:**
   
   * **Tuple** -- (N_data, encoder) where:

     * N_data: Transformed numerical features
     * encoder: Fitted encoder for future use
   
   **Encoding Policies:**
   
   * **none**: No transformation
   * **Q_PLE**: Quantile-based Piecewise Linear Encoding
   * **T_PLE**: Tree-based Piecewise Linear Encoding
   * **Q_Unary**: Quantile-based Unary Encoding
   * **T_Unary**: Tree-based Unary Encoding
   * **Q_bins**: Quantile-based Bins Encoding
   * **T_bins**: Tree-based Bins Encoding
   * **Q_Johnson**: Quantile-based Johnson Encoding
   * **T_Johnson**: Tree-based Johnson Encoding
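
   To give a feel for what a quantile-based piecewise linear encoding computes, here is a minimal single-feature sketch. It is a simplified illustration rather than the library's encoder: the function names are hypothetical, and every component is clipped to [0, 1], whereas published variants may leave the first and last bins unclipped.

   .. code-block:: python

      import numpy as np


      def quantile_edges(x_train, n_bins):
          """Bin edges at evenly spaced quantiles of the training values."""
          edges = np.quantile(x_train, np.linspace(0.0, 1.0, n_bins + 1))
          return np.unique(edges)  # drop duplicate edges from repeated values


      def ple_encode(x, edges):
          """Encode each value as one component per bin.

          A component is 0 below its bin, 1 above it, and the fractional
          position inside the bin otherwise.
          """
          left, right = edges[:-1], edges[1:]
          ratio = (x[:, None] - left) / (right - left)
          return np.clip(ratio, 0.0, 1.0)


      x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
      edges = quantile_edges(x_train, n_bins=2)
      encoded = ple_encode(np.array([1.0, 3.0]), edges)
      print(edges)    # [0. 2. 4.]
      print(encoded)  # [[0.5 0. ] [1.  0.5]]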

.. function:: data_enc_process(N_data, C_data, cat_policy, y_train=None, ord_encoder=None, mode_values=None, cat_encoder=None)
   :noindex:

   Apply categorical feature encoding policies.
   
   **Parameters:**
   
   * **N_data** (*ArrayDict*) -- Numerical features dictionary
   * **C_data** (*ArrayDict*) -- Categorical features dictionary
   * **cat_policy** (*str*) -- Categorical encoding policy
   * **y_train** (*Optional[np.ndarray]*) -- Training labels for supervised encoding
   * **ord_encoder** (*Optional[OrdinalEncoder]*) -- Pre-fitted ordinal encoder
   * **mode_values** (*Optional[List[int]]*) -- Mode values for unknown categories
   * **cat_encoder** (*Optional[OneHotEncoder]*) -- Pre-fitted categorical encoder
   
   **Returns:**
   
   * **Tuple** -- (N_data, C_data, ord_encoder, mode_values, cat_encoder) where:

     * N_data: Updated numerical features (may be combined with categorical)
     * C_data: Encoded categorical features
     * ord_encoder: Fitted ordinal encoder
     * mode_values: Mode values for unknown categories
     * cat_encoder: Fitted categorical encoder
   
   **Encoding Policies:**
   
   * **indices**: Keep as integer indices (no further encoding)
   * **ordinal**: Ordinal encoding
   * **ohe**: One-hot encoding
   * **binary**: Binary encoding
   * **hash**: Hashing encoding
   * **loo**: Leave-one-out encoding
   * **target**: Target encoding
   * **catboost**: CatBoost encoding
   * **tabr_ohe**: Special one-hot encoding for TabR model
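
   The role of ``mode_values`` can be illustrated with a numpy-only sketch for a single column: categories are indexed on the training data, categories unseen during fitting fall back to the most frequent training index, and the indices can then be one-hot encoded. All helper names are hypothetical, and the library itself relies on fitted encoder objects rather than this code.

   .. code-block:: python

      import numpy as np


      def fit_ordinal(col_train):
          """Learn the sorted category list and the most frequent index."""
          categories, codes = np.unique(col_train, return_inverse=True)
          mode = int(np.bincount(codes).argmax())
          return categories, mode


      def transform_ordinal(col, categories, mode):
          """Map values to indices; unseen categories fall back to the mode."""
          idx = np.clip(np.searchsorted(categories, col), 0, len(categories) - 1)
          known = categories[idx] == col
          return np.where(known, idx, mode)


      def one_hot(codes, n_categories):
          """Turn integer indices into one-hot rows."""
          return np.eye(n_categories, dtype=np.float32)[codes]


      train = np.array(["b", "a", "b", "c"])
      test = np.array(["a", "z", "c"])  # "z" never appears in training

      categories, mode = fit_ordinal(train)
      test_codes = transform_ordinal(test, categories, mode)
      test_ohe = one_hot(test_codes, len(categories))
      print(test_codes)  # [0 1 2]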

.. function:: data_norm_process(N_data, normalization, seed, normalizer=None)
   :noindex:

   Apply normalization to numerical features.
   
   **Parameters:**
   
   * **N_data** (*ArrayDict*) -- Numerical features dictionary
   * **normalization** (*str*) -- Normalization method
   * **seed** (*int*) -- Random seed for reproducible normalization
   * **normalizer** (*Optional[TransformerMixin]*) -- Pre-fitted normalizer
   
   **Returns:**
   
   * **Tuple** -- (N_data, normalizer) where:

     * N_data: Normalized numerical features
     * normalizer: Fitted normalizer for future use
   
   **Normalization Methods:**
   
   * **none**: No normalization
   * **standard**: StandardScaler (zero mean, unit variance)
   * **minmax**: MinMaxScaler (scale to [0, 1])
   * **quantile**: QuantileTransformer (normal distribution)
   * **maxabs**: MaxAbsScaler (scale by maximum absolute value)
   * **power**: PowerTransformer (Yeo-Johnson transformation)
   * **robust**: RobustScaler (robust to outliers)
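
   The fit-once, apply-everywhere pattern behind the returned ``normalizer`` can be sketched for the 'standard' method with plain numpy. The helper names are hypothetical, and fitting on the training partition only is an assumption. The arithmetic mirrors standard scaling with the population standard deviation.

   .. code-block:: python

      import numpy as np


      def fit_standard(x_train):
          """Per-column mean and population standard deviation."""
          mean = x_train.mean(axis=0)
          std = x_train.std(axis=0)
          std[std == 0] = 1.0  # leave constant columns unscaled
          return mean, std


      def apply_standard(N, mean, std):
          """Apply the training statistics to every partition."""
          return {p: (x - mean) / std for p, x in N.items()}


      N = {
          "train": np.array([[0.0], [2.0], [4.0]]),
          "val": np.array([[2.0]]),
          "test": np.array([[6.0]]),
      }
      mean, std = fit_standard(N["train"])
      N_norm = apply_standard(N, mean, std)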

Label Processing
----------------

.. function:: data_label_process(y_data, is_regression, info=None, encoder=None)
   :noindex:

   Process target labels for training.
   
   **Parameters:**
   
   * **y_data** (*ArrayDict*) -- Target labels dictionary
   * **is_regression** (*bool*) -- Whether task is regression
   * **info** (*Optional[Dict[str, Any]]*) -- Label processing information
   * **encoder** (*Optional[LabelEncoder]*) -- Pre-fitted label encoder
   
   **Returns:**
   
   * **Tuple** -- (y, info, encoder) where:

     * y: Processed labels
     * info: Label processing information
     * encoder: Fitted label encoder (None for regression)
   
   **Processing:**
   
   * **Regression**: Standardize labels using mean and standard deviation
   * **Classification**: Encode labels as integers using LabelEncoder
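
   Both branches can be sketched in a few lines of numpy. This is an illustration under assumptions: ``process_labels_sketch`` is a hypothetical name, the keys stored in the returned dictionary are invented for the example, statistics are taken from the training labels, and every validation or test class is assumed to occur in training.

   .. code-block:: python

      import numpy as np


      def process_labels_sketch(y, is_regression):
          """Hypothetical label processing for both task families."""
          if is_regression:
              mean = float(y["train"].mean())
              std = float(y["train"].std())
              # Keep the statistics so predictions can be mapped back later.
              info = {"mean": mean, "std": std}
              return {p: (v - mean) / std for p, v in y.items()}, info
          # Classification: index labels by their sorted unique training values.
          classes = np.unique(y["train"])
          encoded = {p: np.searchsorted(classes, v) for p, v in y.items()}
          return encoded, {"classes": classes}


      y_reg, info_reg = process_labels_sketch(
          {
              "train": np.array([1.0, 2.0, 3.0]),
              "val": np.array([2.0]),
              "test": np.array([4.0]),
          },
          is_regression=True,
      )
      y_cls, info_cls = process_labels_sketch(
          {
              "train": np.array(["cat", "dog", "cat"]),
              "val": np.array(["dog"]),
              "test": np.array(["cat"]),
          },
          is_regression=False,
      )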

Data Loading for Training
-------------------------

.. function:: data_loader_process(is_regression, X, Y, y_info, device, batch_size, is_train, is_float=False)
   :noindex:

   Create PyTorch DataLoaders for training or inference.
   
   **Parameters:**
   
   * **is_regression** (*bool*) -- Whether task is regression
   * **X** (*Tuple[ArrayDict, ArrayDict]*) -- Tuple of (numerical_features, categorical_features)
   * **Y** (*ArrayDict*) -- Target labels
   * **y_info** (*Dict[str, Any]*) -- Label processing information
   * **device** (*torch.device*) -- Device to load data on (CPU/GPU)
   * **batch_size** (*int*) -- Batch size for DataLoader
   * **is_train** (*bool*) -- Whether creating loaders for training
   * **is_float** (*bool, optional*) -- Whether to use float32 precision. Defaults to False.
   
   **Returns:**
   
   * **Tuple** -- For training: (X_num, X_cat, Y, train_loader, val_loader, loss_fn)
   * **Tuple** -- For inference: (X_num, X_cat, Y, test_loader, loss_fn)
   
   **Features:**
   
   * Converts numpy arrays to PyTorch tensors
   * Moves data to specified device (CPU/GPU)
   * Sets appropriate data types (float32/float64)
   * Creates DataLoader with proper batch size and shuffling
   * Returns appropriate loss function

Utility Functions
-----------------

.. function:: to_tensors(data)
   :noindex:

   Convert numpy arrays to PyTorch tensors.
   
   **Parameters:**
   
   * **data** (*ArrayDict*) -- Dictionary of numpy arrays
   
   **Returns:**
   
   * **Dict[str, torch.Tensor]** -- Dictionary of PyTorch tensors

.. function:: get_categories(X_cat)
   :noindex:

   Get the number of unique categories for each categorical feature.
   
   **Parameters:**
   
   * **X_cat** (*Optional[Dict[str, torch.Tensor]]*) -- Categorical features dictionary
   
   **Returns:**
   
   * **Optional[List[int]]** -- List of category counts for each feature, or None if no categorical features

.. function:: raise_unknown(unknown_what, unknown_value)
   :noindex:

   Raise a ValueError for unknown parameter values.
   
   **Parameters:**
   
   * **unknown_what** (*str*) -- Description of the unknown parameter
   * **unknown_value** (*Any*) -- The unknown value that was provided
   
   **Raises:**
   
   * **ValueError** -- With descriptive error message
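
   A helper like this is typically called from the fallback branch of a policy dispatch. The sketch below shows that pattern. The exact message format and both function names are assumptions made for illustration.

   .. code-block:: python

      KNOWN_NORMALIZATIONS = {
          "none", "standard", "minmax", "quantile", "maxabs", "power", "robust",
      }


      def raise_unknown_sketch(unknown_what, unknown_value):
          """Hypothetical equivalent: fail loudly on an unsupported option."""
          raise ValueError(f"Unknown {unknown_what}: {unknown_value}")


      def check_normalization(normalization):
          """Return the name unchanged if it is one of the documented methods."""
          if normalization not in KNOWN_NORMALIZATIONS:
              raise_unknown_sketch("normalization", normalization)
          return normalization


      try:
          check_normalization("zscore")
      except ValueError as err:
          message = str(err)
          print(message)  # Unknown normalization: zscore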

Constants
---------

.. data:: BINCLASS
   :noindex:
   
   String constant for binary classification task type.

.. data:: MULTICLASS
   :noindex:
   
   String constant for multi-class classification task type.

.. data:: REGRESSION
   :noindex:
   
   String constant for regression task type.

.. data:: ArrayDict
   :noindex:
   
   Type alias for dictionary mapping partition names to numpy arrays.