Data

class TALENT.model.lib.data.Dataset(N: Optional[Dict[str, numpy.ndarray]], C: Optional[Dict[str, numpy.ndarray]], y: Dict[str, numpy.ndarray], info: Dict[str, Any])

Bases: object

C: Optional[Dict[str, numpy.ndarray]]
N: Optional[Dict[str, numpy.ndarray]]
info: Dict[str, Any]
property is_binclass: bool
property is_multiclass: bool
property is_regression: bool
property n_cat_features: int
property n_features: int
property n_num_features: int
size(part: str) int

Return the size of the dataset partition.

Args:

  • part: str

Returns: int

y: Dict[str, numpy.ndarray]
TALENT.model.lib.data.data_enc_process(N_data, C_data, cat_policy, y_train=None, ord_encoder=None, mode_values=None, cat_encoder=None)

Process the categorical features in the dataset.

Parameters
  • N_data – ArrayDict

  • C_data – ArrayDict

  • cat_policy – str

  • y_train – Optional[np.ndarray]

  • ord_encoder – Optional[OrdinalEncoder]

  • mode_values – Optional[List[int]]

  • cat_encoder – Optional[OneHotEncoder]

Returns

Tuple[ArrayDict, ArrayDict, Optional[OrdinalEncoder], Optional[List[int]], Optional[OneHotEncoder]]

TALENT.model.lib.data.data_label_process(y_data, is_regression, info=None, encoder=None)

Process the labels in the dataset.

Parameters
  • y_data – ArrayDict

  • is_regression – bool

  • info – Optional[Dict[str, Any]]

  • encoder – Optional[LabelEncoder]

Returns

Tuple[ArrayDict, Dict[str, Any], Optional[LabelEncoder]]

TALENT.model.lib.data.data_loader_process(is_regression, X, Y, y_info, device, batch_size, is_train, is_float=False)

Process the data loader.

Parameters
  • is_regression – bool

  • X – Tuple[ArrayDict, ArrayDict]

  • Y – ArrayDict

  • y_info – Dict[str, Any]

  • device – torch.device

  • batch_size – int

  • is_train – bool

Returns

Tuple[ArrayDict, ArrayDict, ArrayDict, DataLoader, DataLoader, Callable]

TALENT.model.lib.data.data_nan_process(N_data, C_data, num_nan_policy, cat_nan_policy, num_new_value=None, imputer=None, cat_new_value=None)

Process the NaN values in the dataset.

Parameters
  • N_data – ArrayDict

  • C_data – ArrayDict

  • num_nan_policy – str

  • cat_nan_policy – str

  • num_new_value – Optional[np.ndarray]

  • imputer – Optional[SimpleImputer]

  • cat_new_value – Optional[str]

Returns

Tuple[ArrayDict, ArrayDict, Optional[np.ndarray], Optional[SimpleImputer], Optional[str]]

TALENT.model.lib.data.data_norm_process(N_data, normalization, seed, normalizer=None)

Process the normalization of the dataset.

Parameters
  • N_data – ArrayDict

  • normalization – str

  • seed – int

  • normalizer – Optional[TransformerMixin]

Returns

Tuple[ArrayDict, Optional[TransformerMixin]]

TALENT.model.lib.data.dataname_to_numpy(dataset_name, dataset_path)

Load the dataset from the numpy files.

Parameters
  • dataset_name – str

  • dataset_path – str

Returns

Tuple[ArrayDict, ArrayDict, ArrayDict, Dict[str, Any]]

TALENT.model.lib.data.get_categories(X_cat: Optional[Dict[str, torch.Tensor]]) Optional[List[int]]

Get the categories for each categorical feature.

Parameters

X_cat – Optional[Dict[str, torch.Tensor]]

Returns

Optional[List[int]]

TALENT.model.lib.data.get_dataset(dataset_name, dataset_path)

Load the dataset from the numpy files.

Parameters
  • dataset_name – str

  • dataset_path – str

Returns

Tuple[ArrayDict, ArrayDict, ArrayDict, Dict[str, Any]]

TALENT.model.lib.data.load_json(path)
TALENT.model.lib.data.num_enc_process(N_data, num_policy, n_bins=2, y_train=None, is_regression=False, encoder=None)

Process the numerical features in the dataset.

Parameters
  • N_data – ArrayDict

  • num_policy – str

  • n_bins – int

  • y_train – Optional[np.ndarray]

  • is_regression – bool

  • encoder – Optional[PiecewiseLinearEncoding]

Returns

Tuple[ArrayDict, Optional[PiecewiseLinearEncoding]]

TALENT.model.lib.data.raise_unknown(unknown_what: str, unknown_value: Any)
TALENT.model.lib.data.to_tensors(data: Dict[str, numpy.ndarray]) Dict[str, torch.Tensor]

Convert the numpy arrays to torch tensors.

Parameters

data – ArrayDict

Returns

Dict[str, torch.Tensor]

Data Structures

class TALENT.model.lib.data.Dataset

A comprehensive data structure for representing tabular datasets with numerical and categorical features.

Attributes:

  • N (Optional[ArrayDict]) – Numerical features dictionary with ‘train’, ‘val’, ‘test’ keys

  • C (Optional[ArrayDict]) – Categorical features dictionary with ‘train’, ‘val’, ‘test’ keys

  • y (ArrayDict) – Target labels dictionary with ‘train’, ‘val’, ‘test’ keys

  • info (Dict[str, Any]) – Dataset metadata including task type and feature counts

Properties:

property TALENT.model.lib.data.is_binclass

Check if the dataset is for binary classification.

Returns:

  • bool – True if task_type is ‘binclass’

property TALENT.model.lib.data.is_multiclass

Check if the dataset is for multi-class classification.

Returns:

  • bool – True if task_type is ‘multiclass’

property TALENT.model.lib.data.is_regression

Check if the dataset is for regression.

Returns:

  • bool – True if task_type is ‘regression’

property TALENT.model.lib.data.n_num_features

Get the number of numerical features.

Returns:

  • int – Number of numerical features

property TALENT.model.lib.data.n_cat_features

Get the number of categorical features.

Returns:

  • int – Number of categorical features

property TALENT.model.lib.data.n_features

Get the total number of features.

Returns:

  • int – Total number of features (numerical + categorical)

TALENT.model.lib.data.size(part)

Get the size of a specific dataset partition.

Parameters:

  • part (str) – Dataset partition (‘train’, ‘val’, or ‘test’)

Returns:

  • int – Number of samples in the specified partition

Data Loading

TALENT.model.lib.data.dataname_to_numpy(dataset_name, dataset_path)

Load dataset from numpy files stored in the specified directory structure.

Parameters:

  • dataset_name (str) – Name of the dataset directory

  • dataset_path (str) – Base path to the dataset directory

Returns:

  • Tuple – (N, C, y, info) where: * N: Numerical features dictionary or None * C: Categorical features dictionary or None * y: Target labels dictionary * info: Dataset metadata from info.json

Expected Directory Structure:

dataset_path/dataset_name/
├── N_train.npy (optional)
├── N_val.npy (optional)
├── N_test.npy (optional)
├── C_train.npy (optional)
├── C_val.npy (optional)
├── C_test.npy (optional)
├── y_train.npy
├── y_val.npy
├── y_test.npy
└── info.json
TALENT.model.lib.data.get_dataset(dataset_name, dataset_path)

Load and split dataset into training/validation and test sets.

Parameters:

  • dataset_name (str) – Name of the dataset directory

  • dataset_path (str) – Base path to the dataset directory

Returns:

  • Tuple – (train_val_data, test_data, info) where: * train_val_data: Tuple of (N_trainval, C_trainval, y_trainval) * test_data: Tuple of (N_test, C_test, y_test) * info: Dataset metadata

TALENT.model.lib.data.load_json(path)

Load and parse a JSON file.

Parameters:

  • path (str) – Path to the JSON file

Returns:

  • dict – Parsed JSON content

Data Preprocessing

TALENT.model.lib.data.data_nan_process(N_data, C_data, num_nan_policy, cat_nan_policy, num_new_value=None, imputer=None, cat_new_value=None)

Handle missing values (NaN) in numerical and categorical features.

Parameters:

  • N_data (ArrayDict) – Numerical features dictionary

  • C_data (ArrayDict) – Categorical features dictionary

  • num_nan_policy (str) – Policy for numerical NaN values (‘mean’, ‘median’)

  • cat_nan_policy (str) – Policy for categorical NaN values (‘new’, ‘most_frequent’)

  • num_new_value (Optional[np.ndarray]) – Pre-computed values for numerical NaN replacement

  • imputer (Optional[SimpleImputer]) – Fitted imputer for categorical features

  • cat_new_value (Optional[str]) – Value to replace categorical NaN values

Returns:

  • Tuple – (N, C, num_new_value, imputer, cat_new_value) where: * N: Processed numerical features * C: Processed categorical features * num_new_value: Values used for numerical NaN replacement * imputer: Fitted imputer for categorical features * cat_new_value: Value used for categorical NaN replacement

Numerical NaN Policies:

  • mean: Replace NaN with mean of the feature

  • median: Replace NaN with median of the feature

Categorical NaN Policies:

  • new: Replace NaN with special token ‘___null___’

  • most_frequent: Replace NaN with most frequent value using SimpleImputer

TALENT.model.lib.data.num_enc_process(N_data, num_policy, n_bins=2, y_train=None, is_regression=False, encoder=None)

Apply numerical feature encoding/transformation policies.

Parameters:

  • N_data (ArrayDict) – Numerical features dictionary

  • num_policy (str) – Numerical encoding policy

  • n_bins (int, optional) – Number of bins for discretization. Defaults to 2.

  • y_train (Optional[np.ndarray]) – Training labels for supervised encoding

  • is_regression (bool, optional) – Whether task is regression. Defaults to False.

  • encoder (Optional[PiecewiseLinearEncoding]) – Pre-fitted encoder

Returns:

  • Tuple – (N_data, encoder) where: * N_data: Transformed numerical features * encoder: Fitted encoder for future use

Encoding Policies:

  • none: No transformation

  • Q_PLE: Quantile-based Piecewise Linear Encoding

  • T_PLE: Tree-based Piecewise Linear Encoding

  • Q_Unary: Quantile-based Unary Encoding

  • T_Unary: Tree-based Unary Encoding

  • Q_bins: Quantile-based Bins Encoding

  • T_bins: Tree-based Bins Encoding

  • Q_Johnson: Quantile-based Johnson Encoding

  • T_Johnson: Tree-based Johnson Encoding

TALENT.model.lib.data.data_enc_process(N_data, C_data, cat_policy, y_train=None, ord_encoder=None, mode_values=None, cat_encoder=None)

Apply categorical feature encoding policies.

Parameters:

  • N_data (ArrayDict) – Numerical features dictionary

  • C_data (ArrayDict) – Categorical features dictionary

  • cat_policy (str) – Categorical encoding policy

  • y_train (Optional[np.ndarray]) – Training labels for supervised encoding

  • ord_encoder (Optional[OrdinalEncoder]) – Pre-fitted ordinal encoder

  • mode_values (Optional[List[int]]) – Mode values for unknown categories

  • cat_encoder (Optional[OneHotEncoder]) – Pre-fitted categorical encoder

Returns:

  • Tuple – (N_data, C_data, ord_encoder, mode_values, cat_encoder) where: * N_data: Updated numerical features (may be combined with categorical) * C_data: Encoded categorical features * ord_encoder: Fitted ordinal encoder * mode_values: Mode values for unknown categories * cat_encoder: Fitted categorical encoder

Encoding Policies:

  • indices: Keep as integer indices (no further encoding)

  • ordinal: Ordinal encoding

  • ohe: One-hot encoding

  • binary: Binary encoding

  • hash: Hashing encoding

  • loo: Leave-one-out encoding

  • target: Target encoding

  • catboost: CatBoost encoding

  • tabr_ohe: Special one-hot encoding for TabR model

TALENT.model.lib.data.data_norm_process(N_data, normalization, seed, normalizer=None)

Apply normalization to numerical features.

Parameters:

  • N_data (ArrayDict) – Numerical features dictionary

  • normalization (str) – Normalization method

  • seed (int) – Random seed for reproducible normalization

  • normalizer (Optional[TransformerMixin]) – Pre-fitted normalizer

Returns:

  • Tuple – (N_data, normalizer) where: * N_data: Normalized numerical features * normalizer: Fitted normalizer for future use

Normalization Methods:

  • none: No normalization

  • standard: StandardScaler (zero mean, unit variance)

  • minmax: MinMaxScaler (scale to [0, 1])

  • quantile: QuantileTransformer (normal distribution)

  • maxabs: MaxAbsScaler (scale by maximum absolute value)

  • power: PowerTransformer (Yeo-Johnson transformation)

  • robust: RobustScaler (robust to outliers)

Label Processing

TALENT.model.lib.data.data_label_process(y_data, is_regression, info=None, encoder=None)

Process target labels for training.

Parameters:

  • y_data (ArrayDict) – Target labels dictionary

  • is_regression (bool) – Whether task is regression

  • info (Optional[Dict[str, Any]]) – Label processing information

  • encoder (Optional[LabelEncoder]) – Pre-fitted label encoder

Returns:

  • Tuple – (y, info, encoder) where: * y: Processed labels * info: Label processing information * encoder: Fitted label encoder (None for regression)

Processing:

  • Regression: Standardize labels using mean and standard deviation

  • Classification: Encode labels as integers using LabelEncoder

Data Loading for Training

TALENT.model.lib.data.data_loader_process(is_regression, X, Y, y_info, device, batch_size, is_train, is_float=False)

Create PyTorch DataLoaders for training or inference.

Parameters:

  • is_regression (bool) – Whether task is regression

  • X (Tuple[ArrayDict, ArrayDict]) – Tuple of (numerical_features, categorical_features)

  • Y (ArrayDict) – Target labels

  • y_info (Dict[str, Any]) – Label processing information

  • device (torch.device) – Device to load data on (CPU/GPU)

  • batch_size (int) – Batch size for DataLoader

  • is_train (bool) – Whether creating loaders for training

  • is_float (bool, optional) – Whether to use float32 precision. Defaults to False.

Returns:

  • Tuple – For training: (X_num, X_cat, Y, train_loader, val_loader, loss_fn)

  • Tuple – For inference: (X_num, X_cat, Y, test_loader, loss_fn)

Features:

  • Converts numpy arrays to PyTorch tensors

  • Moves data to specified device (CPU/GPU)

  • Sets appropriate data types (float32/float64)

  • Creates DataLoader with proper batch size and shuffling

  • Returns appropriate loss function

Utility Functions

TALENT.model.lib.data.to_tensors(data)

Convert numpy arrays to PyTorch tensors.

Parameters:

  • data (ArrayDict) – Dictionary of numpy arrays

Returns:

  • Dict[str, torch.Tensor] – Dictionary of PyTorch tensors

TALENT.model.lib.data.get_categories(X_cat)

Get the number of unique categories for each categorical feature.

Parameters:

  • X_cat (Optional[Dict[str, torch.Tensor]]) – Categorical features dictionary

Returns:

  • Optional[List[int]] – List of category counts for each feature, or None if no categorical features

TALENT.model.lib.data.raise_unknown(unknown_what, unknown_value)

Raise a ValueError for unknown parameter values.

Parameters:

  • unknown_what (str) – Description of the unknown parameter

  • unknown_value (Any) – The unknown value that was provided

Raises:

  • ValueError – With descriptive error message

Constants

TALENT.model.lib.data.BINCLASS

String constant for binary classification task type.

TALENT.model.lib.data.MULTICLASS

String constant for multi-class classification task type.

TALENT.model.lib.data.REGRESSION

String constant for regression task type.

TALENT.model.lib.data.ArrayDict

Type alias for dictionary mapping partition names to numpy arrays.