Data
- class TALENT.model.lib.data.Dataset(N: Optional[Dict[str, numpy.ndarray]], C: Optional[Dict[str, numpy.ndarray]], y: Dict[str, numpy.ndarray], info: Dict[str, Any])
Bases: object
- TALENT.model.lib.data.data_enc_process(N_data, C_data, cat_policy, y_train=None, ord_encoder=None, mode_values=None, cat_encoder=None)
Process the categorical features in the dataset.
- Parameters
N_data – ArrayDict
C_data – ArrayDict
cat_policy – str
y_train – Optional[np.ndarray]
ord_encoder – Optional[OrdinalEncoder]
mode_values – Optional[List[int]]
cat_encoder – Optional[OneHotEncoder]
- Returns
Tuple[ArrayDict, ArrayDict, Optional[OrdinalEncoder], Optional[List[int]], Optional[OneHotEncoder]]
- TALENT.model.lib.data.data_label_process(y_data, is_regression, info=None, encoder=None)
Process the labels in the dataset.
- Parameters
y_data – ArrayDict
is_regression – bool
info – Optional[Dict[str, Any]]
encoder – Optional[LabelEncoder]
- Returns
Tuple[ArrayDict, Dict[str, Any], Optional[LabelEncoder]]
- TALENT.model.lib.data.data_loader_process(is_regression, X, Y, y_info, device, batch_size, is_train, is_float=False)
Process the data loader.
- Parameters
is_regression – bool
X – Tuple[ArrayDict, ArrayDict]
Y – ArrayDict
y_info – Dict[str, Any]
device – torch.device
batch_size – int
is_train – bool
is_float – bool
- Returns
Tuple[ArrayDict, ArrayDict, ArrayDict, DataLoader, DataLoader, Callable]
- TALENT.model.lib.data.data_nan_process(N_data, C_data, num_nan_policy, cat_nan_policy, num_new_value=None, imputer=None, cat_new_value=None)
Process the NaN values in the dataset.
- Parameters
N_data – ArrayDict
C_data – ArrayDict
num_nan_policy – str
cat_nan_policy – str
num_new_value – Optional[np.ndarray]
imputer – Optional[SimpleImputer]
cat_new_value – Optional[str]
- Returns
Tuple[ArrayDict, ArrayDict, Optional[np.ndarray], Optional[SimpleImputer], Optional[str]]
- TALENT.model.lib.data.data_norm_process(N_data, normalization, seed, normalizer=None)
Process the normalization of the dataset.
- Parameters
N_data – ArrayDict
normalization – str
seed – int
normalizer – Optional[TransformerMixin]
- Returns
Tuple[ArrayDict, Optional[TransformerMixin]]
- TALENT.model.lib.data.dataname_to_numpy(dataset_name, dataset_path)
Load the dataset from the numpy files.
- Parameters
dataset_name – str
dataset_path – str
- Returns
Tuple[ArrayDict, ArrayDict, ArrayDict, Dict[str, Any]]
- TALENT.model.lib.data.get_categories(X_cat: Optional[Dict[str, torch.Tensor]]) → Optional[List[int]]
Get the categories for each categorical feature.
- Parameters
X_cat – Optional[Dict[str, torch.Tensor]]
- Returns
Optional[List[int]]
- TALENT.model.lib.data.get_dataset(dataset_name, dataset_path)
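Load the dataset and split it into train/validation and test parts.
- Parameters
dataset_name – str
dataset_path – str
- Returns
Tuple[Tuple[Optional[ArrayDict], Optional[ArrayDict], ArrayDict], Tuple[Optional[ArrayDict], Optional[ArrayDict], ArrayDict], Dict[str, Any]]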
- TALENT.model.lib.data.load_json(path)
- TALENT.model.lib.data.num_enc_process(N_data, num_policy, n_bins=2, y_train=None, is_regression=False, encoder=None)
Process the numerical features in the dataset.
- Parameters
N_data – ArrayDict
num_policy – str
n_bins – int
y_train – Optional[np.ndarray]
is_regression – bool
encoder – Optional[PiecewiseLinearEncoding]
- Returns
Tuple[ArrayDict, Optional[PiecewiseLinearEncoding]]
- TALENT.model.lib.data.to_tensors(data: Dict[str, numpy.ndarray]) → Dict[str, torch.Tensor]
Convert the numpy arrays to torch tensors.
- Parameters
data – ArrayDict
- Returns
Dict[str, torch.Tensor]
Data Structures
- class TALENT.model.lib.data.Dataset
A comprehensive data structure for representing tabular datasets with numerical and categorical features.
Attributes:
N (Optional[ArrayDict]) – Numerical features dictionary with ‘train’, ‘val’, ‘test’ keys
C (Optional[ArrayDict]) – Categorical features dictionary with ‘train’, ‘val’, ‘test’ keys
y (ArrayDict) – Target labels dictionary with ‘train’, ‘val’, ‘test’ keys
info (Dict[str, Any]) – Dataset metadata including task type and feature counts
Properties:
- property TALENT.model.lib.data.Dataset.is_binclass
Check if the dataset is for binary classification.
Returns:
bool – True if task_type is ‘binclass’
- property TALENT.model.lib.data.Dataset.is_multiclass
Check if the dataset is for multi-class classification.
Returns:
bool – True if task_type is ‘multiclass’
- property TALENT.model.lib.data.Dataset.is_regression
Check if the dataset is for regression.
Returns:
bool – True if task_type is ‘regression’
- property TALENT.model.lib.data.Dataset.n_num_features
Get the number of numerical features.
Returns:
int – Number of numerical features
- property TALENT.model.lib.data.Dataset.n_cat_features
Get the number of categorical features.
Returns:
int – Number of categorical features
- property TALENT.model.lib.data.Dataset.n_features
Get the total number of features.
Returns:
int – Total number of features (numerical + categorical)
- TALENT.model.lib.data.Dataset.size(part)
Get the size of a specific dataset partition.
Parameters:
part (str) – Dataset partition (‘train’, ‘val’, or ‘test’)
Returns:
int – Number of samples in the specified partition
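The structure described above can be sketched as a small dataclass. This is an illustrative stand-in, not the TALENT implementation: the class name `DatasetSketch` and the `info` keys `n_num_features` and `n_cat_features` are assumptions, and only the `task_type` values come from the property descriptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

import numpy as np

ArrayDict = Dict[str, np.ndarray]


@dataclass
class DatasetSketch:
    """Illustrative stand-in for the documented Dataset (not the TALENT class)."""
    N: Optional[ArrayDict]
    C: Optional[ArrayDict]
    y: ArrayDict
    info: Dict[str, Any]

    @property
    def is_regression(self) -> bool:
        return self.info["task_type"] == "regression"

    @property
    def n_num_features(self) -> int:
        # Assumed key name; the docs only say that info holds feature counts.
        return self.info["n_num_features"]

    @property
    def n_cat_features(self) -> int:
        # Assumed key name, as above.
        return self.info["n_cat_features"]

    @property
    def n_features(self) -> int:
        return self.n_num_features + self.n_cat_features

    def size(self, part: str) -> int:
        # Labels exist for every partition, so they define its size.
        return len(self.y[part])


ds = DatasetSketch(
    N={"train": np.zeros((8, 3)), "val": np.zeros((2, 3)), "test": np.zeros((2, 3))},
    C=None,  # a dataset without categorical features
    y={"train": np.zeros(8), "val": np.zeros(2), "test": np.zeros(2)},
    info={"task_type": "regression", "n_num_features": 3, "n_cat_features": 0},
)
print(ds.is_regression, ds.n_features, ds.size("train"))
```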
Data Loading
- TALENT.model.lib.data.dataname_to_numpy(dataset_name, dataset_path)
Load dataset from numpy files stored in the specified directory structure.
Parameters:
dataset_name (str) – Name of the dataset directory
dataset_path (str) – Base path to the dataset directory
Returns:
Tuple – (N, C, y, info) where:
N: Numerical features dictionary, or None
C: Categorical features dictionary, or None
y: Target labels dictionary
info: Dataset metadata from info.json
Expected Directory Structure:
dataset_path/dataset_name/
├── N_train.npy (optional)
├── N_val.npy (optional)
├── N_test.npy (optional)
├── C_train.npy (optional)
├── C_val.npy (optional)
├── C_test.npy (optional)
├── y_train.npy
├── y_val.npy
├── y_test.npy
└── info.json
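To make the directory layout concrete, here is a minimal sketch that writes a toy dataset in that layout and reads it back. The helper `load_parts` and the directory name `toy_dataset` are hypothetical, and treating a missing `*_train.npy` file as "feature type absent" is an assumption about how the optional files are handled.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def load_parts(root: Path, prefix: str):
    """Load <prefix>_{train,val,test}.npy, or return None if the train file is absent."""
    if not (root / f"{prefix}_train.npy").exists():
        return None
    return {part: np.load(root / f"{prefix}_{part}.npy") for part in ("train", "val", "test")}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "toy_dataset"  # hypothetical dataset name
    root.mkdir()
    for part, n in (("train", 6), ("val", 2), ("test", 2)):
        np.save(root / f"N_{part}.npy", np.random.rand(n, 3))
        np.save(root / f"y_{part}.npy", np.arange(n))
    (root / "info.json").write_text(json.dumps({"task_type": "regression"}))

    N = load_parts(root, "N")
    C = load_parts(root, "C")  # no C_*.npy files were written, so this is None
    y = load_parts(root, "y")
    info = json.loads((root / "info.json").read_text())

print(N["train"].shape, C, y["test"].shape, info["task_type"])
```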
- TALENT.model.lib.data.get_dataset(dataset_name, dataset_path)
Load and split dataset into training/validation and test sets.
Parameters:
dataset_name (str) – Name of the dataset directory
dataset_path (str) – Base path to the dataset directory
Returns:
Tuple – (train_val_data, test_data, info) where:
train_val_data: Tuple of (N_trainval, C_trainval, y_trainval)
test_data: Tuple of (N_test, C_test, y_test)
info: Dataset metadata
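The shape of this return value can be illustrated with a small sketch. The helper `split_parts` is hypothetical, and keeping the test arrays in a dictionary keyed by 'test' is an assumption, not something the description above confirms.

```python
import numpy as np


def split_parts(data):
    """Split one partition dictionary into (train/val dictionary, test dictionary)."""
    if data is None:
        return None, None
    trainval = {k: data[k] for k in ("train", "val")}
    test = {"test": data["test"]}
    return trainval, test


N = {"train": np.ones((4, 2)), "val": np.ones((1, 2)), "test": np.ones((1, 2))}
C = None  # dataset without categorical features
y = {"train": np.zeros(4), "val": np.zeros(1), "test": np.zeros(1)}

(N_trainval, N_test), (C_trainval, C_test), (y_trainval, y_test) = (
    split_parts(d) for d in (N, C, y)
)
train_val_data = (N_trainval, C_trainval, y_trainval)
test_data = (N_test, C_test, y_test)
```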
- TALENT.model.lib.data.load_json(path)
Load and parse a JSON file.
Parameters:
path (str) – Path to the JSON file
Returns:
dict – Parsed JSON content
Data Preprocessing
- TALENT.model.lib.data.data_nan_process(N_data, C_data, num_nan_policy, cat_nan_policy, num_new_value=None, imputer=None, cat_new_value=None)
Handle missing values (NaN) in numerical and categorical features.
Parameters:
N_data (ArrayDict) – Numerical features dictionary
C_data (ArrayDict) – Categorical features dictionary
num_nan_policy (str) – Policy for numerical NaN values (‘mean’, ‘median’)
cat_nan_policy (str) – Policy for categorical NaN values (‘new’, ‘most_frequent’)
num_new_value (Optional[np.ndarray]) – Pre-computed values for numerical NaN replacement
imputer (Optional[SimpleImputer]) – Fitted imputer for categorical features
cat_new_value (Optional[str]) – Value to replace categorical NaN values
Returns:
Tuple – (N, C, num_new_value, imputer, cat_new_value) where:
N: Processed numerical features
C: Processed categorical features
num_new_value: Values used for numerical NaN replacement
imputer: Fitted imputer for categorical features
cat_new_value: Value used for categorical NaN replacement
Numerical NaN Policies:
mean: Replace NaN with mean of the feature
median: Replace NaN with median of the feature
Categorical NaN Policies:
new: Replace NaN with special token ‘___null___’
most_frequent: Replace NaN with most frequent value using SimpleImputer
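The 'mean' and 'new' policies can be sketched with plain NumPy. This is not the library's code: computing the replacement values from the training partition only, and representing missing categorical entries as the string "nan", are both assumptions made for the example.

```python
import numpy as np

N = {
    "train": np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]]),
    "test": np.array([[np.nan, np.nan]]),
}
C = {
    "train": np.array([["a"], ["nan"], ["b"]]),
    "test": np.array([["nan"]]),
}

# 'mean' policy: per-feature means come from the training partition only (assumption).
num_new_value = np.nanmean(N["train"], axis=0)
N_filled = {k: np.where(np.isnan(v), num_new_value, v) for k, v in N.items()}

# 'new' policy: missing entries (assumed to be stored as the string "nan")
# are replaced with the special token named in the docs.
cat_new_value = "___null___"
C_filled = {k: np.where(v == "nan", cat_new_value, v) for k, v in C.items()}

print(N_filled["test"], C_filled["train"].ravel())
```

Returning `num_new_value` and `cat_new_value` alongside the data is what allows the same replacement values to be reused on data seen later.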
- TALENT.model.lib.data.num_enc_process(N_data, num_policy, n_bins=2, y_train=None, is_regression=False, encoder=None)
Apply numerical feature encoding/transformation policies.
Parameters:
N_data (ArrayDict) – Numerical features dictionary
num_policy (str) – Numerical encoding policy
n_bins (int, optional) – Number of bins for discretization. Defaults to 2.
y_train (Optional[np.ndarray]) – Training labels for supervised encoding
is_regression (bool, optional) – Whether task is regression. Defaults to False.
encoder (Optional[PiecewiseLinearEncoding]) – Pre-fitted encoder
Returns:
Tuple – (N_data, encoder) where:
N_data: Transformed numerical features
encoder: Fitted encoder for future use
Encoding Policies:
none: No transformation
Q_PLE: Quantile-based Piecewise Linear Encoding
T_PLE: Tree-based Piecewise Linear Encoding
Q_Unary: Quantile-based Unary Encoding
T_Unary: Tree-based Unary Encoding
Q_bins: Quantile-based Bins Encoding
T_bins: Tree-based Bins Encoding
Q_Johnson: Quantile-based Johnson Encoding
T_Johnson: Tree-based Johnson Encoding
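As an illustration of what the Q_PLE policy refers to, the following is a simplified sketch of quantile-based piecewise linear encoding for a single feature. The function names are hypothetical and the code is not the library's `PiecewiseLinearEncoding`; it clips every bin to [0, 1] and assumes the quantile edges are distinct.

```python
import numpy as np


def quantile_edges(train_col: np.ndarray, n_bins: int) -> np.ndarray:
    """Bin edges at evenly spaced quantiles of the training column."""
    return np.quantile(train_col, np.linspace(0.0, 1.0, n_bins + 1))


def ple_encode(x: np.ndarray, edges: np.ndarray) -> np.ndarray:
    """Piecewise linear encoding of a 1-D column; returns shape (len(x), n_bins)."""
    left, right = edges[:-1], edges[1:]
    # Fraction of each bin that lies below x: 0 before the bin, 1 after it.
    ratio = (x[:, None] - left[None, :]) / (right - left)[None, :]
    return np.clip(ratio, 0.0, 1.0)


train_col = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
edges = quantile_edges(train_col, n_bins=2)  # edges at the 0%, 50%, 100% quantiles
encoded = ple_encode(np.array([0.0, 1.0, 3.0, 4.0]), edges)
print(edges)
print(encoded)
```

Each value becomes one number per bin: bins entirely below the value are 1, bins entirely above it are 0, and the bin containing it holds the fractional position inside that bin.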
- TALENT.model.lib.data.data_enc_process(N_data, C_data, cat_policy, y_train=None, ord_encoder=None, mode_values=None, cat_encoder=None)
Apply categorical feature encoding policies.
Parameters:
N_data (ArrayDict) – Numerical features dictionary
C_data (ArrayDict) – Categorical features dictionary
cat_policy (str) – Categorical encoding policy
y_train (Optional[np.ndarray]) – Training labels for supervised encoding
ord_encoder (Optional[OrdinalEncoder]) – Pre-fitted ordinal encoder
mode_values (Optional[List[int]]) – Mode values for unknown categories
cat_encoder (Optional[OneHotEncoder]) – Pre-fitted categorical encoder
Returns:
Tuple – (N_data, C_data, ord_encoder, mode_values, cat_encoder) where:
N_data: Updated numerical features (may be combined with categorical)
C_data: Encoded categorical features
ord_encoder: Fitted ordinal encoder
mode_values: Mode values for unknown categories
cat_encoder: Fitted categorical encoder
Encoding Policies:
indices: Keep as integer indices (no further encoding)
ordinal: Ordinal encoding
ohe: One-hot encoding
binary: Binary encoding
hash: Hashing encoding
loo: Leave-one-out encoding
target: Target encoding
catboost: CatBoost encoding
tabr_ohe: Special one-hot encoding for TabR model
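The role of `mode_values` is easiest to see in a small example. The sketch below uses NumPy only, with hypothetical helper names, and assumes that a category unseen during training is mapped to the index of the most frequent training category before any one-hot step. The library itself works with scikit-learn encoders, so this shows the idea and not its code path.

```python
import numpy as np


def fit_ordinal(train_col):
    """Learn sorted categories and the index of the most frequent one."""
    categories, codes = np.unique(train_col, return_inverse=True)
    mode_code = int(np.bincount(codes).argmax())
    return categories, mode_code


def apply_ordinal(col, categories, mode_code):
    """Map values to category indices; unseen values fall back to the mode."""
    idx = np.clip(np.searchsorted(categories, col), 0, len(categories) - 1)
    known = categories[idx] == col
    return np.where(known, idx, mode_code)


train_col = np.array(["a", "b", "a", "c"])
test_col = np.array(["b", "z", "a"])  # "z" never appears in training

categories, mode_code = fit_ordinal(train_col)
test_codes = apply_ordinal(test_col, categories, mode_code)

# One-hot encoding is then a lookup into an identity matrix.
test_onehot = np.eye(len(categories))[test_codes]
print(test_codes)
print(test_onehot)
```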
- TALENT.model.lib.data.data_norm_process(N_data, normalization, seed, normalizer=None)
Apply normalization to numerical features.
Parameters:
N_data (ArrayDict) – Numerical features dictionary
normalization (str) – Normalization method
seed (int) – Random seed for reproducible normalization
normalizer (Optional[TransformerMixin]) – Pre-fitted normalizer
Returns:
Tuple – (N_data, normalizer) where:
N_data: Normalized numerical features
normalizer: Fitted normalizer for future use
Normalization Methods:
none: No normalization
standard: StandardScaler (zero mean, unit variance)
minmax: MinMaxScaler (scale to [0, 1])
quantile: QuantileTransformer (maps each feature to a normal distribution)
maxabs: MaxAbsScaler (scale by maximum absolute value)
power: PowerTransformer (Yeo-Johnson transformation)
robust: RobustScaler (robust to outliers)
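The fit-then-reuse pattern behind the `normalizer` argument can be sketched for the 'standard' method with NumPy alone. Fitting on the training partition only is an assumption here, and the `(mean, std)` pair stands in for the fitted scikit-learn transformer that the function returns.

```python
import numpy as np

rng = np.random.default_rng(0)
N = {
    "train": rng.normal(5.0, 2.0, size=(100, 3)),
    "val": rng.normal(5.0, 2.0, size=(20, 3)),
    "test": rng.normal(5.0, 2.0, size=(20, 3)),
}

# Fit on the training partition only (assumption), then reuse for every partition.
mean = N["train"].mean(axis=0)
std = N["train"].std(axis=0)
N_norm = {k: (v - mean) / std for k, v in N.items()}

print(N_norm["train"].mean(axis=0).round(6), N_norm["train"].std(axis=0).round(6))
```

Only the training partition ends up with exactly zero mean and unit variance; validation and test data are shifted and scaled by the same training statistics.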
Label Processing
- TALENT.model.lib.data.data_label_process(y_data, is_regression, info=None, encoder=None)
Process target labels for training.
Parameters:
y_data (ArrayDict) – Target labels dictionary
is_regression (bool) – Whether task is regression
info (Optional[Dict[str, Any]]) – Label processing information
encoder (Optional[LabelEncoder]) – Pre-fitted label encoder
Returns:
Tuple – (y, info, encoder) where:
y: Processed labels
info: Label processing information
encoder: Fitted label encoder (None for regression)
Processing:
Regression: Standardize labels using mean and standard deviation
Classification: Encode labels as integers using LabelEncoder
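Both branches can be sketched with NumPy. The `info` key names `mean` and `std` are assumptions, as is the use of training statistics only. The classification branch reproduces the sorted-class integer mapping that scikit-learn's `LabelEncoder` produces, without calling it.

```python
import numpy as np

# Regression: standardize with training statistics and keep them in `info`
# so that predictions can be mapped back to the original scale.
y_reg = {"train": np.array([10.0, 20.0, 30.0]), "test": np.array([40.0])}
info = {"mean": y_reg["train"].mean(), "std": y_reg["train"].std()}  # assumed key names
y_reg_std = {k: (v - info["mean"]) / info["std"] for k, v in y_reg.items()}
restored = y_reg_std["test"] * info["std"] + info["mean"]

# Classification: map class names to integers 0..K-1 in sorted order.
y_cls = {"train": np.array(["dog", "cat", "dog"]), "test": np.array(["cat"])}
classes = np.unique(y_cls["train"])
y_cls_int = {k: np.searchsorted(classes, v) for k, v in y_cls.items()}

print(restored, y_cls_int["train"])
```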
Data Loading for Training
- TALENT.model.lib.data.data_loader_process(is_regression, X, Y, y_info, device, batch_size, is_train, is_float=False)
Create PyTorch DataLoaders for training or inference.
Parameters:
is_regression (bool) – Whether task is regression
X (Tuple[ArrayDict, ArrayDict]) – Tuple of (numerical_features, categorical_features)
Y (ArrayDict) – Target labels
y_info (Dict[str, Any]) – Label processing information
device (torch.device) – Device to load data on (CPU/GPU)
batch_size (int) – Batch size for DataLoader
is_train (bool) – Whether creating loaders for training
is_float (bool, optional) – Whether to use float32 precision. Defaults to False.
Returns:
Tuple – For training: (X_num, X_cat, Y, train_loader, val_loader, loss_fn)
Tuple – For inference: (X_num, X_cat, Y, test_loader, loss_fn)
Features:
Converts numpy arrays to PyTorch tensors
Moves data to specified device (CPU/GPU)
Sets appropriate data types (float32/float64)
Creates DataLoader with proper batch size and shuffling
Returns appropriate loss function
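The steps listed above can be sketched with standard PyTorch utilities. The helper `make_loaders` is hypothetical and handles numerical features only. The loss choice (MSE for regression, cross-entropy for classification) is an assumption, not a statement of what the library returns.

```python
import numpy as np
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


def make_loaders(X_num, y, batch_size, is_regression, device=torch.device("cpu")):
    """Build one DataLoader per partition and pick a loss function."""
    loaders = {}
    for part, x in X_num.items():
        x_t = torch.from_numpy(x).float().to(device)
        y_t = torch.from_numpy(y[part]).to(device)
        y_t = y_t.float() if is_regression else y_t.long()
        loaders[part] = DataLoader(
            TensorDataset(x_t, y_t),
            batch_size=batch_size,
            shuffle=(part == "train"),  # shuffle only the training data
        )
    # Assumed loss choice: MSE for regression, cross-entropy for classification.
    loss_fn = F.mse_loss if is_regression else F.cross_entropy
    return loaders, loss_fn


X_num = {"train": np.random.rand(10, 4), "val": np.random.rand(4, 4)}
y = {"train": np.random.randint(0, 3, size=10), "val": np.random.randint(0, 3, size=4)}
loaders, loss_fn = make_loaders(X_num, y, batch_size=4, is_regression=False)
xb, yb = next(iter(loaders["val"]))
print(xb.shape, xb.dtype, yb.dtype, len(loaders["train"]))
```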
Utility Functions
- TALENT.model.lib.data.to_tensors(data)
Convert numpy arrays to PyTorch tensors.
Parameters:
data (ArrayDict) – Dictionary of numpy arrays
Returns:
Dict[str, torch.Tensor] – Dictionary of PyTorch tensors
- TALENT.model.lib.data.get_categories(X_cat)
Get the number of unique categories for each categorical feature.
Parameters:
X_cat (Optional[Dict[str, torch.Tensor]]) – Categorical features dictionary
Returns:
Optional[List[int]] – List of category counts for each feature, or None if no categorical features
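A minimal sketch of these two utilities follows, using hypothetical `*_sketch` names to make clear they are not the library functions. Counting categories on the training partition only is an assumption.

```python
import numpy as np
import torch


def to_tensors_sketch(data):
    """Convert every array in a partition dictionary to a tensor."""
    return {k: torch.as_tensor(v) for k, v in data.items()}


def get_categories_sketch(X_cat):
    """Count distinct values per column of the training partition."""
    if X_cat is None:
        return None
    train = X_cat["train"]
    return [len(torch.unique(train[:, i])) for i in range(train.shape[1])]


C = {
    "train": np.array([[0, 1], [1, 1], [2, 0]]),
    "val": np.array([[0, 0]]),
}
X_cat = to_tensors_sketch(C)
categories = get_categories_sketch(X_cat)
print(categories)
```

Per-feature counts like these are typically used to size one embedding table per categorical feature.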
- TALENT.model.lib.data.raise_unknown(unknown_what, unknown_value)
Raise a ValueError for unknown parameter values.
Parameters:
unknown_what (str) – Description of the unknown parameter
unknown_value (Any) – The unknown value that was provided
Raises:
ValueError – With descriptive error message
Constants
- TALENT.model.lib.data.BINCLASS
String constant for binary classification task type.
- TALENT.model.lib.data.MULTICLASS
String constant for multi-class classification task type.
- TALENT.model.lib.data.REGRESSION
String constant for regression task type.
- TALENT.model.lib.data.ArrayDict
Type alias for a dictionary mapping partition names to NumPy arrays.