LAMDA-TALENT Data Module
The data module provides utilities for loading datasets from disk and preprocessing them: handling missing values, encoding numerical and categorical features, processing labels, and normalizing features.
- class TALENT.model.lib.data.Dataset(N: Optional[Dict[str, numpy.ndarray]], C: Optional[Dict[str, numpy.ndarray]], y: Dict[str, numpy.ndarray], info: Dict[str, Any])
Bases: object
- TALENT.model.lib.data.data_enc_process(N_data, C_data, cat_policy, y_train=None, ord_encoder=None, mode_values=None, cat_encoder=None)
Process the categorical features in the dataset.
- Parameters
N_data – ArrayDict
C_data – ArrayDict
cat_policy – str
y_train – Optional[np.ndarray]
ord_encoder – Optional[OrdinalEncoder]
mode_values – Optional[List[int]]
cat_encoder – Optional[OneHotEncoder]
- Returns
Tuple[ArrayDict, ArrayDict, Optional[OrdinalEncoder], Optional[List[int]], Optional[OneHotEncoder]]
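The optional encoder arguments, together with the encoders in the return value, suggest a fit-once, reuse-later pattern: fit on the training split, then pass the fitted state back in for other data. The sketch below illustrates that pattern with a hand-rolled ordinal encoder in NumPy. `fit_ordinal` and `apply_ordinal` are hypothetical helpers, not part of TALENT, and mapping unseen categories to -1 is an assumption made for the example.

```python
import numpy as np

def fit_ordinal(train_col):
    """Map each distinct category in the training column to an integer index."""
    categories = sorted(set(train_col.tolist()))
    return {cat: idx for idx, cat in enumerate(categories)}

def apply_ordinal(col, mapping, unknown=-1):
    """Encode a column with a fitted mapping; unseen categories become `unknown`."""
    return np.array([mapping.get(v, unknown) for v in col.tolist()], dtype=np.int64)

# ArrayDict-style input: one array per split (toy data, purely illustrative).
C_data = {
    "train": np.array([["red"], ["blue"], ["red"]]),
    "test": np.array([["blue"], ["green"]]),
}

# Fit on the training split only, then reuse the mapping for every split.
mapping = fit_ordinal(C_data["train"][:, 0])
encoded = {part: apply_ordinal(arr[:, 0], mapping) for part, arr in C_data.items()}
```

Fitting on the training split alone keeps information from the validation and test splits out of the encoder.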
- TALENT.model.lib.data.data_label_process(y_data, is_regression, info=None, encoder=None)
Process the labels in the dataset.
- Parameters
y_data – ArrayDict
is_regression – bool
info – Optional[Dict[str, Any]]
encoder – Optional[LabelEncoder]
- Returns
Tuple[ArrayDict, Dict[str, Any], Optional[LabelEncoder]]
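As a rough illustration of what label processing usually involves, the sketch below standardizes regression targets with training-split statistics and maps classification labels to contiguous integers. `process_labels` is a hypothetical stand-in, and the contents of the returned info dictionary are assumptions; the actual behavior of `data_label_process` may differ.

```python
import numpy as np

def process_labels(y_data, is_regression):
    """Hypothetical stand-in: returns (processed labels, info dict)."""
    y_train = y_data["train"]
    if is_regression:
        # Standardize with statistics computed on the training split only.
        mean, std = float(y_train.mean()), float(y_train.std())
        processed = {part: (y - mean) / std for part, y in y_data.items()}
        return processed, {"mean": mean, "std": std}
    # Classification: map labels to 0..K-1.
    # Assumes every label in the other splits also appears in the training split.
    classes = np.unique(y_train)  # sorted distinct training labels
    processed = {part: np.searchsorted(classes, y) for part, y in y_data.items()}
    return processed, {"n_classes": len(classes)}

reg_y, reg_info = process_labels(
    {"train": np.array([1.0, 2.0, 3.0]), "test": np.array([2.0])}, is_regression=True
)
cls_y, cls_info = process_labels(
    {"train": np.array([3, 7, 3]), "test": np.array([7])}, is_regression=False
)
```

Keeping the mean and standard deviation in the info dictionary lets predictions be mapped back to the original target scale.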
- TALENT.model.lib.data.data_loader_process(is_regression, X, Y, y_info, device, batch_size, is_train, is_float=False)
Process the data loader.
- Parameters
is_regression – bool
X – Tuple[ArrayDict, ArrayDict]
Y – ArrayDict
y_info – Dict[str, Any]
device – torch.device
batch_size – int
is_train – bool
is_float – bool
- Returns
Tuple[ArrayDict, ArrayDict, ArrayDict, DataLoader, DataLoader, Callable]
- TALENT.model.lib.data.data_nan_process(N_data, C_data, num_nan_policy, cat_nan_policy, num_new_value=None, imputer=None, cat_new_value=None)
Process the NaN values in the dataset.
- Parameters
N_data – ArrayDict
C_data – ArrayDict
num_nan_policy – str
cat_nan_policy – str
num_new_value – Optional[np.ndarray]
imputer – Optional[SimpleImputer]
cat_new_value – Optional[str]
- Returns
Tuple[ArrayDict, ArrayDict, Optional[np.ndarray], Optional[SimpleImputer], Optional[str]]
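To make the role of `num_new_value` concrete, here is a minimal sketch of one possible numerical NaN policy: filling missing values with per-column means from the training split. `impute_numeric_mean` is a hypothetical helper, and mean imputation is an assumption; the policies that `data_nan_process` actually accepts are not listed here.

```python
import numpy as np

def impute_numeric_mean(N_data, num_new_value=None):
    """Fill NaNs column-wise with training means; reuse `num_new_value` if given."""
    if num_new_value is None:
        # Per-column mean of the training split, ignoring NaNs.
        num_new_value = np.nanmean(N_data["train"], axis=0)
    filled = {}
    for part, X in N_data.items():
        X = X.copy()  # leave the caller's arrays untouched
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = num_new_value[cols]
        filled[part] = X
    return filled, num_new_value

N_data = {
    "train": np.array([[1.0, np.nan], [3.0, 4.0]]),
    "test": np.array([[np.nan, 5.0]]),
}
filled, fill_values = impute_numeric_mean(N_data)
```

Returning the fill values means the same statistics can be applied to new data later, which mirrors the optional `num_new_value` argument in the signature above.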
- TALENT.model.lib.data.data_norm_process(N_data, normalization, seed, normalizer=None)
Process the normalization of the dataset.
- Parameters
N_data – ArrayDict
normalization – str
seed – int
normalizer – Optional[TransformerMixin]
- Returns
Tuple[ArrayDict, Optional[TransformerMixin]]
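The returned normalizer implies that normalization statistics are fitted once and then reused. The sketch below shows that idea with plain z-score standardization in NumPy. `standardize` is a hypothetical helper; it ignores the `seed` argument, and the normalization names accepted by `data_norm_process` are not assumed here.

```python
import numpy as np

def standardize(N_data, stats=None):
    """Z-score every split using training statistics; reuse `stats` if given."""
    if stats is None:
        mean = N_data["train"].mean(axis=0)
        std = N_data["train"].std(axis=0)
        std = np.where(std == 0, 1.0, std)  # avoid dividing by zero on constant columns
        stats = (mean, std)
    mean, std = stats
    return {part: (X - mean) / std for part, X in N_data.items()}, stats

N_data = {
    "train": np.array([[0.0, 10.0], [2.0, 10.0]]),
    "test": np.array([[1.0, 12.0]]),
}
normalized, stats = standardize(N_data)
```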
- TALENT.model.lib.data.dataname_to_numpy(dataset_name, dataset_path)
Load the dataset from the numpy files.
- Parameters
dataset_name – str
dataset_path – str
- Returns
Tuple[ArrayDict, ArrayDict, ArrayDict, Dict[str, Any]]
- TALENT.model.lib.data.get_categories(X_cat: Optional[Dict[str, Tensor]]) → Optional[List[int]]
Get the categories for each categorical feature.
- Parameters
X_cat – Optional[Dict[str, torch.Tensor]]
- Returns
Optional[List[int]]
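One plausible reading of the `List[int]` return value is the number of distinct values per categorical column. The sketch below implements that reading with NumPy arrays instead of torch tensors. `count_categories` is a hypothetical helper, and counting on the training split only is an assumption.

```python
from typing import Dict, List, Optional

import numpy as np

def count_categories(X_cat: Optional[Dict[str, np.ndarray]]) -> Optional[List[int]]:
    """Return the number of distinct values in each categorical column."""
    if X_cat is None:
        return None
    train = X_cat["train"]
    return [len(np.unique(train[:, i])) for i in range(train.shape[1])]

X_cat = {"train": np.array([[0, 1], [1, 1], [2, 0]])}
cardinalities = count_categories(X_cat)
```

Per-column cardinalities of this kind are typically what an embedding layer needs in order to size its lookup tables.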
- TALENT.model.lib.data.get_dataset(dataset_name, dataset_path)
Load the dataset from the numpy files.
- Parameters
dataset_name – str
dataset_path – str
- Returns
Tuple[ArrayDict, ArrayDict, ArrayDict, Dict[str, Any]]
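The four-element return value (numerical features, categorical features, labels, and an info dictionary) can be pictured with the sketch below. The file layout is an assumption made for illustration: one `<prefix>_<part>.npy` file per split plus an `info.json`. `load_split_arrays` is a hypothetical helper, not the library's loader.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

def load_split_arrays(folder, prefix, parts=("train", "val", "test")):
    """Load `<prefix>_<part>.npy` for each split; return None if no file exists."""
    folder = Path(folder)
    arrays = {
        part: np.load(folder / f"{prefix}_{part}.npy")
        for part in parts
        if (folder / f"{prefix}_{part}.npy").exists()
    }
    return arrays or None

# Build a tiny dataset on disk (assumed layout), then read it back.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    np.save(root / "N_train.npy", np.array([[0.5, 1.5]]))
    np.save(root / "y_train.npy", np.array([1]))
    (root / "info.json").write_text(json.dumps({"task_type": "binclass"}))

    N = load_split_arrays(root, "N")
    C = load_split_arrays(root, "C")  # no categorical files were written
    y = load_split_arrays(root, "y")
    info = json.loads((root / "info.json").read_text())
```

Returning `None` when no files exist matches the `Optional` feature dictionaries in the `Dataset` constructor signature.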
- TALENT.model.lib.data.load_json(path)
- TALENT.model.lib.data.num_enc_process(N_data, num_policy, n_bins=2, y_train=None, is_regression=False, encoder=None)
Process the numerical features in the dataset.
- Parameters
N_data – ArrayDict
num_policy – str
n_bins – int
y_train – Optional[np.ndarray]
is_regression – bool
encoder – Optional[PiecewiseLinearEncoding]
- Returns
Tuple[ArrayDict, Optional[PiecewiseLinearEncoding]]
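The `PiecewiseLinearEncoding` type in the signature points to bin-based encodings of numerical features. The following sketch shows a simplified piecewise linear encoding with quantile bin edges. `quantile_edges` and `piecewise_linear_encode` are hypothetical helpers, and the sketch assumes strictly increasing bin edges; it is not the library's encoder.

```python
import numpy as np

def quantile_edges(x_train, n_bins):
    """Bin edges at evenly spaced quantiles of the training column."""
    return np.quantile(x_train, np.linspace(0.0, 1.0, n_bins + 1))

def piecewise_linear_encode(x, edges):
    """Each bin contributes 0 below it, 1 above it, and a linear ramp inside it."""
    left, right = edges[:-1], edges[1:]
    ratio = (x[:, None] - left) / (right - left)  # shape: (n_samples, n_bins)
    return np.clip(ratio, 0.0, 1.0)

x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
edges = quantile_edges(x_train, n_bins=2)  # edges fitted on training data only
encoded = piecewise_linear_encode(np.array([1.0, 3.0]), edges)
```

Each scalar becomes a vector with one component per bin, so the encoding preserves ordering while giving a downstream model a richer input than the raw value.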
Classes
The Dataset class encapsulates the numerical features, categorical features, and labels of a dataset. It provides properties that identify the task type (binary classification, multiclass classification, or regression) and report the number of features.
Properties:
is_binclass: Returns True if the task is binary classification.
is_multiclass: Returns True if the task is multiclass classification.
is_regression: Returns True if the task is regression.
n_num_features: Number of numerical features.
n_cat_features: Number of categorical features.
n_features: Total number of features (numerical + categorical).
size(part): Method (not a property). Returns the size of a particular part of the dataset (e.g., ‘train’, ‘val’, ‘test’).
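The members listed above can be pictured with a small stand-in class. `DatasetSketch` is hypothetical: the `task_type` key and its string values are assumptions, and the feature counts are derived from array shapes rather than from whatever the real class uses.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

import numpy as np

@dataclass(frozen=True)
class DatasetSketch:
    N: Optional[Dict[str, np.ndarray]]  # numerical features per split
    C: Optional[Dict[str, np.ndarray]]  # categorical features per split
    y: Dict[str, np.ndarray]            # labels per split
    info: Dict[str, Any]                # metadata such as the task type

    @property
    def is_regression(self) -> bool:
        # Assumed key and value; the real metadata format may differ.
        return self.info["task_type"] == "regression"

    @property
    def n_num_features(self) -> int:
        return 0 if self.N is None else self.N["train"].shape[1]

    @property
    def n_cat_features(self) -> int:
        return 0 if self.C is None else self.C["train"].shape[1]

    @property
    def n_features(self) -> int:
        return self.n_num_features + self.n_cat_features

    def size(self, part: str) -> int:
        """Number of samples in the given split."""
        return len(self.y[part])

ds = DatasetSketch(
    N={"train": np.zeros((4, 3)), "test": np.zeros((2, 3))},
    C=None,
    y={"train": np.zeros(4), "test": np.zeros(2)},
    info={"task_type": "regression"},
)
```

Here `ds.n_features` combines both feature groups, and `ds.size("test")` is called with an argument, unlike the properties.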