LAMDA-TALENT Data Module

The data module handles datasets: loading them from disk, preprocessing, encoding, and normalization. It also provides helper functions for handling missing data.

class TALENT.model.lib.data.Dataset(N: Optional[Dict[str, numpy.ndarray]], C: Optional[Dict[str, numpy.ndarray]], y: Dict[str, numpy.ndarray], info: Dict[str, Any])

Bases: object

C: Optional[Dict[str, ndarray]]
N: Optional[Dict[str, ndarray]]
info: Dict[str, Any]
property is_binclass: bool
property is_multiclass: bool
property is_regression: bool
property n_cat_features: int
property n_features: int
property n_num_features: int
size(part: str) → int

Return the size of the dataset partition.

Args:

  • part: str

Returns: int

y: Dict[str, ndarray]
TALENT.model.lib.data.data_enc_process(N_data, C_data, cat_policy, y_train=None, ord_encoder=None, mode_values=None, cat_encoder=None)

Process the categorical features in the dataset.

Parameters
  • N_data – ArrayDict

  • C_data – ArrayDict

  • cat_policy – str

  • y_train – Optional[np.ndarray]

  • ord_encoder – Optional[OrdinalEncoder]

  • mode_values – Optional[List[int]]

  • cat_encoder – Optional[OneHotEncoder]

Returns

Tuple[ArrayDict, ArrayDict, Optional[OrdinalEncoder], Optional[List[int]], Optional[OneHotEncoder]]

TALENT.model.lib.data.data_label_process(y_data, is_regression, info=None, encoder=None)

Process the labels in the dataset.

Parameters
  • y_data – ArrayDict

  • is_regression – bool

  • info – Optional[Dict[str, Any]]

  • encoder – Optional[LabelEncoder]

Returns

Tuple[ArrayDict, Dict[str, Any], Optional[LabelEncoder]]
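The signature suggests that label statistics are computed once and passed back in through info for later calls. A minimal sketch of that idea for regression targets, using NumPy only, is shown here. The helper name standardize_labels and the "mean"/"std" keys are assumptions for illustration, not the library's confirmed behavior.

```python
import numpy as np

def standardize_labels(y_data, info=None):
    """Hypothetical helper: standardize regression targets in every split."""
    if info is None:
        # Fit the statistics on the training split only.
        info = {
            "mean": float(y_data["train"].mean()),
            "std": float(y_data["train"].std()),
        }
    # Apply the same statistics to every split.
    y = {
        part: (values.astype(float) - info["mean"]) / info["std"]
        for part, values in y_data.items()
    }
    return y, info

y_raw = {"train": np.array([1.0, 2.0, 3.0]), "val": np.array([2.0, 4.0])}
y_std, y_info = standardize_labels(y_raw)
```

Passing y_info into a later call would reuse the training statistics instead of recomputing them.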

TALENT.model.lib.data.data_loader_process(is_regression, X, Y, y_info, device, batch_size, is_train, is_float=False)

Prepare the data loaders for the dataset.

Parameters
  • is_regression – bool

  • X – Tuple[ArrayDict, ArrayDict]

  • Y – ArrayDict

  • y_info – Dict[str, Any]

  • device – torch.device

  • batch_size – int

  • is_train – bool

  • is_float – bool

Returns

Tuple[ArrayDict, ArrayDict, ArrayDict, DataLoader, DataLoader, Callable]

TALENT.model.lib.data.data_nan_process(N_data, C_data, num_nan_policy, cat_nan_policy, num_new_value=None, imputer=None, cat_new_value=None)

Process the NaN values in the dataset.

Parameters
  • N_data – ArrayDict

  • C_data – ArrayDict

  • num_nan_policy – str

  • cat_nan_policy – str

  • num_new_value – Optional[np.ndarray]

  • imputer – Optional[SimpleImputer]

  • cat_new_value – Optional[str]

Returns

Tuple[ArrayDict, ArrayDict, Optional[np.ndarray], Optional[SimpleImputer], Optional[str]]
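Because the function both accepts and returns the fill values, one plausible usage pattern is to compute them on the training split and reuse them elsewhere. The sketch below illustrates mean imputation for numerical features with NumPy only. The helper name impute_with_train_mean is hypothetical, and the policy names the library actually accepts are not shown here.

```python
import numpy as np

def impute_with_train_mean(N_data, new_value=None):
    """Hypothetical helper: fill NaNs column-wise with training-split means."""
    if new_value is None:
        # Column means of the training split, ignoring NaNs.
        new_value = np.nanmean(N_data["train"], axis=0)
    filled = {}
    for part, X in N_data.items():
        X = X.astype(float)  # astype returns a copy, so the input is untouched
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = new_value[cols]
        filled[part] = X
    return filled, new_value

N_raw = {
    "train": np.array([[1.0, np.nan], [3.0, 4.0]]),
    "test": np.array([[np.nan, 8.0]]),
}
N_filled, fill_values = impute_with_train_mean(N_raw)
```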

TALENT.model.lib.data.data_norm_process(N_data, normalization, seed, normalizer=None)

Normalize the numerical features of the dataset.

Parameters
  • N_data – ArrayDict

  • normalization – str

  • seed – int

  • normalizer – Optional[TransformerMixin]

Returns

Tuple[ArrayDict, Optional[TransformerMixin]]
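The optional normalizer argument and the returned normalizer suggest a fit-once, reuse-later pattern. A minimal sketch of that pattern follows, with a small NumPy class standing in for a fitted transformer. TrainStandardizer and normalize_parts are hypothetical names, and the seed parameter is not modeled.

```python
import numpy as np

class TrainStandardizer:
    """Hypothetical stand-in for a fitted normalizer."""

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        std = X.std(axis=0)
        # Avoid division by zero for constant columns.
        self.scale_ = np.where(std == 0, 1.0, std)
        return self

    def transform(self, X):
        return (X - self.mean_) / self.scale_

def normalize_parts(N_data, normalizer=None):
    if normalizer is None:
        # Fit on the training split only.
        normalizer = TrainStandardizer().fit(N_data["train"])
    normalized = {part: normalizer.transform(X) for part, X in N_data.items()}
    return normalized, normalizer

N_raw = {
    "train": np.array([[0.0, 10.0], [2.0, 10.0]]),
    "test": np.array([[4.0, 12.0]]),
}
N_norm, fitted = normalize_parts(N_raw)
```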

TALENT.model.lib.data.dataname_to_numpy(dataset_name, dataset_path)

Load the dataset from the numpy files.

Parameters
  • dataset_name – str

  • dataset_path – str

Returns

Tuple[ArrayDict, ArrayDict, ArrayDict, Dict[str, Any]]
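For illustration, loading of this kind could look roughly like the sketch below. The folder layout (one .npy file per feature kind and split, plus an info.json file) and the helper name load_parts are assumptions, not something this page specifies.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

def load_parts(dataset_name, dataset_path, parts=("train", "val", "test")):
    """Hypothetical loader; the file naming scheme is an assumption."""
    folder = Path(dataset_path) / dataset_name

    def load_kind(kind):
        files = {part: folder / f"{kind}_{part}.npy" for part in parts}
        # Return None when this kind of data is absent (e.g. no categorical features).
        if not all(f.exists() for f in files.values()):
            return None
        return {part: np.load(f) for part, f in files.items()}

    info = json.loads((folder / "info.json").read_text())
    return load_kind("N"), load_kind("C"), load_kind("y"), info

# Usage: write a tiny dataset to a temporary folder and read it back.
with tempfile.TemporaryDirectory() as root:
    folder = Path(root) / "toy"
    folder.mkdir()
    for part, n in [("train", 4), ("val", 2), ("test", 2)]:
        np.save(folder / f"N_{part}.npy", np.zeros((n, 3)))
        np.save(folder / f"y_{part}.npy", np.zeros(n))
    (folder / "info.json").write_text(json.dumps({"task_type": "regression"}))
    N, C, y, info = load_parts("toy", root)
```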

TALENT.model.lib.data.get_categories(X_cat: Optional[Dict[str, Tensor]]) → Optional[List[int]]

Get the categories for each categorical feature.

Parameters

X_cat – Optional[Dict[str, torch.Tensor]]

Returns

Optional[List[int]]

TALENT.model.lib.data.get_dataset(dataset_name, dataset_path)

Load the dataset from the numpy files.

Parameters
  • dataset_name – str

  • dataset_path – str

Returns

Tuple[ArrayDict, ArrayDict, ArrayDict, Dict[str, Any]]

TALENT.model.lib.data.load_json(path)
TALENT.model.lib.data.num_enc_process(N_data, num_policy, n_bins=2, y_train=None, is_regression=False, encoder=None)

Process the numerical features in the dataset.

Parameters
  • N_data – ArrayDict

  • num_policy – str

  • n_bins – int

  • y_train – Optional[np.ndarray]

  • is_regression – bool

  • encoder – Optional[PiecewiseLinearEncoding]

Returns

Tuple[ArrayDict, Optional[PiecewiseLinearEncoding]]

TALENT.model.lib.data.raise_unknown(unknown_what: str, unknown_value: Any)
TALENT.model.lib.data.to_tensors(data: Dict[str, ndarray]) → Dict[str, Tensor]

Convert the numpy arrays to torch tensors.

Parameters

data – ArrayDict

Returns

Dict[str, torch.Tensor]

Classes

The Dataset class encapsulates the numerical features, categorical features, and labels of a dataset. It provides properties that identify the task type (binary classification, multiclass classification, or regression) and report the number of features.

Properties:

  • is_binclass: Returns True if the task is binary classification.

  • is_multiclass: Returns True if the task is multiclass classification.

  • is_regression: Returns True if the task is regression.

  • n_num_features: Number of numerical features.

  • n_cat_features: Number of categorical features.

  • n_features: Total number of features (numerical + categorical).

  • size(part): A method rather than a property; returns the size of the given dataset partition (e.g., ‘train’, ‘val’, ‘test’).
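The members listed above can be sketched as a small dataclass. This is an illustrative stand-in, not the library's implementation: the class name DatasetSketch, the "task_type" key in info, and the choice to derive feature counts from the training arrays are all assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

import numpy as np

ArrayDict = Dict[str, np.ndarray]

@dataclass
class DatasetSketch:
    N: Optional[ArrayDict]  # numerical features per partition
    C: Optional[ArrayDict]  # categorical features per partition
    y: ArrayDict            # labels per partition
    info: Dict[str, Any]

    @property
    def is_regression(self) -> bool:
        # Assumed key and value; the real info layout may differ.
        return self.info["task_type"] == "regression"

    @property
    def n_num_features(self) -> int:
        return 0 if self.N is None else self.N["train"].shape[1]

    @property
    def n_cat_features(self) -> int:
        return 0 if self.C is None else self.C["train"].shape[1]

    @property
    def n_features(self) -> int:
        return self.n_num_features + self.n_cat_features

    def size(self, part: str) -> int:
        # Number of samples in the requested partition.
        return len(self.y[part])

ds = DatasetSketch(
    N={"train": np.zeros((5, 3)), "val": np.zeros((2, 3))},
    C=None,
    y={"train": np.zeros(5), "val": np.zeros(2)},
    info={"task_type": "regression"},
)
```

Here ds.n_features counts only numerical columns because C is None, and ds.size("val") reads the length of the validation labels.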

Functions