Models

Deep learning models for tabular data, implementing various state-of-the-art architectures.

This section documents the neural network architectures implemented in TALENT, ranging from simple MLPs to transformer-based models designed specifically for tabular data. Each entry lists the constructor parameters, describes the forward pass, and gives the mathematical formulation of the architecture.

Basic Neural Networks

Multi-Layer Perceptron (MLP)

class TALENT.model.models.mlp.MLP(*args: Any, **kwargs: Any)

Bases: Module

forward(x, x_cat=None)
class TALENT.model.models.mlp.MLP

Simple feedforward neural network with multiple fully connected layers and ReLU activations.

Mathematical Formulation:

For input \(x \in \mathbb{R}^{d_{in}}\), the MLP computes:

\[\begin{split}h_0 &= x \\ h_i &= \text{ReLU}(\text{Linear}(h_{i-1})) = \text{ReLU}(W_i h_{i-1} + b_i) \\ \text{output} &= W_{\text{head}} h_L + b_{\text{head}}\end{split}\]

where \(L\) is the number of hidden layers.
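The recurrence above can be traced in a few lines of plain Python. This is a minimal, framework-free sketch rather than the TALENT implementation: the helper names (`linear`, `init_linear`, `mlp_forward`) and the random initialisation are assumptions made for illustration, and dropout is left out.

```python
import random

def linear(x, weight, bias):
    """Affine map; weight holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_linear(d_in, d_out, rng):
    """Small random weights and zero biases (illustrative initialisation only)."""
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(d_in)] for _ in range(d_out)]
    return weight, [0.0] * d_out

def mlp_forward(x, hidden, head):
    h = x                                  # h_0 = x
    for weight, bias in hidden:            # h_i = ReLU(W_i h_{i-1} + b_i)
        h = relu(linear(h, weight, bias))
    return linear(h, *head)                # output = W_head h_L + b_head

rng = random.Random(0)
d_in, d_layers, d_out = 4, [8, 3], 2
dims = [d_in] + d_layers
hidden = [init_linear(a, b, rng) for a, b in zip(dims[:-1], dims[1:])]
head = init_linear(dims[-1], d_out, rng)
output = mlp_forward([0.1, -0.2, 0.3, 0.4], hidden, head)
```

Here `d_layers = [8, 3]` yields two hidden layers (4 to 8, then 8 to 3), and the head maps the last hidden width to `d_out`.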

__init__(d_in, d_out, d_layers, dropout)

Initialize the MLP architecture.

Parameters:

  • d_in (int) – Input feature dimension

  • d_out (int) – Output dimension (number of classes for classification, 1 for regression)

  • d_layers (List[int]) – Hidden layer dimensions, e.g., [64, 32] for two hidden layers

  • dropout (float) – Dropout probability applied after each hidden layer

Architecture Construction:

  1. Hidden Layers: Creates nn.Linear layers with dimensions specified in d_layers

  2. Output Head: Final linear layer mapping to output dimension

  3. Dropout Setup: Configures dropout for regularization during training

forward(x, x_cat=None)

Forward pass through the MLP network.

Parameters:

  • x (torch.Tensor) – Input numerical features of shape (batch_size, d_in)

  • x_cat (torch.Tensor, optional) – Categorical features (not used in MLP, maintained for interface consistency)

Returns:

  • torch.Tensor – Output predictions of shape (batch_size, d_out), or (batch_size,) when d_out == 1 (the trailing dimension is squeezed, as in regression)

Forward Pass Implementation:

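# Body of MLP.forward; F is torch.nn.functional
for layer in self.layers:
    x = layer(x)  # Linear: x = x @ W.T + b (nn.Linear stores W as (out, in))
    x = F.relu(x)  # ReLU: x = max(0, x)
    if self.dropout:
        x = F.dropout(x, self.dropout, self.training)  # No-op in eval mode

logit = self.head(x)  # Final output layer
if self.d_out == 1:
    logit = logit.squeeze(-1)  # (batch_size, 1) -> (batch_size,), e.g. regression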

ReLU Activation:

\[\text{ReLU}(x) = \max(0, x)\]

Dropout Regularization:

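During training, each element is independently set to zero with probability \(p\) (the dropout argument), and the surviving elements are scaled by \(1/(1-p)\) so that the expected activation is unchanged. At evaluation time dropout is the identity: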

\[\begin{split}\text{Dropout}(x) = \begin{cases} \frac{x}{1-p} & \text{with probability } 1-p \\ 0 & \text{with probability } p \end{cases}\end{split}\]
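The scaling by \(1/(1-p)\) is what keeps the expected activation unchanged. A small sketch in plain Python illustrates this; the function name `inverted_dropout` is hypothetical, and this is not the PyTorch implementation.

```python
import random

def inverted_dropout(x, p, training, rng):
    """Zero each element with probability p and rescale the survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(x)                      # identity at evaluation time
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in x]

rng = random.Random(0)
x = [1.0] * 10_000
dropped = inverted_dropout(x, p=0.2, training=True, rng=rng)
mean_train = sum(dropped) / len(dropped)            # stays close to 1.0
zero_fraction = dropped.count(0.0) / len(dropped)   # stays close to p
same_at_eval = inverted_dropout(x, p=0.2, training=False, rng=rng)
```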

Residual Network (ResNet)

class TALENT.model.models.resnet.ResNet(*args: Any, **kwargs: Any)

Bases: Module

forward(x: torch.Tensor, x_cat: torch.Tensor) torch.Tensor
TALENT.model.models.resnet.geglu(x)
TALENT.model.models.resnet.get_activation_fn(name)
TALENT.model.models.resnet.get_nonglu_activation_fn(name)
TALENT.model.models.resnet.reglu(x)
class TALENT.model.models.resnet.ResNet

Deep residual network for tabular data. Skip connections let gradients bypass each block, which mitigates vanishing gradients in deep architectures.

Mathematical Formulation:

ResNet uses residual blocks with skip connections:

\[h_{i+1} = h_i + F(h_i, W_i)\]

where \(F(h_i, W_i)\) is the residual function.

__init__(d_in, d, d_hidden_factor, n_layers, activation, normalization, hidden_dropout, residual_dropout, d_out)

Initialize the ResNet architecture with configurable components.

Parameters:

  • d_in (int) – Input feature dimension

  • d (int) – Hidden dimension for residual blocks

  • d_hidden_factor (float) – Factor to scale hidden layer width within blocks

  • n_layers (int) – Number of residual blocks

  • activation (str) – Activation function (‘relu’, ‘gelu’, ‘reglu’, ‘geglu’)

  • normalization (str) – Normalization type (‘batchnorm’, ‘layernorm’)

  • hidden_dropout (float) – Dropout probability within residual blocks

  • residual_dropout (float) – Dropout probability for residual connections

  • d_out (int) – Output dimension

forward(x, x_cat=None)

Forward pass through the ResNet architecture.

Parameters:

  • x (torch.Tensor) – Input numerical features

  • x_cat (torch.Tensor, optional) – Categorical features (not used)

Returns:

  • torch.Tensor – Output predictions

Residual Block Mathematical Implementation:

For each residual block, the computation follows:

\[\begin{split}\text{residual} &= \text{Norm}(h_i) \\ \text{residual} &= \text{Linear}(\text{residual}) \\ \text{residual} &= \text{Activation}(\text{residual}) \\ \text{residual} &= \text{Dropout}(\text{residual}) \\ \text{residual} &= \text{Linear}(\text{residual}) \\ \text{residual} &= \text{Dropout}(\text{residual}) \\ h_{i+1} &= h_i + \text{residual}\end{split}\]
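The block structure can be sketched without any framework. The example below is a minimal sketch that assumes layer normalization and ReLU, and it omits both dropout steps, which is their behaviour at evaluation time. All helper names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, weight, bias):
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def residual_block(h, w1, b1, w2, b2):
    r = layer_norm(h)                      # Norm
    r = linear(r, w1, b1)                  # Linear: d -> d_hidden
    r = [max(0.0, v) for v in r]           # Activation (ReLU)
    r = linear(r, w2, b2)                  # Linear: d_hidden -> d
    return [a + b for a, b in zip(h, r)]   # h_{i+1} = h_i + residual

d, d_hidden = 3, 6
h = [0.5, -1.0, 2.0]
w1 = [[0.1 * (i + j) for j in range(d)] for i in range(d_hidden)]
b1 = [0.0] * d_hidden
w2 = [[0.05 * (i - j) for j in range(d_hidden)] for i in range(d)]
w2_zero = [[0.0] * d_hidden for _ in range(d)]
b2 = [0.0] * d

out = residual_block(h, w1, b1, w2, b2)
out_identity = residual_block(h, w1, b1, w2_zero, b2)
```

When the second linear layer outputs zeros, the block reduces to the identity, which is why stacking many such blocks does not by itself degrade the signal.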

Activation Functions:

  • ReLU: \(\text{ReLU}(x) = \max(0, x)\)

  • GELU: \(\text{GELU}(x) = x \cdot \Phi(x)\), where \(\Phi\) is the standard normal cumulative distribution function

  • ReGLU: \(\text{ReGLU}(x) = a \cdot \text{ReLU}(b)\) where \(a, b = \text{split}(x)\)

  • GeGLU: \(\text{GeGLU}(x) = a \cdot \text{GELU}(b)\) where \(a, b = \text{split}(x)\)

reglu(x)

ReGLU activation function for gated linear units.

Mathematical Definition:

\[\text{ReGLU}(x) = a \cdot \text{ReLU}(b)\]

where \(a\) and \(b\) are obtained by splitting \(x\) along the last dimension.

geglu(x)

GeGLU activation function combining gating with GELU.

Mathematical Definition:

\[\text{GeGLU}(x) = a \cdot \text{GELU}(b)\]

where \(a\) and \(b\) are obtained by splitting \(x\) along the last dimension.
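Both gated activations can be illustrated on a single feature vector. The sketch below uses plain Python lists instead of tensors and the exact (erf-based) GELU. It is an illustration of the formulas, not the library code.

```python
import math

def gelu(v):
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def reglu(x):
    half = len(x) // 2
    a, b = x[:half], x[half:]              # split along the last dimension
    return [ai * max(0.0, bi) for ai, bi in zip(a, b)]

def geglu(x):
    half = len(x) // 2
    a, b = x[:half], x[half:]
    return [ai * gelu(bi) for ai, bi in zip(a, b)]

x = [1.0, 2.0, 3.0, -1.0]                  # a = [1, 2], b = [3, -1]
reglu_out = reglu(x)
geglu_out = geglu(x)
```

Because the input is split in two, the output has half the input width, so a layer that feeds a gated activation has to produce twice as many features as the activation returns.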

Self-Normalizing Network (SNN)

class TALENT.model.models.snn.SNN(*args: Any, **kwargs: Any)

Bases: Module

calculate_output(x: torch.Tensor) torch.Tensor
property d_embedding: int
encode(x_num, x_cat)
forward(x_num: torch.Tensor, x_cat) torch.Tensor
class TALENT.model.models.snn.SNN

Lightweight neural network with self-normalizing properties using SELU activation.

__init__(d_in, d_out, d_layers, dropout)

Initialize SNN with SELU activations for self-normalization.

Parameters:

  • d_in (int) – Input dimension

  • d_out (int) – Output dimension

  • d_layers (List[int]) – Hidden layer dimensions

  • dropout (float) – Dropout probability

forward(x, x_cat=None)

Forward pass with SELU activation for self-normalization.

SELU Activation Mathematical Definition:

\[\begin{split}\text{SELU}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases}\end{split}\]

where \(\lambda \approx 1.0507\) and \(\alpha \approx 1.6733\).

Self-Normalization Property:

For normalized inputs, SELU keeps the activations close to a fixed point as they propagate through the layers:

  • Mean converges to 0

  • Variance converges to 1

This enables training of very deep networks without explicit normalization.
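The fixed point can be checked numerically. The following sketch applies SELU to samples from a standard normal distribution and measures the mean and variance of the result. It is a Monte Carlo illustration in plain Python, and the constants are written out with more digits than the rounded values quoted above.

```python
import math
import random

SELU_LAMBDA = 1.0507009873554805
SELU_ALPHA = 1.6732632423543772

def selu(v):
    if v > 0:
        return SELU_LAMBDA * v
    return SELU_LAMBDA * SELU_ALPHA * (math.exp(v) - 1.0)

rng = random.Random(0)
samples = [selu(rng.gauss(0.0, 1.0)) for _ in range(50_000)]
mean = sum(samples) / len(samples)                               # close to 0
variance = sum((s - mean) ** 2 for s in samples) / len(samples)  # close to 1
```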

Transformer-Based Models

Feature Tokenizer Transformer (FT-Transformer)

class TALENT.model.models.ftt.MultiheadAttention(*args: Any, **kwargs: Any)

Bases: Module

forward(x_q: torch.Tensor, x_kv: torch.Tensor, key_compression: Optional[torch.nn.Linear], value_compression: Optional[torch.nn.Linear]) torch.Tensor
class TALENT.model.models.ftt.Tokenizer(*args: Any, **kwargs: Any)

Bases: Module

category_offsets: Optional[torch.Tensor]
forward(x_num: torch.Tensor, x_cat: Optional[torch.Tensor]) torch.Tensor
property n_tokens: int
class TALENT.model.models.ftt.Transformer(*args: Any, **kwargs: Any)

Bases: Module

Transformer.

References:

  • https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html

  • https://github.com/facebookresearch/pytext/tree/master/pytext/models/representations/transformer

  • https://github.com/pytorch/fairseq/blob/1bba712622b8ae4efb3eb793a8a40da386fe11d0/examples/linformer/linformer_src/modules/multihead_linear_attention.py#L19

forward(x_num: torch.Tensor, x_cat: Optional[torch.Tensor]) torch.Tensor
TALENT.model.models.ftt.geglu(x)
TALENT.model.models.ftt.get_activation_fn(name)
TALENT.model.models.ftt.get_nonglu_activation_fn(name)
TALENT.model.models.ftt.reglu(x)
class TALENT.model.models.ftt.Transformer

Advanced transformer architecture specifically designed for tabular data with feature tokenization.

Mathematical Formulation:

Feature Tokenization:

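For numerical features: \(t_i^{\text{num}} = x_i W_i^{\text{num}} + b_i^{\text{num}}\), where \(x_i\) is the scalar value of feature \(i\) and \(W_i^{\text{num}}, b_i^{\text{num}} \in \mathbb{R}^{d_{\text{token}}}\) are its weight and bias vectors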

For categorical features: \(t_i^{\text{cat}} = \text{Embedding}(x_i^{\text{cat}})\)

__init__(d_numerical, categories, d_token, n_layers, n_heads, d_ffn_factor, attention_dropout, ffn_dropout, residual_dropout, activation, prenormalization, d_out)

Initialize the FT-Transformer architecture.

Parameters:

  • d_numerical (int) – Number of numerical features

  • categories (List[int], optional) – Cardinalities for categorical features

  • d_token (int) – Token embedding dimension

  • n_layers (int) – Number of transformer layers

  • n_heads (int) – Number of attention heads

  • d_ffn_factor (float) – Factor for feed-forward network dimension

  • attention_dropout (float) – Dropout for attention weights

  • ffn_dropout (float) – Dropout for feed-forward network

  • residual_dropout (float) – Dropout for residual connections

  • activation (str) – Activation function for FFN

  • prenormalization (bool) – Whether to use pre-normalization

  • d_out (int) – Output dimension

forward(x_num, x_cat)

Forward pass through the transformer.

Parameters:

  • x_num (torch.Tensor, optional) – Numerical features of shape (batch_size, d_numerical)

  • x_cat (torch.Tensor, optional) – Categorical features of shape (batch_size, n_categorical)

Returns:

  • torch.Tensor – Output predictions

Transformer Processing Pipeline:

  1. Tokenization: Convert features to tokens using Tokenizer

  2. CLS Token Addition: Prepend classification token

  3. Transformer Layers: Apply multi-head attention and feed-forward networks

  4. Output Generation: Use CLS token representation for final prediction

Transformer Layer Mathematical Implementation:

For each transformer layer:

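\[\begin{split}\text{attn\_out} &= \text{MultiHeadAttention}(x, x, x) \\ x &= \text{LayerNorm}(x + \text{attn\_out}) \\ \text{ffn\_out} &= \text{FFN}(x) \\ x &= \text{LayerNorm}(x + \text{ffn\_out})\end{split}\]

This is the post-normalization form. With prenormalization=True, LayerNorm is instead applied to the input of each sub-layer (attention and FFN) rather than after the residual sum.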
class TALENT.model.models.ftt.Tokenizer

Converts numerical and categorical features into token embeddings for transformer processing.

__init__(d_numerical, categories, d_token, bias)

Initialize the feature tokenizer.

Parameters:

  • d_numerical (int) – Number of numerical features

  • categories (List[int], optional) – Cardinalities of categorical features

  • d_token (int) – Token embedding dimension

  • bias (bool) – Whether to use bias in tokenization

forward(x_num, x_cat)

Convert features to token embeddings.

Tokenization Process:

Numerical Features:

\[\text{tokens}_{\text{num}} = x_{\text{num}} W_{\text{num}} + b_{\text{num}}\]

Categorical Features:

\[\text{tokens}_{\text{cat}} = \text{Embedding}(x_{\text{cat}} + \text{offsets})\]

CLS Token:

\[\text{tokens}_{\text{cls}} = W_{\text{cls}}\]
property n_tokens

Total number of tokens (numerical + categorical + CLS).

Returns:

  • int – Total token count
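The tokenization steps can be followed on a single sample. This is a minimal sketch in plain Python and makes several assumptions: each numerical feature has its own weight and bias vector of length d_token, all categorical features share one embedding table indexed through offsets, and the CLS token is placed first. The function name `tokenize` and all values are hypothetical.

```python
def tokenize(x_num, x_cat, weight, bias, embeddings, offsets, cls_token):
    tokens = [list(cls_token)]                            # CLS token
    for i, value in enumerate(x_num):                     # one token per numerical feature
        tokens.append([value * w + b for w, b in zip(weight[i], bias[i])])
    for j, category in enumerate(x_cat):                  # one token per categorical feature
        tokens.append(list(embeddings[category + offsets[j]]))
    return tokens

d_token = 2
categories = [3, 2]                  # cardinalities of two categorical features
offsets = [0, 3]                     # start row of each feature in the shared table
weight = [[1.0, 0.5], [2.0, -1.0]]   # one d_token vector per numerical feature
bias = [[0.0, 0.0], [0.1, 0.1]]
embeddings = [[float(r), float(-r)] for r in range(sum(categories))]
cls_token = [9.0, 9.0]

tokens = tokenize([2.0, 3.0], [1, 1], weight, bias, embeddings, offsets, cls_token)
n_tokens = len(tokens)               # numerical + categorical + CLS
```

The offsets keep the two categorical features apart: both have the raw value 1, but they read rows 1 and 4 of the shared table.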

class TALENT.model.models.ftt.MultiheadAttention

Multi-head attention mechanism optimized for tabular data.

__init__(d, n_heads, dropout, bias)

Initialize multi-head attention.

Parameters:

  • d (int) – Input dimension

  • n_heads (int) – Number of attention heads

  • dropout (float) – Attention dropout probability

  • bias (bool) – Whether to use bias in projections

forward(x_q, x_kv, key_compression, value_compression)

Compute multi-head attention.

Parameters:

  • x_q (torch.Tensor) – Query input

  • x_kv (torch.Tensor) – Key and value input

  • key_compression (nn.Linear, optional) – Key compression layer

  • value_compression (nn.Linear, optional) – Value compression layer

Returns:

  • torch.Tensor – Attention output

Multi-Head Attention Mathematical Implementation:

  1. Linear Projections:

    \[Q = x_q W^Q, \quad K = x_{kv} W^K, \quad V = x_{kv} W^V\]
  2. Scaled Dot-Product Attention:

    \[\text{attention} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)\]
  3. Output Computation:

    \[\text{output} = \text{attention} \cdot V\]
  4. Multi-Head Combination:

    \[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O\]
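Steps 2 and 3 can be sketched for a single head in plain Python. The projections \(W^Q, W^K, W^V, W^O\) are taken as identity here to keep the example short, which is a simplification made for illustration. In the multi-head case each head works on its own slice of width \(d_k = d / n_{\text{heads}}\) and the head outputs are concatenated.

```python
import math

def softmax(row):
    m = max(row)                                   # subtract the max for stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    d_k = len(k[0])
    scores = [
        [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
        for qi in q
    ]
    weights = [softmax(row) for row in scores]     # one distribution per query
    output = [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(len(v[0]))]
        for row in weights
    ]
    return output, weights

q = [[1.0, 0.0], [0.0, 1.0]]                       # 2 query tokens
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]           # 3 key tokens
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]           # 3 value tokens
output, weights = attention(q, k, v)
```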

Advanced Tabular Models

TabNet

class TALENT.model.models.tabnet.TabNetClassifier(*args: Any, **kwargs: Any)

Bases: TabModel

compute_loss(y_pred, y_true)

Compute the loss.

Parameters

  • y_pred (torch.Tensor) – Score matrix

  • y_true (torch.Tensor) – Target matrix

Returns

Loss value

Return type

float

predict_func(outputs)
predict_proba(X)

Make predictions for classification on a batch (valid)

Parameters

X (torch.Tensor or scipy.sparse.csr_matrix) – Input data

Returns

res – Predicted class probabilities

Return type

np.ndarray

prepare_target(y)

Prepare target before training.

Parameters

y (torch.Tensor) – Target matrix.

Returns

Converted target matrix.

Return type

torch.Tensor

stack_batches(list_y_true, list_y_score)
update_fit_params(X_train, y_train, eval_set, weights)

Set attributes relative to fit function.

Parameters

  • X_train (np.ndarray) – Train set

  • y_train (np.ndarray) – Train targets

  • eval_set (list of tuple) – List of eval tuple set (X, y).

  • weights (bool or dictionary) – 0 for no balancing, 1 for automated balancing

weight_updater(weights)

Updates weights dictionary according to target_mapper.

Parameters

weights (bool or dict) – Given weights for balancing training.

Returns

Same bool if weights are bool, updated dict otherwise.

Return type

bool or dict

class TALENT.model.models.tabnet.TabNetRegressor(*args: Any, **kwargs: Any)

Bases: TabModel

compute_loss(y_pred, y_true)

Compute the loss.

Parameters

  • y_pred (torch.Tensor) – Score matrix

  • y_true (torch.Tensor) – Target matrix

Returns

Loss value

Return type

float

predict_func(outputs)
prepare_target(y)

Prepare target before training.

Parameters

y (torch.Tensor) – Target matrix.

Returns

Converted target matrix.

Return type

torch.Tensor

stack_batches(list_y_true, list_y_score)
update_fit_params(X_train, y_train, eval_set, weights)

Set attributes relative to fit function.

Parameters

  • X_train (np.ndarray) – Train set

  • y_train (np.ndarray) – Train targets

  • eval_set (list of tuple) – List of eval tuple set (X, y).

  • weights (bool or dictionary) – 0 for no balancing, 1 for automated balancing

class TALENT.model.models.tabnet.TabNetClassifier

Interpretable deep learning model with sequential attention mechanism for classification.

Mathematical Formulation:

TabNet uses sequential feature selection through sparsemax attention:

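Feature Selection at Step i:

\[M^{[i]} = \text{sparsemax}\left(P^{[i-1]} \odot \text{AttentiveTransformer}(a^{[i-1]})\right)\]

where \(a^{[i-1]}\) is the representation passed on from the previous step and \(P^{[i-1]}\) is the prior scale, which records how much each feature has already been used.

Feature Processing:

\[\begin{split}P^{[i]} &= \prod_{j=1}^{i} \left(\gamma - M^{[j]}\right) \\ \text{masked}^{[i]} &= M^{[i]} \odot f\end{split}\]

where \(f\) are the input features, the masked features are passed to the feature transformer of step \(i\), and \(\gamma\) is the relaxation parameter (\(\gamma = 1\) restricts each feature to a single step; larger values allow reuse across steps).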
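Sparsemax is what makes the masks sparse: unlike softmax, it can assign exactly zero weight to a feature. The sketch below is a plain-Python version of the standard sort-and-threshold algorithm for a single vector of scores. It is an illustration, not the implementation used by the library.

```python
def sparsemax(z):
    """Euclidean projection of z onto the probability simplex."""
    z_sorted = sorted(z, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, value in enumerate(z_sorted, start=1):
        cumulative += value
        if 1.0 + k * value > cumulative:       # entry k is still in the support
            tau = (cumulative - 1.0) / k       # threshold for a support of size k
    return [max(v - tau, 0.0) for v in z]

sparse_mask = sparsemax([1.0, 0.0, -1.0])      # a single feature is selected
dense_mask = sparsemax([0.5, 0.4])             # close scores share the weight
```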

__init__(n_steps, gamma, n_independent, n_shared, momentum, optimizer_params, scheduler_params, mask_type, lambda_sparse, seed)

Initialize TabNet classifier.

Parameters:

  • n_steps (int) – Number of decision steps

  • gamma (float) – Relaxation parameter for feature selection

  • n_independent (int) – Number of independent GLU layers per step

  • n_shared (int) – Number of shared GLU layers

  • momentum (float) – Momentum for batch normalization

  • optimizer_params (dict) – Optimizer configuration

  • scheduler_params (dict) – Learning rate scheduler parameters

  • mask_type (str) – Type of attention mask (‘sparsemax’ or ‘entmax’)

  • lambda_sparse (float) – Sparsity regularization coefficient

  • seed (int) – Random seed

fit(X_train, y_train, eval_set, eval_name, eval_metric, max_epochs, patience, batch_size, virtual_batch_size, num_workers, drop_last, callbacks)

Train the TabNet model.

Training Process:

  1. Data Preprocessing: Handle categorical encoding and normalization

  2. End-to-End Training: Optimize all decision steps jointly with mini-batch gradient descent

  3. Attention Regularization: Apply sparsity constraints on attention masks

  4. Early Stopping: Monitor validation metrics for convergence

predict_proba(X)

Make probability predictions for classification.

Parameters:

  • X (torch.Tensor or scipy.sparse matrix) – Input features

Returns:

  • np.ndarray – Class probabilities of shape (n_samples, n_classes)

Prediction Process:

  1. Forward Pass: Process through all decision steps

  2. Attention Aggregation: Combine attention from all steps

  3. Softmax Application: Convert logits to probabilities

\[P(y=k|x) = \frac{\exp(o_k)}{\sum_{j=1}^K \exp(o_j)}\]

where \(o_k\) is the raw output for class \(k\).

explain(X, normalize)

Generate feature importance explanations using attention masks.

Parameters:

  • X (torch.Tensor) – Input features

  • normalize (bool) – Whether to normalize importance scores

Returns:

  • np.ndarray – Feature importance matrix

Explanation Generation:

Attention masks from each decision step provide interpretable feature importance:

\[\text{importance}_{ij} = \frac{M^{[i]}_j}{\sum_{k=1}^{n_{\text{features}}} M^{[i]}_k}\]
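The normalization divides each mask by its sum over features. A minimal sketch with made-up, unnormalized mask values (the function name `normalize_mask` is hypothetical):

```python
def normalize_mask(mask):
    """Rescale one mask so that its entries sum to 1."""
    total = sum(mask)
    if total == 0:
        return [0.0] * len(mask)               # nothing selected, nothing to rescale
    return [m / total for m in mask]

step_masks = [
    [0.0, 3.0, 1.0],                           # step 1: features 2 and 3 used
    [2.0, 2.0, 0.0],                           # step 2: features 1 and 2 used
]
importance = [normalize_mask(mask) for mask in step_masks]
```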
class TALENT.model.models.tabnet.TabNetRegressor

TabNet for regression tasks with mean squared error optimization.

compute_loss(y_pred, y_true)

Compute mean squared error loss for regression.

MSE Loss Mathematical Definition:

\[\mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2\]

Tree-Based Neural Models

GRANDE (Gradient-Based Decision Tree Ensembles)

class TALENT.model.models.grande.GRANDE(*args: Any, **kwargs: Any)

Bases: Module

apply_preprocessing(X)
build_model()
entmax15(inputs, axis=-1)
entmax_threshold_and_support(inputs, axis=-1)
forward(inputs)
preprocess_data(X_train, y_train, X_val, y_val)
set_params(**kwargs)
class TALENT.model.models.grande.GRANDE

Ensemble of decision trees represented as a neural network and trained end-to-end with gradient descent.

Mathematical Formulation:

GRANDE simulates decision trees using neural operations with entmax for sparse selection.

__init__(batch_size, task_type, depth, n_estimators, dropout)

Initialize GRANDE model.

Parameters:

  • batch_size (int) – Training batch size

  • task_type (str) – ‘classification’ or ‘regression’

  • depth (int) – Maximum tree depth

  • n_estimators (int) – Number of tree estimators

  • dropout (float) – Dropout probability

forward(inputs)

Forward pass through the GRANDE ensemble.

Parameters:

  • inputs (torch.Tensor) – Input features

Returns:

  • torch.Tensor – Ensemble predictions

Tree Simulation Mathematical Implementation:

  1. Split Decision Computation:

    \[\text{node\_result} = \frac{\text{softsign}(s_1 - s_2) + 1}{2}\]

    where \(s_1\) are learned split thresholds and \(s_2\) are feature values.

  2. Path Probability Calculation:

    \[p = \prod_{j} ((1-\text{path\_id}_j) \cdot \text{node\_result}_j + \text{path\_id}_j \cdot (1-\text{node\_result}_j))\]
  3. Ensemble Output for Regression:

    \[\text{output} = \sum_{e,l} w_e \cdot p_{e,l} \cdot v_{e,l}\]

    where \(w_e\) are estimator weights, \(p_{e,l}\) are leaf probabilities, and \(v_{e,l}\) are leaf values.

  4. Ensemble Output for Classification:

    \[\text{output} = \sum_{e,l} w_e \cdot p_{e,l} \cdot \text{softmax}(v_{e,l})\]
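Steps 1 to 3 can be traced by hand on one small tree. The sketch below builds a single depth-2 tree in plain Python with made-up thresholds, feature values, and leaf values. It uses one estimator with weight 1, so the regression output reduces to the sum over leaves, and the helper names are hypothetical.

```python
def softsign(v):
    return v / (1.0 + abs(v))

def node_result(threshold, feature_value):
    """Soft split decision in (0, 1); s_1 is the threshold, s_2 the feature value."""
    return (softsign(threshold - feature_value) + 1.0) / 2.0

def path_probability(path_ids, node_results):
    """Product over the nodes on one root-to-leaf path."""
    p = 1.0
    for path_id, result in zip(path_ids, node_results):
        p *= (1.0 - path_id) * result + path_id * (1.0 - result)
    return p

# One root split and one split per child.
root = node_result(threshold=0.5, feature_value=0.2)
left = node_result(threshold=1.0, feature_value=1.0)
right = node_result(threshold=-1.0, feature_value=0.0)

leaf_probs = [
    path_probability([0, 0], [root, left]),
    path_probability([0, 1], [root, left]),
    path_probability([1, 0], [root, right]),
    path_probability([1, 1], [root, right]),
]
leaf_values = [1.0, 2.0, 3.0, 4.0]
tree_output = sum(p * v for p, v in zip(leaf_probs, leaf_values))
```

The leaf probabilities of a tree always sum to 1, so the output is a convex combination of the leaf values.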
get_representation(inputs)

Extract intermediate tree representations for analysis.

Returns:

  • torch.Tensor – Tree path representations

Neural Oblivious Decision Ensembles (NODE)

class TALENT.model.models.node.NODE(*, d_in: int, num_layers: int, layer_dim: int, depth: int, tree_dim: int, choice_function: str, bin_function: str, d_out: int)

Bases: Module

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x_num: Tensor, x_cat: Tensor) Tensor

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class TALENT.model.models.node.NODE

Neural implementation of oblivious decision trees with differentiable splits.

__init__(*, d_in, num_layers, layer_dim, depth, tree_dim, choice_function, bin_function, d_out)

Initialize NODE architecture.

Parameters:

  • d_in (int) – Input feature dimension

  • num_layers (int) – Number of NODE layers

  • layer_dim (int) – Number of trees per layer

  • depth (int) – Tree depth

  • tree_dim (int) – Output dimension of each tree

  • choice_function (str) – Function for feature selection (‘entmax15’)

  • bin_function (str) – Function for threshold selection (‘entmoid15’)

  • d_out (int) – Output dimension

forward(x)

Forward pass through oblivious decision trees.

Decision Tree Mathematical Process:

  1. Feature Selection: Use entmax for sparse feature selection

  2. Threshold Comparison: Compare features with learned thresholds

  3. Path Aggregation: Aggregate predictions along tree paths

  4. Ensemble Combination: Combine outputs from multiple trees
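The four steps can be sketched for a single oblivious tree, in which every level applies one shared (feature, threshold) pair. To stay within the standard library, the sketch substitutes softmax for entmax15 and a temperature-scaled sigmoid for entmoid15. Both are stand-ins chosen for illustration, and all names and values are hypothetical.

```python
import itertools
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def oblivious_tree(x, selection_logits, thresholds, temperature, responses):
    bins = []
    for logits, threshold in zip(selection_logits, thresholds):
        choice = softmax(logits)                              # 1. feature selection
        feature = sum(c * v for c, v in zip(choice, x))
        bins.append(sigmoid((feature - threshold) / temperature))  # 2. threshold comparison
    output = 0.0
    for leaf, bits in enumerate(itertools.product([0, 1], repeat=len(bins))):
        weight = 1.0
        for bit, b in zip(bits, bins):                        # 3. weight of this leaf
            weight *= b if bit else 1.0 - b
        output += weight * responses[leaf]
    return output

x = [1.0, -1.0]
selection_logits = [[50.0, 0.0], [0.0, 50.0]]   # level 1 reads x[0], level 2 reads x[1]
thresholds = [0.0, 0.0]
responses = [10.0, 20.0, 30.0, 40.0]            # one response per leaf (2 ** depth)
sharp = oblivious_tree(x, selection_logits, thresholds, 0.01, responses)
constant = oblivious_tree(x, selection_logits, thresholds, 1.0, [7.0] * 4)
```

With a small temperature the tree behaves like a hard one: `x[0]` lies above its threshold and `x[1]` below, so the leaf with bits (1, 0) receives almost all the weight. Step 4 would sum or concatenate the outputs of many such trees.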

GrowNet (Gradient Boosting with Neural Networks)

class TALENT.model.models.grownet.DynamicNet(lr, categories: Optional[List[int]], d_embedding: Optional[int])

Bases: object

add(model)
embed_input(x_num, x_cat)
forward(x_num, x_cat)
forward_grad(x_num, x_cat)
classmethod from_file(path, builder)
parameters()
to_cuda()
to_double()
to_eval()
to_file(path)
to_train()
zero_grad()
class TALENT.model.models.grownet.ForwardType(value)

Bases: Enum

An enumeration.

CASCADE = 2
GRADIENT = 3
SIMPLE = 0
STACKED = 1
class TALENT.model.models.grownet.MLP_2HL(dim_in, dim_hidden1, dim_hidden2, dim_out, sparse=False, bn=True)

Bases: Module

forward(x, lower_f)

classmethod get_model(stage, opt)
class TALENT.model.models.grownet.SpLinear(input_features, output_features, bias=True)

Bases: Module

forward(input)

class TALENT.model.models.grownet.SpLinearFunc(*args, **kwargs)

Bases: Function

static backward(ctx, grad_output)

Defines a formula for differentiating the operation with backward mode automatic differentiation (alias to the vjp function).

This function is to be overridden by all subclasses.

It must accept a context ctx as the first argument, followed by as many outputs as the forward() returned (None will be passed in for non tensor outputs of the forward function), and it should return as many tensors, as there were inputs to forward(). Each argument is the gradient w.r.t the given output, and each returned value should be the gradient w.r.t. the corresponding input. If an input is not a Tensor or is a Tensor not requiring grads, you can just pass None as a gradient for that input.

The context can be used to retrieve tensors saved during the forward pass. It also has an attribute ctx.needs_input_grad as a tuple of booleans representing whether each input needs gradient. E.g., backward() will have ctx.needs_input_grad[0] = True if the first input to forward() needs gradient computed w.r.t. the output.

static forward(ctx, input, weight, bias=None)

This function is to be overridden by all subclasses. There are two ways to define forward:

Usage 1 (Combined forward and ctx):

@staticmethod
def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:
    pass
  • It must accept a context ctx as the first argument, followed by any number of arguments (tensors or other types).

  • See combining-forward-context for more details

Usage 2 (Separate forward and ctx):

@staticmethod
def forward(*args: Any, **kwargs: Any) -> Any:
    pass

@staticmethod
def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:
    pass
  • The forward no longer accepts a ctx argument.

  • Instead, you must also override the torch.autograd.Function.setup_context() staticmethod to handle setting up the ctx object. output is the output of the forward, inputs are a Tuple of inputs to the forward.

  • See extending-autograd for more details

The context can be used to store arbitrary data that can be then retrieved during the backward pass. Tensors should not be stored directly on ctx (though this is not currently enforced for backward compatibility). Instead, tensors should be saved either with ctx.save_for_backward() if they are intended to be used in backward (equivalently, vjp) or ctx.save_for_forward() if they are intended to be used in jvp.

class TALENT.model.models.grownet.GrowNet

Gradient boosting framework with neural network weak learners.

__init__(input_dim, output_dim, boost_rate, layers_per_net, layer_dims, dropout)

Initialize GrowNet with neural weak learners.

Gradient Boosting Process:

  1. Weak Learner Training: Train neural networks on residuals

  2. Boosting Update: Add weak learners with adaptive weights

  3. Gradient Computation: Compute gradients for next weak learner

forward(x)

Forward pass through the boosted ensemble.

Boosting Mathematical Formulation:

\[F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)\]

where \(h_m\) is the m-th weak learner and \(\gamma_m\) is the boosting rate.
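The additive update can be illustrated on a toy regression problem. In the sketch below the weak learners are one-dimensional decision stumps fitted to the current residuals. This is a stand-in chosen to keep the example dependency-free, whereas GrowNet uses small neural networks. The boosting rate is a fixed constant here, and all names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Hypothetical weak learner: best single-threshold step function on 1-D inputs."""
    best = None
    for threshold in xs:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right) if right else 0.0
        error = sum(
            (r - (left_mean if x <= threshold else right_mean)) ** 2
            for x, r in zip(xs, residuals)
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boosted_prediction(x, learners, boost_rate):
    # F_m(x) = F_{m-1}(x) + boost_rate * h_m(x), unrolled over all learners
    return sum(boost_rate * h(x) for h in learners)

def mse(xs, ys, learners, boost_rate):
    errors = [(y - boosted_prediction(x, learners, boost_rate)) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]
boost_rate = 0.5
learners = []
losses = [mse(xs, ys, learners, boost_rate)]
for _ in range(5):
    residuals = [y - boosted_prediction(x, learners, boost_rate) for x, y in zip(xs, ys)]
    learners.append(fit_stump(xs, residuals))      # train the next weak learner on residuals
    losses.append(mse(xs, ys, learners, boost_rate))
```

Each added learner fits what the current ensemble still gets wrong, so the training loss does not increase from one stage to the next.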

Distance-Based Models

Modern Neighborhood Component Analysis (ModernNCA)

class TALENT.model.models.modernNCA.MLP_Block(*args: Any, **kwargs: Any)

Bases: Module

forward(x: torch.Tensor) torch.Tensor
class TALENT.model.models.modernNCA.ModernNCA(*args: Any, **kwargs: Any)

Bases: Module

forward(x, y, candidate_x, candidate_y, is_train)
make_layer()
class TALENT.model.models.modernNCA.ModernNCA

Neighborhood Component Analysis-inspired model for embedding-based predictions.

Mathematical Formulation:

ModernNCA learns embeddings for distance-based classification.

__init__(d_in, d_out, k, dropout, d_embedding)

Initialize ModernNCA model.

Parameters:

  • d_in (int) – Input feature dimension

  • d_out (int) – Output dimension (number of classes)

  • k (int) – Number of nearest neighbors to consider

  • dropout (float) – Dropout probability

  • d_embedding (int) – Embedding dimension

forward(x, y, candidate_x, candidate_y, is_train)

Forward pass with neighborhood analysis.

Parameters:

  • x (torch.Tensor) – Query features

  • y (torch.Tensor) – Query labels

  • candidate_x (torch.Tensor) – Candidate features for nearest neighbor search

  • candidate_y (torch.Tensor) – Candidate labels

  • is_train (bool) – Training mode flag

Returns:

  • torch.Tensor – Distance-based predictions

Distance-Based Prediction Mathematical Implementation:

  1. Embedding Computation:

    \[e_i = f(x_i), \quad e_j = f(x_j)\]

    where \(f\) is the learned embedding function.

  2. Distance Computation:

    \[d(x_i, x_j) = ||e_i - e_j||_2\]
  3. Neighbor Weighting:

    \[p_{ij} = \frac{\exp(-d(x_i, x_j))}{\sum_{k \neq i} \exp(-d(x_i, x_k))}\]
  4. Final Prediction:

    \[\hat{y}_i = \sum_j p_{ij} y_j\]
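Steps 2 to 4 can be sketched for one query in plain Python. The embedding function \(f\) is taken as the identity and the labels are class indices, both simplifications for illustration. The function name `soft_neighbor_prediction` is hypothetical, and the candidate set is assumed not to contain the query itself, matching the \(k \neq i\) restriction in the formula.

```python
import math

def soft_neighbor_prediction(query, candidates, candidate_labels, n_classes):
    distances = [math.dist(query, c) for c in candidates]      # Euclidean distance
    closest = min(distances)
    weights = [math.exp(-(d - closest)) for d in distances]    # shifted for stability
    total = sum(weights)
    probs = [0.0] * n_classes
    for w, label in zip(weights, candidate_labels):
        probs[label] += w / total                              # sum_j p_ij * onehot(y_j)
    return probs

candidates = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
candidate_labels = [0, 0, 1]
probs = soft_neighbor_prediction([0.0, 0.2], candidates, candidate_labels, n_classes=2)
```

Subtracting the smallest distance before exponentiating leaves the weights \(p_{ij}\) unchanged, because the common factor cancels in the normalization.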
knn_prediction(x, candidate_x, candidate_y, k)

Make predictions using k-nearest neighbors in embedding space.

K-NN Process:

  1. Distance Calculation: Compute distances in embedding space

  2. Neighbor Selection: Find k nearest neighbors

  3. Prediction Aggregation: Aggregate neighbor labels with distance weighting

Specialized Architectures

ExcelFormer (Semi-Permeable Attention)

class TALENT.model.models.excelformer.ExcelFormer(*, d_numerical: int, token_bias: bool, n_layers: int, d_token: int, n_heads: int, attention_dropout: float, ffn_dropout: float, residual_dropout: float, prenormalization: bool, kv_compression: Optional[float], kv_compression_sharing: Optional[str], d_out: int, init_scale: float = 0.1)

Bases: Module

ExcelFormer with all parameters initialized to small values

initial function: v4

forward(x_num: Tensor, x_cat=None, mix_up: bool = False, beta=0.5, mtype='feat_mix') Tensor

class TALENT.model.models.excelformer.MultiheadAttention(d: int, n_heads: int, dropout: float, init_scale: float = 0.01)

Bases: Module

forward(x_q: Tensor, x_kv: Tensor, key_compression: Optional[Linear], value_compression: Optional[Linear]) Tensor

get_attention_mask(input_shape, device)
class TALENT.model.models.excelformer.Tokenizer(d_numerical: int, categories: Optional[List[int]], d_token: int, bias: bool)

Bases: Module

category_offsets: Optional[Tensor]
forward(x_num: Tensor) Tensor

property n_tokens: int
TALENT.model.models.excelformer.attenuated_kaiming_uniform_(tensor, a=2.23606797749979, scale=1.0, mode='fan_in', nonlinearity='leaky_relu')
class TALENT.model.models.excelformer.ExcelFormer

Transformer with semi-permeable attention and mixup training capabilities.

__init__(d_numerical, d_token, n_layers, attention_dropout, ffn_dropout, residual_dropout, d_out)

Initialize ExcelFormer architecture.

Parameters:

  • d_numerical (int) – Number of numerical features

  • d_token (int) – Token embedding dimension

  • n_layers (int) – Number of transformer blocks

  • attention_dropout (float) – Attention dropout probability

  • ffn_dropout (float) – Feed-forward dropout probability

  • residual_dropout (float) – Residual connection dropout

  • d_out (int) – Output dimension

forward(x_num, x_cat, mix_up, beta, mtype)

Forward pass with optional mixup augmentation.

Parameters:

  • x_num (torch.Tensor) – Numerical features

  • x_cat (torch.Tensor, optional) – Categorical features

  • mix_up (bool) – Whether to apply mixup

  • beta (float) – Mixup parameter (default: 0.5)

  • mtype (str) – Mixup type (‘feat_mix’, ‘hidden_mix’, ‘naive_mix’)

Returns:

  • tuple – (output, feat_masks, shuffled_ids) for mixup training

Mixup Mathematical Implementation:

Feature Mixup:

\[\tilde{x} = \lambda x_i + (1-\lambda) x_j\]

Semi-Permeable Attention:

\[\text{Attention}_{\text{perm}}(Q, K, V) = \text{mask} \odot \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
mixup_process(x, beta, mtype)

Apply mixup augmentation to input features.

Mixup Types:

  • feat_mix: Feature-level mixing with learnable weights

  • hidden_mix: Hidden representation mixing

  • naive_mix: Simple linear interpolation
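The simplest of these variants, linear interpolation between a sample and a randomly chosen partner, can be sketched without any tensor library. This is a minimal illustration, not the ExcelFormer implementation: the function name `naive_mixup`, the plain-list batch, and drawing the coefficient from a Beta(beta, beta) distribution are all assumptions made for the example.

```python
import random


def naive_mixup(batch, beta=0.5, rng=None):
    """Interpolate each row with a randomly chosen partner row.

    Returns the mixed batch, the sampled coefficient, and the partner indices.
    """
    rng = rng or random.Random()
    lam = rng.betavariate(beta, beta)      # assumed: lambda ~ Beta(beta, beta)
    shuffled_ids = list(range(len(batch)))
    rng.shuffle(shuffled_ids)              # partner row j for every row i
    mixed = [
        [lam * a + (1.0 - lam) * b for a, b in zip(batch[i], batch[j])]
        for i, j in enumerate(shuffled_ids)
    ]
    return mixed, lam, shuffled_ids


batch = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
mixed, lam, shuffled_ids = naive_mixup(batch, beta=0.5, rng=random.Random(0))
```

Returning the partner indices alongside the mixed batch lets the training loop interpolate the targets with the same coefficient.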

ProtoGate (Prototype-Based Gating)

class TALENT.model.models.protogate.DeactFunc(*args: Any, **kwargs: Any)

Bases: Module

forward(x)
class TALENT.model.models.protogate.GatingNet(*args: Any, **kwargs: Any)

Bases: Module

Gating Network for feature selection

Parameters
  • input_dim (int) – input dimension of the gating network

  • a (float) – coefficient in hard relu activation function

  • sigma (float) – std of the Gaussian reparameterization noise

  • activation (str) – activation function of the gating net: ‘relu’, ‘l_relu’, ‘sigmoid’, ‘tanh’, or ‘none’

  • hidden_layer_list (list) – number of nodes for each hidden layer of the gating net, example: [200,200]

forward(x)
get_stochastic_gate(alpha)

This function replaced the feature_selector function in order to save Z

hard_sigmoid(x)

Segment-wise linear approximation of sigmoid. Faster than sigmoid. Returns 0 if x < -2.5 and 1 if x > 2.5. For -2.5 <= x <= 2.5, returns 0.2 * x + 0.5.

Parameters
  • x – A tensor or variable.

Returns
  • A tensor.

class TALENT.model.models.protogate.HybridSort(*args: Any, **kwargs: Any)

Bases: Module

forward(scores: torch.Tensor)

scores: elements to be sorted. Typical shape: batch_size x n x 1

class TALENT.model.models.protogate.KNNNet(*args: Any, **kwargs: Any)

Bases: Module

forward(query, neighbors, tau=1.0)
class TALENT.model.models.protogate.PL(*args: Any, **kwargs: Any)

Bases: Distribution

Parameters
  • scores – Shape: (batch_size x) n

  • tau – temperature for the relaxation. Scalar.

  • hard – use straight-through estimation if True

arg_constraints = {'scores': torch.distributions.constraints.positive, 'tau': torch.distributions.constraints.positive}
has_rsample = True
log_prob(value)

value: permutation matrix. shape: batch_size x n x n

property mean
relaxed_sort(inp)

inp: elements to be sorted. Typical shape: batch_size x n x 1

rsample(sample_shape, log_score=True)

sample_shape: number of samples from the PL distribution. Scalar.
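Assuming PL denotes the Plackett-Luce distribution over permutations, its exact (non-relaxed) sampling and log-probability can be sketched in plain Python. The helper names `pl_sample` and `pl_log_prob` are hypothetical, and permutations are represented as index lists rather than the permutation matrices used by the class above; the temperature `tau` and the straight-through option are not modeled here.

```python
import itertools
import math
import random


def pl_sample(scores, rng=None):
    """Draw one permutation (list of indices, most preferred first)."""
    rng = rng or random.Random()
    # Exponential race: item i "arrives" at a time drawn from Exp(rate=scores[i]);
    # the order of arrival follows the Plackett-Luce distribution.
    arrival = [rng.expovariate(s) for s in scores]
    return sorted(range(len(scores)), key=lambda i: arrival[i])


def pl_log_prob(scores, perm):
    """Log-probability of `perm` under Plackett-Luce with positive scores."""
    remaining = sum(scores)
    logp = 0.0
    for i in perm:
        # Item i is picked among the remaining items in proportion to its score.
        logp += math.log(scores[i] / remaining)
        remaining -= scores[i]
    return logp


scores = [1.0, 2.0, 3.0]
perm = pl_sample(scores, rng=random.Random(0))
# The probabilities of all n! permutations sum to one.
total = sum(
    math.exp(pl_log_prob(scores, list(p)))
    for p in itertools.permutations(range(len(scores)))
)
```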

TALENT.model.models.protogate.get_activation(value)
class TALENT.model.models.protogate.ProtoGate

Prototype-based model with gating mechanisms for interpretable feature selection.

__init__(input_dim, output_dim, n_prototypes, n_components, dropout)

Initialize ProtoGate architecture.

Parameters:

  • input_dim (int) – Input feature dimension

  • output_dim (int) – Output dimension

  • n_prototypes (int) – Number of learned prototypes

  • n_components (int) – Number of components per prototype

  • dropout (float) – Dropout probability

forward(x)

Forward pass with prototype-based gating.

Prototype-Based Processing:

  1. Prototype Computation: Learn representative prototypes from data

  2. Distance Calculation: Compute distances to prototypes

  3. Gate Generation: Use distances to generate feature gates

  4. Feature Selection: Apply gates for adaptive feature selection

class TALENT.model.models.protogate.GatingNet

Gating network for prototype-based feature selection.

hard_sigmoid(x)

Hard sigmoid activation for efficient gating.

Hard Sigmoid Mathematical Definition:

\[\text{hard_sigmoid}(x) = \max(0, \min(1, 0.2x + 0.5))\]

This provides a piecewise linear approximation to the sigmoid function for computational efficiency.
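A piecewise-linear hard sigmoid can be sketched as a clamped line through (0, 0.5). The slope argument `a` is an assumption made for illustration: a = 0.2 reproduces the 0.2 * x + 0.5 form that saturates at x = ±2.5, while a = 0.5 gives (x + 1) / 2, which saturates at x = ±1. The scalar signature is also a simplification of the tensor version.

```python
import math


def hard_sigmoid(x, a=0.2):
    """Clamp the line a * x + 0.5 to the interval [0, 1]."""
    return max(0.0, min(1.0, a * x + 0.5))


def sigmoid(x):
    """Reference logistic sigmoid, for comparison only."""
    return 1.0 / (1.0 + math.exp(-x))


# Largest gap between the approximation and the true sigmoid on [-6, 6].
grid = [i / 10.0 for i in range(-60, 61)]
max_gap = max(abs(hard_sigmoid(x) - sigmoid(x)) for x in grid)
```

The clamp avoids the exponential entirely, which is where the speed advantage over the logistic sigmoid comes from.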

forward(x)

Generate gating weights for feature selection.

Retrieval-Based Models

TabR (Tabular Retrieval)

class TALENT.model.models.tabr.TabR(*args: Any, **kwargs: Any)

Bases: Module

forward(*, x_num: torch.Tensor, x_cat: Optional[torch.Tensor], y: Optional[torch.Tensor], candidate_x_num: Optional[torch.Tensor], candidate_x_cat: Optional[torch.Tensor], candidate_y: torch.Tensor, context_size: int, is_train: bool) torch.Tensor
reset_parameters()
class TALENT.model.models.tabr.TabR

KNN-attention hybrid model with retrieval-based predictions.

__init__(n_num_features, n_cat_features, n_classes, context_size, normalization, num_embeddings, d_main, d_multiplier, encoder_n_blocks, predictor_n_blocks, mixer_normalization, dropout0, dropout1, activation)

Initialize TabR architecture.

Parameters:

  • n_num_features (int) – Number of numerical features

  • n_cat_features (int) – Number of categorical features

  • n_classes (int) – Number of output classes

  • context_size (int) – Maximum context size for retrieval

  • normalization (str) – Normalization type

  • num_embeddings (dict) – Embedding configurations

  • d_main (int) – Main hidden dimension

  • d_multiplier (int) – Dimension multiplier

  • encoder_n_blocks (int) – Number of encoder blocks

  • predictor_n_blocks (int) – Number of predictor blocks

  • mixer_normalization (str) – Mixer normalization type

  • dropout0 (float) – Input dropout

  • dropout1 (float) – Hidden dropout

  • activation (str) – Activation function

forward(*, x_num, x_cat, y, candidate_x_num, candidate_x_cat, candidate_y, context_size, is_train)

Forward pass with retrieval-based attention.

Retrieval Process:

  1. Context Selection: Select relevant examples from training set

  2. Attention Computation: Apply attention over retrieved candidates

  3. Feature Processing: Process query and candidate features

  4. Prediction Generation: Combine retrieval and learned representations
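The retrieval steps above can be illustrated with a deliberately simplified sketch: nearest candidates are chosen by squared L2 distance on raw features, and their labels are averaged with softmax weights. This is not TabR's actual module, which operates on learned encodings; the function name `retrieve_and_predict`, the scalar labels, and the distance-based similarity are assumptions for the example.

```python
import math


def retrieve_and_predict(query, candidates, candidate_y, context_size):
    """Average the labels of the nearest candidates, weighted by similarity."""
    # 1. Context selection: squared L2 distance from the query to each candidate.
    dists = [sum((q - c) ** 2 for q, c in zip(query, cand)) for cand in candidates]
    nearest = sorted(range(len(candidates)), key=lambda i: dists[i])[:context_size]
    # 2. Attention: numerically stable softmax over negative distances.
    logits = [-dists[i] for i in nearest]
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    # 3. and 4. Aggregate the retrieved labels into a prediction.
    prediction = sum(w * candidate_y[i] for w, i in zip(weights, nearest))
    return prediction, nearest


candidates = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
candidate_y = [0.0, 0.0, 1.0]
pred, context = retrieve_and_predict([0.1, 0.1], candidates, candidate_y, context_size=2)
```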

Foundation Models

TabPFN (Tabular Prior-data Fitted Network)

class TALENT.model.models.tabpfn.TabPFNClassifier(*args: Any, **kwargs: Any)

Bases: BaseEstimator, ClassifierMixin

Initializes the classifier and loads the model. Depending on the arguments, the model is either loaded from memory, from a file, or downloaded from the repository if no model is found.

Can also be used to compute gradients with respect to the inputs X_train and X_test. To do so, no_grad must be set to False and no_preprocess_mode must be True. Furthermore, X_train and X_test must be given as torch.Tensors with requires_grad set to True.

Parameters
  • device – If the model should run on cuda or cpu.

  • base_path – Base path of the directory, from which the folders like models_diff can be accessed.

  • model_string – Name of the model. Used first to check if the model is already in memory, and if not, tries to load a model with that name from the models_diff directory. It looks for files named as follows: “prior_diff_real_checkpoint” + model_string + “_n_0_epoch_e.cpkt”, where e can be a number between 100 and 0, and is checked in a descending order.

  • N_ensemble_configurations – The number of ensemble configurations used for the prediction. Both the accuracy and the running time increase with this number.

  • no_preprocess_mode – Specifies whether preprocessing is to be performed.

  • multiclass_decoder – If set to permutation, randomly shifts the classes for each ensemble configuration.

  • feature_shift_decoder – If set to true shifts the features for each ensemble configuration according to a random permutation.

  • only_inference – Indicates if the model should be loaded to only restore inference capabilities or also training capabilities. Note that the training capabilities are currently not being fully restored.

  • seed – Seed that is used for the prediction. Allows for a deterministic behavior of the predictions.

  • batch_size_inference – This parameter is a trade-off between performance and memory consumption. The computation done with different values for batch_size_inference is the same, but it is split into smaller/larger batches.

  • no_grad – If set to false, allows for the computation of gradients with respect to X_train and X_test. For this to function correctly, no_preprocess_mode must be set to true.

  • subsample_features – If set to true and the number of features in the dataset exceeds self.max_features (100), the features are subsampled to self.max_features.

fit(X, y, overwrite_warning=False)

Validates the training set and stores it.

If clf.no_grad is True (the default), X and y should be of type np.array. Otherwise, X should be a torch.Tensor (y can be np.array or torch.Tensor).

load_result_minimal(path, i, e)
models_in_memory = {}
predict(X, return_winning_probability=False, normalize_with_test=False)
predict_proba(X, normalize_with_test=False, return_logits=False)

Predict the probabilities for the input X depending on the training set previously passed in the method fit.

If no_grad is true in the classifier the function takes X as a numpy.ndarray. If no_grad is false X must be a torch tensor and is not fully checked.

remove_models_from_memory()
class TALENT.model.models.tabpfn.TabPFNClassifier

Prior-data fitted network for zero-shot tabular classification.

__init__(device, base_path)

Initialize TabPFN with pre-trained weights.

Foundation Model Features:

  • Pre-trained on synthetic tabular datasets sampled from a prior

  • No gradient-based training required

  • Immediate deployment capability

  • Context-based learning from examples

fit(X, y)

Fit the model using in-context learning (no parameter updates).

In-Context Learning Process:

  1. Context Setup: Store training examples as context

  2. No Weight Updates: Model weights remain frozen

  3. Context Encoding: Encode training data for reference

predict_proba(X)

Make predictions using in-context learning.

Zero-Shot Prediction:

  1. Context Retrieval: Use stored training context

  2. Attention Mechanism: Apply attention over training examples

  3. Prediction Generation: Generate predictions without fine-tuning

Regularization Methods

TANGOS Regularization

class TALENT.model.models.tangos.Tangos(*args: Any, **kwargs: Any)

Bases: Module

cal_representation(x)
cal_tangos_loss(x)
forward(x, x_cat)
class TALENT.model.models.tangos.Tangos

MLP with TANGOS regularization for neuron specialization.

Mathematical Formulation:

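TANGOS (Tabular Neural Gradient Orthogonalization and Specialization) regularizes the gradient attributions of hidden neurons with respect to the input features. A specialization term encourages each neuron to attend to a sparse set of features, and an orthogonalization term encourages different neurons to attend to non-overlapping features:

\[\mathcal{L}_{\text{TANGOS}} = \mathcal{L}_{\text{task}} + \lambda_1 \mathcal{L}_{\text{spec}} + \lambda_2 \mathcal{L}_{\text{orth}}\]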
__init__(d_in, d_out, d_layers, dropout, lambda1, lambda2)

Initialize TANGOS-regularized MLP.

Parameters:

  • d_in (int) – Input dimension

  • d_out (int) – Output dimension

  • d_layers (List[int]) – Hidden layer dimensions

  • dropout (float) – Dropout probability

  • lambda1 (float) – Weight of the specialization regularizer

  • lambda2 (float) – Weight of the orthogonalization regularizer

forward(x, x_cat=None)

Forward pass with standard MLP architecture.

cal_representation(x)

Calculate intermediate representations for regularization.

Parameters:

  • x (torch.Tensor) – Input features

Returns:

  • torch.Tensor – Hidden representations before final layer

Representation Extraction Process:

The method extracts intermediate representations by stopping before the final layer:

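import torch.nn.functional as F

def cal_representation(self, x):
    # Apply every hidden layer; the output head is deliberately skipped.
    for i, layer in enumerate(self.layers):
        x = F.relu(layer(x))
        # Dropout follows every hidden layer except the last one.
        if self.dropout and i != len(self.layers) - 1:
            x = F.dropout(x, self.dropout, self.training)
    return x  # Hidden representation before the final head layer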

Regularization Applications:

  • Specialization (weighted by lambda1): Encourages each neuron's attributions to be sparse over the input features

  • Orthogonalization (weighted by lambda2): Encourages different neurons to attend to non-overlapping features
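How the lambda1-weighted and lambda2-weighted terms enter the total loss can be sketched from a matrix of per-neuron feature attributions. Following the description in the TANGOS paper, the first term is an L1 penalty on each neuron's attributions and the second is a pairwise cosine-similarity penalty between neurons. The helper names, the plain-list inputs, and the use of the absolute cosine similarity are assumptions for this example; in the model the attributions would be gradients of hidden activations with respect to the inputs.

```python
import math


def tangos_penalties(attributions):
    """attributions[i][j]: attribution of input feature j to hidden neuron i."""
    n = len(attributions)
    # lambda1 term: mean L1 norm of each neuron's attribution vector.
    l1_term = sum(sum(abs(a) for a in row) for row in attributions) / n

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm > 0 else 0.0

    # lambda2 term: mean absolute cosine similarity over all neuron pairs.
    pairs = [(i, j) for i in range(n) for j in range(i)]
    overlap_term = sum(
        abs(cosine(attributions[i], attributions[j])) for i, j in pairs
    ) / len(pairs)
    return l1_term, overlap_term


def tangos_loss(task_loss, attributions, lambda1, lambda2):
    """Task loss plus the two weighted attribution penalties."""
    l1_term, overlap_term = tangos_penalties(attributions)
    return task_loss + lambda1 * l1_term + lambda2 * overlap_term


disjoint = [[1.0, 0.0], [0.0, 1.0]]   # neurons attend to different features
shared = [[1.0, 0.0], [1.0, 0.0]]     # neurons attend to the same feature
loss_disjoint = tangos_loss(1.0, disjoint, lambda1=0.1, lambda2=0.01)
loss_shared = tangos_loss(1.0, shared, lambda1=0.1, lambda2=0.01)
```

With equal L1 norms, the two cases differ only in the pairwise term, so the configuration with overlapping attributions receives the higher loss.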

Activation Functions Reference

Standard Activations:

\[\text{ReLU}(x) = \max(0, x)\]
\[\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]\]
\[\begin{split}\text{SELU}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases}\end{split}\]

Gated Activations:

\[\text{ReGLU}(x) = a \cdot \text{ReLU}(b) \text{ where } [a, b] = \text{split}(x)\]
\[\text{GeGLU}(x) = a \cdot \text{GELU}(b) \text{ where } [a, b] = \text{split}(x)\]
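The standard and gated activations above translate directly into scalar and list-based Python. This is a framework-free sketch for checking the formulas by hand, not the tensor implementation used by the models; the helper names and the choice to split the input list into two equal halves are assumptions. The SELU constants are the standard published values.

```python
import math

SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805


def relu(x):
    return max(0.0, x)


def gelu(x):
    # x * Phi(x), with Phi the standard normal CDF written via erf.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def selu(x):
    return SELU_LAMBDA * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def _split(xs):
    """Split a list into halves [a, b]; the length is assumed to be even."""
    half = len(xs) // 2
    return xs[:half], xs[half:]


def reglu(xs):
    a, b = _split(xs)
    return [ai * relu(bi) for ai, bi in zip(a, b)]


def geglu(xs):
    a, b = _split(xs)
    return [ai * gelu(bi) for ai, bi in zip(a, b)]


gated = reglu([1.0, 2.0, 3.0, -4.0])   # a = [1, 2], b = [3, -4]
```

Note that a gated activation halves the feature dimension, which is why layers feeding ReGLU or GeGLU produce twice as many outputs as the following layer consumes.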

Probability Functions:

\[\text{Softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^K \exp(x_j)}\]
\[\text{Sparsemax}(z) = \arg\min_{p \in \Delta^{K-1}} ||p - z||_2^2\]

where \(\Delta^{K-1}\) is the probability simplex.
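Both probability functions can be sketched on plain lists. Softmax is written in its numerically stable form, and sparsemax uses the usual sort-and-threshold solution of the projection problem above. The function names and list inputs are assumptions for the example.

```python
import math


def softmax(z):
    top = max(z)                        # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(z):
    """Euclidean projection of z onto the probability simplex."""
    cumulative = 0.0
    tau = 0.0
    for k, v in enumerate(sorted(z, reverse=True), start=1):
        cumulative += v
        # The support consists of the k largest entries satisfying this test;
        # the last k that passes determines the threshold tau.
        if 1.0 + k * v > cumulative:
            tau = (cumulative - 1.0) / k
    return [max(v - tau, 0.0) for v in z]


dense = softmax([2.0, 0.0, 0.0])        # every entry stays strictly positive
sparse = sparsemax([2.0, 0.0, 0.0])     # small entries are set exactly to zero
```

The contrast between `dense` and `sparse` shows why sparsemax is used for feature selection masks: it can assign exactly zero weight, while softmax cannot.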

Model Usage Examples

Basic MLP Usage:

import torch
from TALENT.model.models.mlp import MLP

# Initialize MLP
model = MLP(
    d_in=10,           # Input dimension
    d_out=3,           # Output dimension (3 classes)
    d_layers=[64, 32], # Hidden layer sizes
    dropout=0.1        # Dropout probability
)

# Forward pass
x = torch.randn(32, 10)  # Batch of 32 samples, 10 features
output = model(x)        # Shape: (32, 3)

ResNet with Advanced Activations:

from TALENT.model.models.resnet import ResNet

# Initialize ResNet with GeGLU activation
model = ResNet(
    d_in=15,
    d_out=1,                    # Regression task
    d=128,                      # Hidden dimension
    d_hidden_factor=2.0,        # Hidden expansion factor
    n_layers=4,                 # Number of residual blocks
    activation='geglu',         # GeGLU activation
    normalization='layernorm',  # Layer normalization
    hidden_dropout=0.1,
    residual_dropout=0.1
)

FT-Transformer with Mixed Features:

from TALENT.model.models.ftt import Transformer

# Initialize FT-Transformer
model = Transformer(
    d_numerical=8,          # 8 numerical features
    categories=[5, 10, 3],  # 3 categorical features with cardinalities
    d_token=64,             # Token dimension
    n_layers=3,             # Number of transformer layers
    n_heads=8,              # Attention heads
    d_ffn_factor=2.0,       # FFN expansion factor
    attention_dropout=0.1,
    ffn_dropout=0.1,
    residual_dropout=0.1,
    activation='reglu',
    prenormalization=True,
    d_out=5                 # 5 classes
)

TabNet for Interpretable Classification:

from TALENT.model.models.tabnet import TabNetClassifier

# Initialize TabNet
model = TabNetClassifier(
    n_steps=3,              # Decision steps
    gamma=1.3,              # Relaxation parameter
    n_independent=2,        # Independent GLU layers
    n_shared=2,             # Shared GLU layers
    momentum=0.02,          # Batch norm momentum
    lambda_sparse=1e-3      # Sparsity regularization
)

# Training
model.fit(X_train, y_train,
          eval_set=[(X_val, y_val)],
          max_epochs=100)

# Get predictions and explanations
predictions = model.predict_proba(X_test)
explanations = model.explain(X_test, normalize=True)

GRANDE for Tree-like Neural Networks:

from TALENT.model.models.grande import GRANDE

# Initialize GRANDE
model = GRANDE(
    batch_size=64,
    task_type='classification',
    depth=4,              # Tree depth
    n_estimators=10,      # Number of trees
    dropout=0.1
)

ModernNCA with Distance-Based Learning:

from TALENT.model.models.modernNCA import ModernNCA

# Initialize ModernNCA
model = ModernNCA(
    d_in=15,
    d_out=4,              # 4 classes
    k=32,                 # Number of neighbors
    dropout=0.1,
    d_embedding=64        # Embedding dimension
)

# Training requires candidate examples
output = model(x, y, candidate_x, candidate_y, is_train=True)

ExcelFormer with Mixup Training:

from TALENT.model.models.excelformer import ExcelFormer

# Initialize ExcelFormer
model = ExcelFormer(
    d_numerical=10,
    d_token=64,
    n_blocks=3,
    attention_dropout=0.1,
    ffn_dropout=0.1,
    residual_dropout=0.1,
    d_out=3
)

# Forward pass with feature mixup
output, masks, shuffled_ids = model(
    x_num,
    x_cat=None,         # No categorical features in this example
    mix_up=True,
    beta=0.5,
    mtype='feat_mix'
)

TabPFN for Zero-Shot Learning:

from TALENT.model.models.tabpfn import TabPFNClassifier

# Initialize pre-trained TabPFN
model = TabPFNClassifier(device='cuda')

# No training required - just fit context
model.fit(X_train, y_train)

# Immediate predictions
predictions = model.predict_proba(X_test)

Model Selection Guidelines

For Beginners:

  • MLP: Simple, fast, good baseline

  • ResNet: Better than MLP for deeper networks

For Best Performance:

  • FT-Transformer: State-of-the-art on many datasets

  • TabNet: Excellent performance with interpretability

  • ModernNCA: Strong embedding-based performance

For Interpretability:

  • TabNet: Attention-based feature importance

  • GRANDE: Tree-like decision process

  • ProtoGate: Prototype-based explanations

For Speed:

  • MLP: Fastest training and inference

  • SNN: Lightweight with self-normalization

  • TabPFN: No training required

For Specific Scenarios:

  • TabR: Retrieval-based learning

  • ExcelFormer: Complex feature interactions with mixup

  • TANGOS: When regularization is critical