Models

Deep learning models for tabular data, implementing various state-of-the-art architectures.

This section documents the neural network architectures implemented in TALENT, from simple MLPs to transformer-based models designed specifically for tabular data. For each model it lists the constructor parameters, the forward-pass computation, and the underlying mathematical formulation.

Basic Neural Networks

Multi-Layer Perceptron (MLP)

class TALENT.model.models.mlp.MLP(*args: Any, **kwargs: Any)

Bases: Module

forward(x, x_cat=None)

class TALENT.model.models.mlp.MLP

Simple feedforward neural network with multiple fully connected layers and ReLU activations.

Mathematical Formulation:

For input \(x \in \mathbb{R}^{d_{in}}\), the MLP computes:

\[\begin{split}h_0 &= x \\ h_i &= \text{ReLU}(\text{Linear}(h_{i-1})) = \text{ReLU}(W_i h_{i-1} + b_i) \\ \text{output} &= W_{\text{head}} h_L + b_{\text{head}}\end{split}\]

where \(L\) is the number of hidden layers.
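The recurrence above can be traced by hand on a tiny example. The following is a minimal plain-Python sketch, not TALENT code: the helper names `linear`, `relu`, and `mlp_forward` and all weight values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def linear(weights, bias, h):
    """Compute W h + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weights, bias)]


def relu(h):
    """Element-wise max(0, x)."""
    return [max(0.0, v) for v in h]


def mlp_forward(x, hidden, head):
    """hidden: list of (W_i, b_i) pairs; head: (W_head, b_head)."""
    h = x  # h_0 = x
    for weights, bias in hidden:
        h = relu(linear(weights, bias, h))  # h_i = ReLU(W_i h_{i-1} + b_i)
    return linear(head[0], head[1], h)  # output = W_head h_L + b_head


# Hypothetical parameters: d_in = 2, one hidden layer of width 2, d_out = 1.
hidden = [([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0])]
head = ([[2.0, 1.0]], [0.5])
mlp_output = mlp_forward([3.0, 1.0], hidden, head)
```

Here the hidden layer produces \(h_1 = [2, 1]\) and the head maps it to 5.5.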

__init__(d_in, d_out, d_layers, dropout)

Initialize the MLP architecture.

Parameters:

d_in (int) – Input feature dimension
d_out (int) – Output dimension (number of classes for classification, 1 for regression)
d_layers (List[int]) – Hidden layer dimensions, e.g., [64, 32] for two hidden layers
dropout (float) – Dropout probability applied after each hidden layer

Architecture Construction:

Hidden Layers: Creates nn.Linear layers with dimensions specified in d_layers
Output Head: Final linear layer mapping to output dimension
Dropout Setup: Configures dropout for regularization during training

forward(x, x_cat=None)

Forward pass through the MLP network.

Parameters:

x (torch.Tensor) – Input numerical features of shape (batch_size, d_in)
x_cat (torch.Tensor, optional) – Categorical features (not used in MLP, maintained for interface consistency)

Returns:

torch.Tensor – Output predictions of shape (batch_size, d_out) or (batch_size,) for regression

Forward Pass Implementation:

for layer in self.layers:
    x = layer(x)  # Linear: x = W @ x + b
    x = F.relu(x)  # ReLU: x = max(0, x)
    if self.dropout:
        x = F.dropout(x, self.dropout, self.training)

logit = self.head(x)  # Final output layer
if self.d_out == 1:
    logit = logit.squeeze(-1)  # For regression

ReLU Activation:

\[\text{ReLU}(x) = \max(0, x)\]

Dropout Regularization:

During training, randomly sets elements to zero with probability dropout:

\[\begin{split}\text{Dropout}(x) = \begin{cases} \frac{x}{1-p} & \text{with probability } 1-p \\ 0 & \text{with probability } p \end{cases}\end{split}\]
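The rescaling by \(1/(1-p)\) keeps the expected activation unchanged, which is why no correction is needed at evaluation time. A minimal sketch of this inverted dropout in plain Python follows; `inverted_dropout` is an illustrative helper, not the PyTorch implementation.

```python
import random


def inverted_dropout(x, p, training=True, rng=random):
    """Zero each element with probability p and rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(x)  # identity at evaluation time
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in x]


rng = random.Random(0)
x = [1.0] * 10_000
dropped = inverted_dropout(x, p=0.2, rng=rng)
# The mean stays close to the original value of 1.0 in expectation.
mean_after = sum(dropped) / len(dropped)
```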

Residual Network (ResNet)

class TALENT.model.models.resnet.ResNet(*args: Any, **kwargs: Any)

Bases: Module

forward(x: torch.Tensor, x_cat: torch.Tensor) → torch.Tensor

TALENT.model.models.resnet.geglu(x)

TALENT.model.models.resnet.get_activation_fn(name)

TALENT.model.models.resnet.get_nonglu_activation_fn(name)

TALENT.model.models.resnet.reglu(x)

class TALENT.model.models.resnet.ResNet

Deep residual network for tabular data. Skip connections mitigate vanishing gradients in deep architectures.

Mathematical Formulation:

ResNet uses residual blocks with skip connections:

\[h_{i+1} = h_i + F(h_i, W_i)\]

where \(F(h_i, W_i)\) is the residual function.

__init__(d_in, d, d_hidden_factor, n_layers, activation, normalization, hidden_dropout, residual_dropout, d_out)

Initialize the ResNet architecture with configurable components.

Parameters:

d_in (int) – Input feature dimension
d (int) – Hidden dimension for residual blocks
d_hidden_factor (float) – Factor to scale hidden layer width within blocks
n_layers (int) – Number of residual blocks
activation (str) – Activation function (‘relu’, ‘gelu’, ‘reglu’, ‘geglu’)
normalization (str) – Normalization type (‘batchnorm’, ‘layernorm’)
hidden_dropout (float) – Dropout probability within residual blocks
residual_dropout (float) – Dropout probability for residual connections
d_out (int) – Output dimension

forward(x, x_cat=None)

Forward pass through the ResNet architecture.

Parameters:

x (torch.Tensor) – Input numerical features
x_cat (torch.Tensor, optional) – Categorical features (not used)

Returns:

torch.Tensor – Output predictions

Residual Block Mathematical Implementation:

For each residual block, the computation follows:

\[\begin{split}\text{residual} &= \text{Norm}(h_i) \\ \text{residual} &= \text{Linear}(\text{residual}) \\ \text{residual} &= \text{Activation}(\text{residual}) \\ \text{residual} &= \text{Dropout}(\text{residual}) \\ \text{residual} &= \text{Linear}(\text{residual}) \\ \text{residual} &= \text{Dropout}(\text{residual}) \\ h_{i+1} &= h_i + \text{residual}\end{split}\]
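The block structure can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the TALENT implementation: layer normalization without learned scale and shift stands in for Norm, ReLU for Activation, and both dropout steps are treated as the identity (evaluation mode). All helper names are hypothetical.

```python
import math


def layer_norm(h, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    return [(v - mean) / math.sqrt(var + eps) for v in h]


def linear(weights, bias, h):
    return [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weights, bias)]


def residual_block(h, w1, b1, w2, b2):
    """Norm -> Linear -> ReLU -> Linear, then add the skip connection."""
    r = layer_norm(h)
    r = [max(0.0, v) for v in linear(w1, b1, r)]
    r = linear(w2, b2, r)
    return [a + b for a, b in zip(h, r)]  # h_{i+1} = h_i + residual


h = [1.0, 3.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# With a zeroed second linear layer the block reduces to the identity map.
identity_out = residual_block(h, identity, [0.0, 0.0], zeros, [0.0, 0.0])
```

Because the residual branch is added to its own input, a block whose last layer outputs zero passes \(h_i\) through unchanged.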

Activation Functions:

ReLU: \(\text{ReLU}(x) = \max(0, x)\)
GELU: \(\text{GELU}(x) = x \cdot \Phi(x)\)
ReGLU: \(\text{ReGLU}(x) = a \cdot \text{ReLU}(b)\) where \(a, b = \text{split}(x)\)
GeGLU: \(\text{GeGLU}(x) = a \cdot \text{GELU}(b)\) where \(a, b = \text{split}(x)\)

reglu(x)

ReGLU activation function for gated linear units.

Mathematical Definition:

\[\text{ReGLU}(x) = a \cdot \text{ReLU}(b)\]

where \(a\) and \(b\) are obtained by splitting \(x\) along the last dimension.

geglu(x)

GeGLU activation function combining gating with GELU.

Mathematical Definition:

\[\text{GeGLU}(x) = a \cdot \text{GELU}(b)\]

where \(a\) and \(b\) are obtained by splitting \(x\) along the last dimension.
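Both gated activations halve the last dimension, because one half of the input gates the other. A minimal plain-Python sketch on a single vector follows; the helper `split_halves` is hypothetical, and GELU is computed exactly through the error function.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def split_halves(x):
    """Split a vector of even length into its first and second halves (a, b)."""
    if len(x) % 2 != 0:
        raise ValueError("GLU variants need an even last dimension")
    mid = len(x) // 2
    return x[:mid], x[mid:]


def reglu(x):
    a, b = split_halves(x)
    return [ai * max(0.0, bi) for ai, bi in zip(a, b)]


def geglu(x):
    a, b = split_halves(x)
    return [ai * gelu(bi) for ai, bi in zip(a, b)]


reglu_out = reglu([2.0, -3.0, 4.0, -1.0])  # a = [2, -3], b = [4, -1]
geglu_out = geglu([2.0, -3.0, 0.0, 0.0])   # GELU(0) = 0 closes both gates
```

A layer that feeds one of these activations therefore needs twice the output width of a layer followed by plain ReLU or GELU.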

Self-Normalizing Network (SNN)

class TALENT.model.models.snn.SNN(*args: Any, **kwargs: Any)

Bases: Module

calculate_output(x: torch.Tensor) → torch.Tensor

property d_embedding: int

encode(x_num, x_cat)

forward(x_num: torch.Tensor, x_cat) → torch.Tensor

class TALENT.model.models.snn.SNN

Lightweight neural network with self-normalizing properties using SELU activation.

__init__(d_in, d_out, d_layers, dropout)

Initialize SNN with SELU activations for self-normalization.

Parameters:

d_in (int) – Input dimension
d_out (int) – Output dimension
d_layers (List[int]) – Hidden layer dimensions
dropout (float) – Dropout probability

forward(x, x_cat=None)

Forward pass with SELU activation for self-normalization.

SELU Activation Mathematical Definition:

\[\begin{split}\text{SELU}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases}\end{split}\]

where \(\lambda \approx 1.0507\) and \(\alpha \approx 1.6733\).

Self-Normalization Property:

For normalized inputs and suitably initialized weights, SELU drives activations toward a fixed point:

The activation mean converges to 0
The activation variance converges to 1
Very deep networks can be trained without explicit normalization layers
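The fixed point can be checked numerically: standard normal inputs passed through SELU should again have mean near 0 and variance near 1. The sketch below uses only the Python standard library; the constants are the full-precision values of \(\lambda\) and \(\alpha\) quoted above.

```python
import math
import random

SELU_LAMBDA = 1.0507009873554805
SELU_ALPHA = 1.6732632423543772


def selu(x):
    """Scaled exponential linear unit for a single value."""
    if x > 0:
        return SELU_LAMBDA * x
    return SELU_LAMBDA * SELU_ALPHA * (math.exp(x) - 1.0)


rng = random.Random(0)
samples = [selu(rng.gauss(0.0, 1.0)) for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```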

Transformer-Based Models

Feature Tokenizer Transformer (FT-Transformer)

class TALENT.model.models.ftt.MultiheadAttention(*args: Any, **kwargs: Any)

Bases: Module

forward(x_q: torch.Tensor, x_kv: torch.Tensor, key_compression: Optional[torch.nn.Linear], value_compression: Optional[torch.nn.Linear]) → torch.Tensor

class TALENT.model.models.ftt.Tokenizer(*args: Any, **kwargs: Any)

Bases: Module

category_offsets: Optional[torch.Tensor]

forward(x_num: torch.Tensor, x_cat: Optional[torch.Tensor]) → torch.Tensor

property n_tokens: int

class TALENT.model.models.ftt.Transformer(*args: Any, **kwargs: Any)

Bases: Module

Transformer.

References:

https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html
https://github.com/facebookresearch/pytext/tree/master/pytext/models/representations/transformer
https://github.com/pytorch/fairseq/blob/1bba712622b8ae4efb3eb793a8a40da386fe11d0/examples/linformer/linformer_src/modules/multihead_linear_attention.py#L19

forward(x_num: torch.Tensor, x_cat: Optional[torch.Tensor]) → torch.Tensor

TALENT.model.models.ftt.geglu(x)

TALENT.model.models.ftt.get_activation_fn(name)

TALENT.model.models.ftt.get_nonglu_activation_fn(name)

TALENT.model.models.ftt.reglu(x)

class TALENT.model.models.ftt.Transformer

Advanced transformer architecture specifically designed for tabular data with feature tokenization.

Mathematical Formulation:

Feature Tokenization:

For numerical features: \(t_i^{\text{num}} = W_{\text{num}} x_i + b_{\text{num}}\)

For categorical features: \(t_i^{\text{cat}} = \text{Embedding}(x_i^{\text{cat}})\)

__init__(d_numerical, categories, d_token, n_layers, n_heads, d_ffn_factor, attention_dropout, ffn_dropout, residual_dropout, activation, prenormalization, d_out)

Initialize the FT-Transformer architecture.

Parameters:

d_numerical (int) – Number of numerical features
categories (List[int], optional) – Cardinalities for categorical features
d_token (int) – Token embedding dimension
n_layers (int) – Number of transformer layers
n_heads (int) – Number of attention heads
d_ffn_factor (float) – Factor for feed-forward network dimension
attention_dropout (float) – Dropout for attention weights
ffn_dropout (float) – Dropout for feed-forward network
residual_dropout (float) – Dropout for residual connections
activation (str) – Activation function for FFN
prenormalization (bool) – Whether to use pre-normalization
d_out (int) – Output dimension

forward(x_num, x_cat)

Forward pass through the transformer.

Parameters:

x_num (torch.Tensor, optional) – Numerical features of shape (batch_size, d_numerical)
x_cat (torch.Tensor, optional) – Categorical features of shape (batch_size, n_categorical)

Returns:

torch.Tensor – Output predictions

Transformer Processing Pipeline:

Tokenization: Convert features to tokens using Tokenizer
CLS Token Addition: Prepend classification token
Transformer Layers: Apply multi-head attention and feed-forward networks
Output Generation: Use CLS token representation for final prediction

Transformer Layer Mathematical Implementation:

For each transformer layer:

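\[\begin{split}\text{attn\_out} &= \text{MultiHeadAttention}(x, x, x) \\ x &= \text{LayerNorm}(x + \text{attn\_out}) \\ \text{ffn\_out} &= \text{FFN}(x) \\ x &= \text{LayerNorm}(x + \text{ffn\_out})\end{split}\]

These equations show the post-normalization form. When prenormalization is True, LayerNorm is instead applied to the input of each sub-layer, before the attention and before the FFN.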

class TALENT.model.models.ftt.Tokenizer

Converts numerical and categorical features into token embeddings for transformer processing.

__init__(d_numerical, categories, d_token, bias)

Initialize the feature tokenizer.

Parameters:

d_numerical (int) – Number of numerical features
categories (List[int], optional) – Cardinalities of categorical features
d_token (int) – Token embedding dimension
bias (bool) – Whether to use bias in tokenization

forward(x_num, x_cat)

Convert features to token embeddings.

Tokenization Process:

Numerical Features:

\[\text{tokens}_{\text{num}} = x_{\text{num}} W_{\text{num}} + b_{\text{num}}\]

Categorical Features:

\[\text{tokens}_{\text{cat}} = \text{Embedding}(x_{\text{cat}} + \text{offsets})\]

CLS Token:

\[\text{tokens}_{\text{cls}} = W_{\text{cls}}\]
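The role of the offsets is easiest to see on a small example. The sketch below is plain Python and rests on two assumptions not stated in this section: all categorical features share one embedding table, with offsets equal to the cumulative cardinalities of the preceding features, and each numerical feature has its own weight and bias vector, as in the FT-Transformer design. The names `category_offsets` and `tokenize` are hypothetical helpers.

```python
from itertools import accumulate


def category_offsets(categories):
    """Start index of each categorical feature inside one shared embedding table."""
    return list(accumulate([0] + categories[:-1]))


def tokenize(x_num, x_cat, w_num, b_num, embedding, offsets, cls_token):
    """Return [CLS] + numerical tokens + categorical tokens for one sample."""
    # Each numerical feature scales its own weight vector and adds its own bias.
    num_tokens = [
        [x * w + b for w, b in zip(w_row, b_row)]
        for x, w_row, b_row in zip(x_num, w_num, b_num)
    ]
    # Shift raw category ids so that features do not collide in the table.
    cat_tokens = [embedding[c + off] for c, off in zip(x_cat, offsets)]
    return [cls_token] + num_tokens + cat_tokens


categories = [3, 2]                     # hypothetical cardinalities
offsets = category_offsets(categories)  # [0, 3]
# Toy embedding table with d_token = 2; row i holds the value i twice.
embedding = [[float(i), float(i)] for i in range(sum(categories))]
tokens = tokenize(
    x_num=[2.0],
    x_cat=[1, 1],
    w_num=[[1.0, -1.0]],
    b_num=[[0.5, 0.5]],
    embedding=embedding,
    offsets=offsets,
    cls_token=[0.0, 0.0],
)
```

Both categorical features carry the raw id 1, yet they map to different rows (1 and 4) of the shared table. The token count is 1 + 1 + 2 = 4, matching the CLS, numerical, and categorical terms counted by n_tokens.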

property n_tokens

Total number of tokens (numerical + categorical + CLS).

Returns:

int – Total token count

class TALENT.model.models.ftt.MultiheadAttention

Multi-head attention mechanism optimized for tabular data.

__init__(d, n_heads, dropout, bias)

Initialize multi-head attention.

Parameters:

d (int) – Input dimension
n_heads (int) – Number of attention heads
dropout (float) – Attention dropout probability
bias (bool) – Whether to use bias in projections

forward(x_q, x_kv, key_compression, value_compression)

Compute multi-head attention.

Parameters:

x_q (torch.Tensor) – Query input
x_kv (torch.Tensor) – Key and value input
key_compression (nn.Linear, optional) – Key compression layer
value_compression (nn.Linear, optional) – Value compression layer

Returns:

torch.Tensor – Attention output

Multi-Head Attention Mathematical Implementation:

Linear Projections:

\[Q = x_q W^Q, \quad K = x_{kv} W^K, \quad V = x_{kv} W^V\]

Scaled Dot-Product Attention:

\[\text{attention} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)\]

Output Computation:

\[\text{output} = \text{attention} \cdot V\]

Multi-Head Combination:

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O\]
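For a single head, the scaled dot-product step can be written out in plain Python. This is a minimal sketch with matrices as lists of rows; it omits the linear projections, dropout, and the optional key/value compression, and the function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """q: (n_q, d_k), k: (n_kv, d_k), v: (n_kv, d_v), all as lists of rows."""
    d_k = len(q[0])
    scores = [
        [sum(qi * ki for qi, ki in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        for q_row in q
    ]
    weights = [softmax(row) for row in scores]
    output = [
        [sum(w * v_row[j] for w, v_row in zip(w_row, v)) for j in range(len(v[0]))]
        for w_row in weights
    ]
    return output, weights


q = [[1.0, 0.0]]
k = [[1.0, 0.0], [1.0, 0.0]]  # identical keys give uniform attention
v = [[2.0, 0.0], [4.0, 2.0]]
attn_out, attn_weights = scaled_dot_product_attention(q, k, v)
```

With identical keys the query attends equally to both positions, so the output is the average of the two value rows.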

Advanced Tabular Models

TabNet

class TALENT.model.models.tabnet.TabNetClassifier(*args: Any, **kwargs: Any)

Bases: TabModel

compute_loss(y_pred, y_true)

Compute the loss.

Parameters

y_pred (torch.Tensor) – Score matrix
y_true (torch.Tensor) – Target matrix

Returns

Loss value

Return type

float

predict_func(outputs)

predict_proba(X)

Make predictions for classification on a batch (valid)

Parameters: X (torch.Tensor or scipy.sparse.csr_matrix) – Input data
Returns: res
Return type: np.ndarray

prepare_target(y)

Prepare target before training.

Parameters: y (torch.Tensor) – Target matrix.
Returns: Converted target matrix.
Return type: torch.Tensor

stack_batches(list_y_true, list_y_score)

update_fit_params(X_train, y_train, eval_set, weights)

Set attributes relative to fit function.

Parameters

X_train (np.ndarray) – Train set
y_train (np.array) – Train targets
eval_set (list of tuple) – List of eval tuple set (X, y).
weights (bool or dictionary) – 0 for no balancing, 1 for automated balancing

weight_updater(weights)

Updates weights dictionary according to target_mapper.

Parameters: weights (bool or dict) – Given weights for balancing training.
Returns: Same bool if weights are bool, updated dict otherwise.
Return type: bool or dict

class TALENT.model.models.tabnet.TabNetRegressor(*args: Any, **kwargs: Any)

Bases: TabModel

compute_loss(y_pred, y_true)

Compute the loss.

Parameters

y_pred (torch.Tensor) – Score matrix
y_true (torch.Tensor) – Target matrix

Returns

Loss value

Return type

float

predict_func(outputs)

prepare_target(y)

Prepare target before training.

Parameters: y (torch.Tensor) – Target matrix.
Returns: Converted target matrix.
Return type: torch.Tensor

stack_batches(list_y_true, list_y_score)

update_fit_params(X_train, y_train, eval_set, weights)

Set attributes relative to fit function.

Parameters

X_train (np.ndarray) – Train set
y_train (np.array) – Train targets
eval_set (list of tuple) – List of eval tuple set (X, y).
weights (bool or dictionary) – 0 for no balancing, 1 for automated balancing

class TALENT.model.models.tabnet.TabNetClassifier

Interpretable deep learning model with sequential attention mechanism for classification.

Mathematical Formulation:

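TabNet performs sequential feature selection through sparse attention masks:

Feature Selection at Step i:

\[M^{[i]} = \text{sparsemax}(P^{[i-1]} \odot h_i(a^{[i-1]}))\]

where \(h_i\) is the attentive transformer, \(a^{[i-1]}\) is the attention representation produced by the previous step, and \(P^{[i-1]}\) is the prior scale, which records how much each feature has already been used:

\[P^{[i]} = \prod_{j=1}^{i} (\gamma - M^{[j]})\]

Feature Processing:

\[[d^{[i]}, a^{[i]}] = f_i(M^{[i]} \odot x)\]

where \(f_i\) is the feature transformer, \(d^{[i]}\) is the decision output of step \(i\), and \(\gamma\) is the relaxation parameter. With \(\gamma = 1\) each feature can be selected at only one step; larger values allow reuse across steps.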
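The sparsemax projection behind the mask \(M^{[i]}\) differs from softmax in that it can assign exactly zero weight to features. A minimal plain-Python sketch of the standard sort-and-threshold algorithm for one score vector follows; it illustrates the operator only and is not the TabNet code.

```python
def sparsemax(z):
    """Euclidean projection of the score vector z onto the probability simplex."""
    z_sorted = sorted(z, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, z_k in enumerate(z_sorted, start=1):
        cumulative += z_k
        if 1.0 + k * z_k > cumulative:
            tau = (cumulative - 1.0) / k  # threshold for a support of size k
        else:
            break  # the support cannot grow any further
    return [max(v - tau, 0.0) for v in z]


mask = sparsemax([1.0, 0.8, 0.1])  # the weakest feature receives exactly zero
```

The output sums to 1 like a softmax, but low-scoring entries are cut to zero, which is what makes the per-step feature selection sparse and readable.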

__init__(n_steps, gamma, n_independent, n_shared, momentum, optimizer_params, scheduler_params, mask_type, lambda_sparse, seed)

Initialize TabNet classifier.

Parameters:

n_steps (int) – Number of decision steps
gamma (float) – Relaxation parameter for feature selection
n_independent (int) – Number of independent GLU layers per step
n_shared (int) – Number of shared GLU layers
momentum (float) – Momentum for batch normalization
optimizer_params (dict) – Optimizer configuration
scheduler_params (dict) – Learning rate scheduler parameters
mask_type (str) – Type of attention mask (‘sparsemax’ or ‘entmax’)
lambda_sparse (float) – Sparsity regularization coefficient
seed (int) – Random seed

fit(X_train, y_train, eval_set, eval_name, eval_metric, max_epochs, patience, batch_size, virtual_batch_size, num_workers, drop_last, callbacks)

Train the TabNet model.

Training Process:

Data Preprocessing: Handle categorical encoding and normalization
End-to-End Training: Optimize all decision steps jointly with gradient descent
Attention Regularization: Apply sparsity constraints on attention masks
Early Stopping: Monitor validation metrics for convergence

predict_proba(X)

Make probability predictions for classification.

Parameters:

X (torch.Tensor or scipy.sparse matrix) – Input features

Returns:

np.ndarray – Class probabilities of shape (n_samples, n_classes)

Prediction Process:

Forward Pass: Process through all decision steps
Attention Aggregation: Combine attention from all steps
Softmax Application: Convert logits to probabilities

\[P(y=k|x) = \frac{\exp(o_k)}{\sum_{j=1}^K \exp(o_j)}\]

where \(o_k\) is the raw output for class \(k\).

explain(X, normalize)

Generate feature importance explanations using attention masks.

Parameters:

X (torch.Tensor) – Input features
normalize (bool) – Whether to normalize importance scores

Returns:

np.ndarray – Feature importance matrix

Explanation Generation:

Attention masks from each decision step provide interpretable feature importance:

\[\text{importance}_{ij} = \frac{M^{[i]}_j}{\sum_{k=1}^{n_{\text{features}}} M^{[i]}_k}\]

class TALENT.model.models.tabnet.TabNetRegressor

TabNet for regression tasks with mean squared error optimization.

compute_loss(y_pred, y_true)

Compute mean squared error loss for regression.

MSE Loss Mathematical Definition:

\[\mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2\]

Tree-Based Neural Models

GRANDE (Gradient-Based Decision Tree Ensembles)

class TALENT.model.models.grande.GRANDE(*args: Any, **kwargs: Any)

Bases: Module

apply_preprocessing(X)

build_model()

entmax15(inputs, axis=-1)

entmax_threshold_and_support(inputs, axis=-1)

forward(inputs)

preprocess_data(X_train, y_train, X_val, y_val)

set_params(**kwargs)

class TALENT.model.models.grande.GRANDE

Neural network that mimics an ensemble of decision trees and learns all tree parameters with gradient descent.

Mathematical Formulation:

GRANDE simulates decision trees using neural operations with entmax for sparse selection.

__init__(batch_size, task_type, depth, n_estimators, dropout)

Initialize GRANDE model.

Parameters:

batch_size (int) – Training batch size
task_type (str) – ‘classification’ or ‘regression’
depth (int) – Maximum tree depth
n_estimators (int) – Number of tree estimators
dropout (float) – Dropout probability

forward(inputs)

Forward pass through the GRANDE ensemble.

Parameters:

inputs (torch.Tensor) – Input features

Returns:

torch.Tensor – Ensemble predictions

Tree Simulation Mathematical Implementation:

Split Decision Computation:

\[\text{node\_result} = \frac{\text{softsign}(s_1 - s_2) + 1}{2}\]

where \(s_1\) are learned split thresholds and \(s_2\) are feature values.

Path Probability Calculation:

\[p = \prod_{j} ((1-\text{path\_id}_j) \cdot \text{node\_result}_j + \text{path\_id}_j \cdot (1-\text{node\_result}_j))\]

Ensemble Output for Regression:

\[\text{output} = \sum_{e,l} w_e \cdot p_{e,l} \cdot v_{e,l}\]

where \(w_e\) are estimator weights, \(p_{e,l}\) are leaf probabilities, and \(v_{e,l}\) are leaf values.

Ensemble Output for Classification:

\[\text{output} = \sum_{e,l} w_e \cdot p_{e,l} \cdot \text{softmax}(v_{e,l})\]
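The split decision and path probability can be traced on a toy tree. The sketch below is plain Python with hypothetical helper names; for brevity both second-level nodes share one split result, so the four leaves are described by two node results.

```python
def softsign(x):
    return x / (1.0 + abs(x))


def node_result(s1, s2):
    """Soft split decision in (0, 1): (softsign(s1 - s2) + 1) / 2."""
    return (softsign(s1 - s2) + 1.0) / 2.0


def path_probability(path_ids, node_results):
    """Multiply, along one root-to-leaf path, the probability of each branch taken.

    path_id = 0 selects node_result, path_id = 1 selects 1 - node_result.
    """
    p = 1.0
    for path_id, result in zip(path_ids, node_results):
        p *= (1 - path_id) * result + path_id * (1.0 - result)
    return p


# Thresholds s1 compared against feature values s2 at two levels.
results = [node_result(1.0, 0.0), node_result(0.0, 0.0)]
leaf_probs = [
    path_probability(ids, results) for ids in ([0, 0], [0, 1], [1, 0], [1, 1])
]
```

The leaf probabilities sum to 1, so the ensemble output is a convex combination of leaf values within each tree before the estimator weights are applied.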

get_representation(inputs)

Extract intermediate tree representations for analysis.

Returns:

torch.Tensor – Tree path representations

Neural Oblivious Decision Ensembles (NODE)

class TALENT.model.models.node.NODE(*, d_in: int, num_layers: int, layer_dim: int, depth: int, tree_dim: int, choice_function: str, bin_function: str, d_out: int)

Bases: Module

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(x_num: Tensor, x_cat: Tensor) → Tensor

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class TALENT.model.models.node.NODE

Neural implementation of oblivious decision trees with differentiable splits.

__init__(*, d_in, num_layers, layer_dim, depth, tree_dim, choice_function, bin_function, d_out)

Initialize NODE architecture. All arguments are keyword-only.

Parameters:

d_in (int) – Input feature dimension
num_layers (int) – Number of NODE layers
layer_dim (int) – Number of trees in each layer
depth (int) – Tree depth
tree_dim (int) – Dimension of each tree's response
choice_function (str) – Function for feature selection (‘entmax15’)
bin_function (str) – Function for threshold selection (‘entmoid15’)
d_out (int) – Output dimension

forward(x_num, x_cat)

Forward pass through oblivious decision trees.

Decision Tree Mathematical Process:

Feature Selection: Use entmax for sparse feature selection
Threshold Comparison: Compare features with learned thresholds
Path Aggregation: Aggregate predictions along tree paths
Ensemble Combination: Combine outputs from multiple trees
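The four steps above can be sketched for a single soft oblivious tree. This is an illustration in plain Python, not the TALENT code: softmax stands in for entmax15 and a logistic sigmoid for entmoid15, so the sketch is dense where the real functions are sparse. All names are hypothetical.

```python
import math
from itertools import product


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def oblivious_tree(x, feature_logits, thresholds, leaf_values):
    """Soft oblivious tree of depth len(thresholds) with 2 ** depth leaf values."""
    go_right = []
    for logits, threshold in zip(feature_logits, thresholds):
        weights = softmax(logits)  # soft feature selection for this level
        selected = sum(w * v for w, v in zip(weights, x))
        go_right.append(sigmoid(selected - threshold))  # soft threshold comparison
    output = 0.0
    # Every leaf corresponds to one left/right choice per level.
    for leaf, bits in enumerate(product((0, 1), repeat=len(go_right))):
        leaf_weight = 1.0
        for bit, p_right in zip(bits, go_right):
            leaf_weight *= p_right if bit else 1.0 - p_right
        output += leaf_weight * leaf_values[leaf]
    return output


tree_out = oblivious_tree(
    x=[1.0, 1.0],
    feature_logits=[[0.0, 0.0], [0.0, 0.0]],
    thresholds=[1.0, 1.0],
    leaf_values=[0.0, 1.0, 2.0, 3.0],
)
```

Because every level of an oblivious tree shares one feature and one threshold, a tree of depth d needs only d soft comparisons to weight all 2^d leaves. In this example each comparison is a tie, so the output is the plain average of the leaf values.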

GrowNet (Gradient Boosting with Neural Networks)

class TALENT.model.models.grownet.DynamicNet(lr, categories: Optional[List[int]], d_embedding: Optional[int])

Bases: object

add(model)

embed_input(x_num, x_cat)

forward(x_num, x_cat)

forward_grad(x_num, x_cat)

classmethod from_file(path, builder)

parameters()

to_cuda()

to_double()

to_eval()

to_file(path)

to_train()

zero_grad()

class TALENT.model.models.grownet.ForwardType(value)

Bases: Enum

An enumeration.

CASCADE = 2

GRADIENT = 3

SIMPLE = 0

STACKED = 1

class TALENT.model.models.grownet.MLP_2HL(dim_in, dim_hidden1, dim_hidden2, dim_out, sparse=False, bn=True)

Bases: Module

forward(x, lower_f)

classmethod get_model(stage, opt)

class TALENT.model.models.grownet.SpLinear(input_features, output_features, bias=True)

Bases: Module

forward(input)

class TALENT.model.models.grownet.SpLinearFunc(*args, **kwargs)

Bases: Function

static backward(ctx, grad_output)

Defines a formula for differentiating the operation with backward mode automatic differentiation (alias to the vjp function).

This function is to be overridden by all subclasses.

It must accept a context ctx as the first argument, followed by as many outputs as the forward() returned (None will be passed in for non tensor outputs of the forward function), and it should return as many tensors, as there were inputs to forward(). Each argument is the gradient w.r.t the given output, and each returned value should be the gradient w.r.t. the corresponding input. If an input is not a Tensor or is a Tensor not requiring grads, you can just pass None as a gradient for that input.

The context can be used to retrieve tensors saved during the forward pass. It also has an attribute ctx.needs_input_grad as a tuple of booleans representing whether each input needs gradient. E.g., backward() will have ctx.needs_input_grad[0] = True if the first input to forward() needs gradient computed w.r.t. the output.

static forward(ctx, input, weight, bias=None)

This function is to be overridden by all subclasses. There are two ways to define forward:

Usage 1 (Combined forward and ctx):

@staticmethod
def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:
    pass

It must accept a context ctx as the first argument, followed by any number of arguments (tensors or other types).
See combining-forward-context for more details

Usage 2 (Separate forward and ctx):

@staticmethod
def forward(*args: Any, **kwargs: Any) -> Any:
    pass

@staticmethod
def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:
    pass

The forward no longer accepts a ctx argument.
Instead, you must also override the torch.autograd.Function.setup_context() staticmethod to handle setting up the ctx object. output is the output of the forward, inputs are a Tuple of inputs to the forward.
See extending-autograd for more details

The context can be used to store arbitrary data that can then be retrieved during the backward pass. Tensors should not be stored directly on ctx (though this is not currently enforced for backward compatibility). Instead, tensors should be saved either with ctx.save_for_backward() if they are intended to be used in backward (equivalently, vjp) or ctx.save_for_forward() if they are intended to be used in jvp.

class TALENT.model.models.grownet.GrowNet

Gradient boosting framework with neural network weak learners.

__init__(input_dim, output_dim, boost_rate, layers_per_net, layer_dims, dropout)

Initialize GrowNet with neural weak learners.

Gradient Boosting Process:

Weak Learner Training: Train neural networks on residuals
Boosting Update: Add weak learners with adaptive weights
Gradient Computation: Compute gradients for next weak learner

forward(x)

Forward pass through the boosted ensemble.

Boosting Mathematical Formulation:

\[F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)\]

where \(h_m\) is the m-th weak learner and \(\gamma_m\) is the boosting rate.
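The additive update can be traced with deliberately trivial weak learners. In the sketch below each learner is a constant fitted to the current residuals, which for squared loss equal the negative gradient, and the boosting rate is fixed at 0.5. Both choices are assumptions made for illustration; `fit_constant` is a hypothetical stand-in for training a small neural network.

```python
def fit_constant(residuals):
    """Hypothetical weak learner: predicts the mean residual for every input."""
    mean = sum(residuals) / len(residuals)
    return lambda x: mean


inputs = [0.0, 1.0, 2.0]
targets = [1.0, 2.0, 3.0]
predictions = [0.0] * len(targets)  # F_0(x) = 0
boost_rate = 0.5                    # gamma_m, held constant here
learners = []

for _ in range(3):
    residuals = [y - f for y, f in zip(targets, predictions)]
    h = fit_constant(residuals)  # train h_m on what F_{m-1} still gets wrong
    learners.append(h)
    # F_m(x) = F_{m-1}(x) + gamma_m * h_m(x)
    predictions = [f + boost_rate * h(x) for f, x in zip(predictions, inputs)]
```

After three rounds the predictions have moved from 0 to 1.75, halving the remaining mean residual each round.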

Distance-Based Models

Modern Neighborhood Components Analysis (ModernNCA)

class TALENT.model.models.modernNCA.MLP_Block(*args: Any, **kwargs: Any)

Bases: Module

forward(x: torch.Tensor) → torch.Tensor

class TALENT.model.models.modernNCA.ModernNCA(*args: Any, **kwargs: Any)

Bases: Module

forward(x, y, candidate_x, candidate_y, is_train)

make_layer()

class TALENT.model.models.modernNCA.ModernNCA

Model inspired by Neighborhood Components Analysis (NCA) that predicts from distances in a learned embedding space.

Mathematical Formulation:

ModernNCA learns embeddings for distance-based classification.

__init__(d_in, d_out, k, dropout, d_embedding)

Initialize ModernNCA model.

Parameters:

d_in (int) – Input feature dimension
d_out (int) – Output dimension (number of classes)
k (int) – Number of nearest neighbors to consider
dropout (float) – Dropout probability
d_embedding (int) – Embedding dimension

forward(x, y, candidate_x, candidate_y, is_train)

Forward pass with neighborhood analysis.

Parameters:

x (torch.Tensor) – Query features
y (torch.Tensor) – Query labels
candidate_x (torch.Tensor) – Candidate features for nearest neighbor search
candidate_y (torch.Tensor) – Candidate labels
is_train (bool) – Training mode flag

Returns:

torch.Tensor – Distance-based predictions

Distance-Based Prediction Mathematical Implementation:

Embedding Computation:

\[e_i = f(x_i), \quad e_j = f(x_j)\]

where \(f\) is the learned embedding function.

Distance Computation:

\[d(x_i, x_j) = ||e_i - e_j||_2\]

Neighbor Weighting:

\[p_{ij} = \frac{\exp(-d(x_i, x_j))}{\sum_{k \neq i} \exp(-d(x_i, x_k))}\]

Final Prediction:

\[\hat{y}_i = \sum_j p_{ij} y_j\]
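The weighting and prediction steps can be sketched in plain Python for one query. The example assumes the embeddings have already been computed (the embedding function \(f\) is not modelled) and that candidate labels are one-hot; `soft_neighbor_prediction` is an illustrative name.

```python
import math


def soft_neighbor_prediction(query, candidates, candidate_labels):
    """Weight every candidate label by softmax(-distance) and sum.

    query: embedding e_i; candidates: embeddings e_j; candidate_labels: one-hot rows.
    """
    distances = [math.dist(query, c) for c in candidates]
    shift = min(distances)  # subtracting the smallest distance keeps exp() stable
    weights = [math.exp(-(d - shift)) for d in distances]
    total = sum(weights)
    n_classes = len(candidate_labels[0])
    return [
        sum(w * y[c] for w, y in zip(weights, candidate_labels)) / total
        for c in range(n_classes)
    ]


labels = [[1.0, 0.0], [0.0, 1.0]]
# Two equidistant candidates with different labels split the probability evenly.
tie = soft_neighbor_prediction([0.0, 0.0], [[1.0, 0.0], [-1.0, 0.0]], labels)
# A much closer candidate dominates the prediction.
near = soft_neighbor_prediction([0.0, 0.0], [[0.0, 0.0], [100.0, 0.0]], labels)
```

During training the query itself must be excluded from the candidates, which is the role of the \(k \neq i\) condition in the denominator.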

knn_prediction(x, candidate_x, candidate_y, k)

Make predictions using k-nearest neighbors in embedding space.

K-NN Process:

Distance Calculation: Compute distances in embedding space
Neighbor Selection: Find k nearest neighbors
Prediction Aggregation: Aggregate neighbor labels with distance weighting

Specialized Architectures

ExcelFormer (Semi-Permeable Attention)

class TALENT.model.models.excelformer.ExcelFormer(*, d_numerical: int, token_bias: bool, n_layers: int, d_token: int, n_heads: int, attention_dropout: float, ffn_dropout: float, residual_dropout: float, prenormalization: bool, kv_compression: Optional[float], kv_compression_sharing: Optional[str], d_out: int, init_scale: float = 0.1)

Bases: Module

ExcelFormer with all parameters initialized to small values.

initial function: v4

forward(x_num: Tensor, x_cat=None, mix_up: bool = False, beta=0.5, mtype='feat_mix') → Tensor

class TALENT.model.models.excelformer.MultiheadAttention(d: int, n_heads: int, dropout: float, init_scale: float = 0.01)

Bases: Module

forward(x_q: Tensor, x_kv: Tensor, key_compression: Optional[Linear], value_compression: Optional[Linear]) → Tensor

get_attention_mask(input_shape, device)

class TALENT.model.models.excelformer.Tokenizer(d_numerical: int, categories: Optional[List[int]], d_token: int, bias: bool)

Bases: Module

category_offsets: Optional[Tensor]

forward(x_num: Tensor) → Tensor

property n_tokens: int

TALENT.model.models.excelformer.attenuated_kaiming_uniform_(tensor, a=2.23606797749979, scale=1.0, mode='fan_in', nonlinearity='leaky_relu')

class TALENT.model.models.excelformer.ExcelFormer

Transformer with semi-permeable attention and mixup training capabilities.

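__init__(*, d_numerical, token_bias, n_layers, d_token, n_heads, attention_dropout, ffn_dropout, residual_dropout, prenormalization, kv_compression, kv_compression_sharing, d_out, init_scale=0.1)

Initialize ExcelFormer architecture. All arguments are keyword-only.

Parameters:

d_numerical (int) – Number of numerical features
token_bias (bool) – Whether the tokenizer uses a bias term
n_layers (int) – Number of transformer blocks
d_token (int) – Token embedding dimension
n_heads (int) – Number of attention heads
attention_dropout (float) – Attention dropout probability
ffn_dropout (float) – Feed-forward dropout probability
residual_dropout (float) – Residual connection dropout
prenormalization (bool) – Whether to use pre-normalization
kv_compression (float, optional) – Compression ratio for keys and values; None disables compression
kv_compression_sharing (str, optional) – How compression layers are shared when kv_compression is set
d_out (int) – Output dimension
init_scale (float) – Scale of the attenuated initialization (default: 0.1)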

forward(x_num, x_cat, mix_up, beta, mtype)

Forward pass with optional mixup augmentation.

Parameters:

x_num (torch.Tensor) – Numerical features
x_cat (torch.Tensor, optional) – Categorical features
mix_up (bool) – Whether to apply mixup
beta (float) – Mixup parameter (default: 0.5)
mtype (str) – Mixup type (‘feat_mix’, ‘hidden_mix’, ‘naive_mix’)

Returns:

torch.Tensor – Output predictions when mix_up is False
tuple – (output, feat_masks, shuffled_ids) when mix_up is True, for mixup training

Mixup Mathematical Implementation:

Feature Mixup:

\[\tilde{x} = \lambda x_i + (1-\lambda) x_j\]

Semi-Permeable Attention:

\[\text{Attention}_{\text{perm}}(Q, K, V) = \text{mask} \odot \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

mixup_process(x, beta, mtype)

Apply mixup augmentation to input features.

Mixup Types:

feat_mix: Feature-level mixing, where a mask (returned as feat_masks) decides per feature which of the two samples contributes
hidden_mix: Hidden representation mixing
naive_mix: Simple linear interpolation
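The simplest of the three variants, linear interpolation between a sample and a shuffled partner, can be sketched in plain Python. Drawing the mixing coefficient from a Beta(beta, beta) distribution is an assumption made here, following common mixup practice; `naive_mixup` is a hypothetical helper, not the TALENT function.

```python
import random


def naive_mixup(batch, beta=0.5, rng=random):
    """Return (mixed batch, lam, shuffled ids) with x~ = lam * x_i + (1 - lam) * x_j."""
    lam = rng.betavariate(beta, beta)  # assumed sampling scheme for the coefficient
    shuffled_ids = list(range(len(batch)))
    rng.shuffle(shuffled_ids)  # partner j for every sample i
    mixed = [
        [lam * a + (1.0 - lam) * b for a, b in zip(batch[i], batch[j])]
        for i, j in enumerate(shuffled_ids)
    ]
    return mixed, lam, shuffled_ids


rng = random.Random(0)
batch = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
mixed, lam, shuffled_ids = naive_mixup(batch, beta=0.5, rng=rng)
```

The shuffled indices are returned alongside the mixed inputs so that the training loss can mix the targets of each pair consistently.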

ProtoGate (Prototype-Based Gating)

class TALENT.model.models.protogate.DeactFunc(*args: Any, **kwargs: Any)

Bases: Module

forward(x)

class TALENT.model.models.protogate.GatingNet(*args: Any, **kwargs: Any)

Bases: Module

Gating Network for feature selection

Parameters

input_dim (int) – input dimension of the gating network
a (float) – coefficient in hard relu activation function
sigma (float) – std of the Gaussian reparameterization noise
activation (str) – activation function of the gating net: ‘relu’, ‘l_relu’, ‘sigmoid’, ‘tanh’, or ‘none’
hidden_layer_list (list) – number of nodes for each hidden layer of the gating net, example: [200,200]

forward(x)

get_stochastic_gate(alpha): This function replaced the feature_selector function in order to save Z

hard_sigmoid(x)

Segment-wise linear approximation of sigmoid. Faster than sigmoid. Returns 0 if x < -2.5, 1 if x > 2.5, and 0.2 * x + 0.5 for -2.5 <= x <= 2.5.

Parameters:

x – A tensor or variable.

Returns:

A tensor.
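The piecewise definition in this docstring can be written as a single clamp. The following is a minimal sketch of that definition only; it is a standalone function, not the GatingNet method, and it does not model the coefficient a listed among the GatingNet parameters.

```python
import torch


def hard_sigmoid(x):
    """Piecewise-linear sigmoid approximation as described in the docstring."""
    # 0.2 * x + 0.5 reaches 0 at x = -2.5 and 1 at x = 2.5; clamp handles both tails.
    return torch.clamp(0.2 * x + 0.5, min=0.0, max=1.0)


values = hard_sigmoid(torch.tensor([-3.0, -2.5, 0.0, 1.0, 2.5, 3.0]))
```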

class TALENT.model.models.protogate.HybridSort(*args: Any, **kwargs: Any)

Bases: Module

forward(scores: torch.Tensor): scores: elements to be sorted. Typical shape: batch_size x n x 1

class TALENT.model.models.protogate.KNNNet(*args: Any, **kwargs: Any)

Bases: Module

forward(query, neighbors, tau=1.0)

class TALENT.model.models.protogate.PL(*args: Any, **kwargs: Any)

Bases: Distribution

Parameters:

scores – Shape: (batch_size x) n
tau – Temperature for the relaxation. Scalar.
hard – Use straight-through estimation if True

arg_constraints = {'scores': torch.distributions.constraints.positive, 'tau': torch.distributions.constraints.positive}

has_rsample = True

log_prob(value): value: permutation matrix. shape: batch_size x n x n

property mean

relaxed_sort(inp): inp: elements to be sorted. Typical shape: batch_size x n x 1

rsample(sample_shape, log_score=True): sample_shape: number of samples from the PL distribution. Scalar.
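For intuition, assuming PL stands for a Plackett-Luce distribution over permutations (an assumption based on the class name and on log_prob taking a permutation matrix), a hard, non-relaxed sample can be drawn with the Gumbel trick: perturb the log-scores with Gumbel noise and sort. The function name below is hypothetical, and the sketch does not reproduce the class's relaxed_sort or rsample.

```python
import torch
import torch.nn.functional as F


def sample_pl_permutation(scores):
    """Hypothetical sketch: draw one Plackett-Luce permutation for positive scores of shape (n,)."""
    # Gumbel(0, 1) noise via inverse-CDF sampling; clamp avoids log(0).
    u = torch.rand(scores.shape).clamp_min(1e-20)
    gumbel = -torch.log(-torch.log(u))
    # Sorting the perturbed log-scores in descending order yields a PL sample.
    perm = torch.argsort(torch.log(scores) + gumbel, dim=-1, descending=True)
    # Row r of the permutation matrix is one-hot at the item ranked r-th.
    perm_matrix = F.one_hot(perm, num_classes=scores.shape[-1]).float()
    return perm, perm_matrix


scores = torch.tensor([3.0, 1.0, 0.5, 2.0])
perm, perm_matrix = sample_pl_permutation(scores)
```

Items with larger scores are more likely to be ranked first, but any ordering has non-zero probability.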

TALENT.model.models.protogate.get_activation(value)

class TALENT.model.models.protogate.ProtoGate

Prototype-based model with gating mechanisms for interpretable feature selection.

__init__(input_dim, output_dim, n_prototypes, n_components, dropout)

Initialize ProtoGate architecture.

Parameters:

input_dim (int) – Input feature dimension
output_dim (int) – Output dimension
n_prototypes (int) – Number of learned prototypes
n_components (int) – Number of components per prototype
dropout (float) – Dropout probability

forward(x)

Forward pass with prototype-based gating.

Prototype-Based Processing:

Prototype Computation: Learn representative prototypes from data
Distance Calculation: Compute distances to prototypes
Gate Generation: Use distances to generate feature gates
Feature Selection: Apply gates for adaptive feature selection

class TALENT.model.models.protogate.GatingNet

Gating network for prototype-based feature selection.

hard_sigmoid(x)

Hard sigmoid activation for efficient gating.

Hard Sigmoid Mathematical Definition:

\[\text{hard_sigmoid}(x) = \max(0, \min(1, 0.2x + 0.5))\]

This provides a piecewise linear approximation to the sigmoid function for computational efficiency.

forward(x): Generate gating weights for feature selection.

Retrieval-Based Models

TabR (Tabular Retrieval)

class TALENT.model.models.tabr.TabR(*args: Any, **kwargs: Any)

Bases: Module

forward(*, x_num: torch.Tensor, x_cat: Optional[torch.Tensor], y: Optional[torch.Tensor], candidate_x_num: Optional[torch.Tensor], candidate_x_cat: Optional[torch.Tensor], candidate_y: torch.Tensor, context_size: int, is_train: bool) → torch.Tensor

reset_parameters()

class TALENT.model.models.tabr.TabR

KNN-attention hybrid model with retrieval-based predictions.

__init__(n_num_features, n_cat_features, n_classes, context_size, normalization, num_embeddings, d_main, d_multiplier, encoder_n_blocks, predictor_n_blocks, mixer_normalization, dropout0, dropout1, activation)

Initialize TabR architecture.

Parameters:

n_num_features (int) – Number of numerical features
n_cat_features (int) – Number of categorical features
n_classes (int) – Number of output classes
context_size (int) – Maximum context size for retrieval
normalization (str) – Normalization type
num_embeddings (dict) – Embedding configurations
d_main (int) – Main hidden dimension
d_multiplier (int) – Dimension multiplier
encoder_n_blocks (int) – Number of encoder blocks
predictor_n_blocks (int) – Number of predictor blocks
mixer_normalization (str) – Mixer normalization type
dropout0 (float) – Input dropout
dropout1 (float) – Hidden dropout
activation (str) – Activation function

forward(*, x_num, x_cat, y, candidate_x_num, candidate_x_cat, candidate_y, context_size, is_train)

Forward pass with retrieval-based attention.

Retrieval Process:

Context Selection: Select relevant examples from training set
Attention Computation: Apply attention over retrieved candidates
Feature Processing: Process query and candidate features
Prediction Generation: Combine retrieval and learned representations
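The retrieval steps above can be illustrated with a small sketch. It assumes that similarity is the negative squared Euclidean distance between already-encoded query and candidate representations, and that the context_size nearest candidates are combined by softmax weights. The function and argument names are hypothetical, and the sketch leaves out the encoder, the label embeddings, and the predictor blocks of the actual TabR module.

```python
import torch
import torch.nn.functional as F


def retrieve_and_aggregate(query, candidates, candidate_values, context_size):
    """Hypothetical sketch of nearest-neighbour retrieval followed by attention.

    query: (B, d), candidates: (N, d), candidate_values: (N, d_v).
    """
    # Squared Euclidean distance between every query and every candidate: (B, N).
    dist = torch.cdist(query, candidates) ** 2
    # top-k of the negated distances selects the context_size nearest candidates.
    similarity, idx = torch.topk(-dist, k=context_size, dim=-1)
    # Attention weights over the retrieved context: (B, context_size).
    weights = F.softmax(similarity, dim=-1)
    # Weighted sum of the retrieved values: (B, d_v).
    context = candidate_values[idx]
    return (weights.unsqueeze(-1) * context).sum(dim=1), idx


query = torch.tensor([[0.0, 0.0]])
candidates = torch.tensor([[0.0, 0.0], [100.0, 100.0]])
candidate_values = torch.tensor([[1.0], [5.0]])
out_k1, idx_k1 = retrieve_and_aggregate(query, candidates, candidate_values, context_size=1)
out_k2, idx_k2 = retrieve_and_aggregate(query, candidates, candidate_values, context_size=2)
```

In this toy case the far candidate receives a weight of essentially zero, so both calls return a value close to that of the nearest candidate.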

Foundation Models

TabPFN (Tabular Prior-Fitting Networks)

class TALENT.model.models.tabpfn.TabPFNClassifier(*args: Any, **kwargs: Any)

Bases: BaseEstimator, ClassifierMixin

Initializes the classifier and loads the model. Depending on the arguments, the model is either loaded from memory, from a file, or downloaded from the repository if no model is found.

Can also be used to compute gradients with respect to the inputs X_train and X_test. To do so, no_grad has to be set to False and no_preprocess_mode must be True. Furthermore, X_train and X_test need to be given as torch.Tensors and their requires_grad attribute must be set to True.

Parameters

device – If the model should run on cuda or cpu.
base_path – Base path of the directory, from which the folders like models_diff can be accessed.
model_string – Name of the model. Used first to check if the model is already in memory, and if not, tries to load a model with that name from the models_diff directory. It looks for files named as follows: “prior_diff_real_checkpoint” + model_string + “_n_0_epoch_e.cpkt”, where e can be a number between 100 and 0, and is checked in a descending order.
N_ensemble_configurations – The number of ensemble configurations used for the prediction. Thereby the accuracy, but also the running time, increases with this number.
no_preprocess_mode – Specifies whether preprocessing is to be performed.
multiclass_decoder – If set to permutation, randomly shifts the classes for each ensemble configuration.
feature_shift_decoder – If set to true shifts the features for each ensemble configuration according to a random permutation.
only_inference – Indicates if the model should be loaded to only restore inference capabilities or also training capabilities. Note that the training capabilities are currently not being fully restored.
seed – Seed that is used for the prediction. Allows for a deterministic behavior of the predictions.
batch_size_inference – This parameter is a trade-off between performance and memory consumption. The computation done with different values for batch_size_inference is the same, but it is split into smaller/larger batches.
no_grad – If set to false, allows for the computation of gradients with respect to X_train and X_test. For this to function correctly, no_preprocess_mode must be set to true.
subsample_features – If set to true and the number of features in the dataset exceeds self.max_features (100), the features are subsampled to self.max_features.

fit(X, y, overwrite_warning=False)

Validates the training set and stores it.

If clf.no_grad is True (the default), X and y should be of type np.array. Otherwise, X should be of type torch.Tensor (y can be np.array or torch.Tensor).

load_result_minimal(path, i, e)

models_in_memory = {}

predict(X, return_winning_probability=False, normalize_with_test=False)

predict_proba(X, normalize_with_test=False, return_logits=False)

Predict the probabilities for the input X depending on the training set previously passed in the method fit.

If no_grad is true in the classifier the function takes X as a numpy.ndarray. If no_grad is false X must be a torch tensor and is not fully checked.

remove_models_from_memory()

class TALENT.model.models.tabpfn.TabPFNClassifier

Prior-fitting network for zero-shot tabular classification.

__init__(device, base_path)

Initialize TabPFN with pre-trained weights.

Foundation Model Features:

Pre-trained on synthetic tabular datasets sampled from a prior
No gradient-based training required
Immediate deployment capability
Context-based learning from examples

fit(X, y)

Fit the model using in-context learning (no parameter updates).

In-Context Learning Process:

Context Setup: Store training examples as context
No Weight Updates: Model weights remain frozen
Context Encoding: Encode training data for reference

predict_proba(X)

Make predictions using in-context learning.

Zero-Shot Prediction:

Context Retrieval: Use stored training context
Attention Mechanism: Apply attention over training examples
Prediction Generation: Generate predictions without fine-tuning

Regularization Methods

TANGOS Regularization

class TALENT.model.models.tangos.Tangos(*args: Any, **kwargs: Any)

Bases: Module

cal_representation(x)

cal_tangos_loss(x)

forward(x, x_cat)

class TALENT.model.models.tangos.Tangos

MLP with TANGOS regularization for neuron specialization.

Mathematical Formulation:

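TANGOS (Tabular Neural Gradient Orthogonalization and Specialization) regularizes the gradient attributions of hidden neurons with respect to the input features. A specialization term encourages each neuron to depend on a sparse set of features, and an orthogonalization term encourages different neurons to depend on different features:

\[\mathcal{L}_{\text{TANGOS}} = \mathcal{L}_{\text{task}} + \lambda_1 \mathcal{L}_{\text{spec}} + \lambda_2 \mathcal{L}_{\text{orth}}\]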

__init__(d_in, d_out, d_layers, dropout, lambda1, lambda2)

Initialize TANGOS-regularized MLP.

Parameters:

d_in (int) – Input dimension
d_out (int) – Output dimension
d_layers (List[int]) – Hidden layer dimensions
dropout (float) – Dropout probability
lambda1 (float) – Specialization regularization weight
lambda2 (float) – Orthogonalization regularization weight

forward(x, x_cat=None): Forward pass with standard MLP architecture.

cal_representation(x)

Calculate intermediate representations for regularization.

Parameters:

x (torch.Tensor) – Input features

Returns:

torch.Tensor – Hidden representations before final layer

Representation Extraction Process:

The method extracts intermediate representations by stopping before the final layer:

for i, layer in enumerate(self.layers):
    x = layer(x)
    x = F.relu(x)
    if self.dropout and i != len(self.layers) - 1:
        x = F.dropout(x, self.dropout, self.training)
return x  # Return before final head layer

Regularization Applications:

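Specialization Regularization: Penalizes the magnitude of each neuron's input-gradient attributions so that each neuron relies on few features
Orthogonalization Regularization: Penalizes the similarity between the attributions of different neurons so that neurons rely on different features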
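As a rough illustration of how penalties of this kind can be computed from the hidden representation, the sketch below takes the Jacobian of the representation with respect to the input and derives two quantities in the spirit of the TANGOS paper: a sparsity (specialization) term and a pairwise-similarity (orthogonalization) term. The function name tangos_penalties, the use of the mean absolute attribution, and the absolute cosine similarity are assumptions of this sketch; it is not the body of cal_tangos_loss.

```python
import torch
import torch.nn.functional as F


def tangos_penalties(rep_fn, x):
    """Hypothetical sketch: return (specialization, orthogonalization) penalties.

    rep_fn maps (B, d_in) -> (B, H) and must treat samples independently
    (no batch norm), with H >= 2.
    """
    # Summing over the batch yields per-sample Jacobians in one call: (H, B, d_in).
    jac = torch.autograd.functional.jacobian(
        lambda inp: rep_fn(inp).sum(dim=0), x, create_graph=True
    )
    attr = jac.permute(1, 0, 2)  # (B, H, d_in): attribution of each neuron
    # Specialization: mean absolute attribution (a scaled L1 norm).
    spec = attr.abs().mean()
    # Orthogonalization: mean absolute cosine similarity over neuron pairs i < j.
    unit = F.normalize(attr, dim=-1)
    cos = unit @ unit.transpose(1, 2)  # (B, H, H)
    h = attr.shape[1]
    i, j = torch.triu_indices(h, h, offset=1)
    orth = cos[:, i, j].abs().mean()
    return spec, orth


x = torch.randn(4, 2)
w_diag = torch.tensor([[1.0, 0.0], [0.0, 2.0]])  # neurons use disjoint features
w_same = torch.tensor([[1.0, 1.0], [1.0, 1.0]])  # neurons use identical features
spec_diag, orth_diag = tangos_penalties(lambda inp: inp @ w_diag.T, x)
spec_same, orth_same = tangos_penalties(lambda inp: inp @ w_same.T, x)
```

Under these assumptions, a training loss would add lambda1 * spec + lambda2 * orth to the task loss, with rep_fn playing the role of cal_representation.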

Activation Functions Reference

Standard Activations:

\[\text{ReLU}(x) = \max(0, x)\]

\[\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]\]

\[\begin{split}\text{SELU}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases}\end{split}\]

Gated Activations:

\[\text{ReGLU}(x) = a \cdot \text{ReLU}(b) \text{ where } [a, b] = \text{split}(x)\]

\[\text{GeGLU}(x) = a \cdot \text{GELU}(b) \text{ where } [a, b] = \text{split}(x)\]

Probability Functions:

\[\text{Softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^K \exp(x_j)}\]

\[\text{Sparsemax}(z) = \arg\min_{p \in \Delta^{K-1}} ||p - z||_2^2\]

where \(\Delta^{K-1}\) is the probability simplex.
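The gated activations and sparsemax above can be sketched in a few lines of PyTorch. This is an illustrative sketch rather than TALENT's own implementation: the function names are hypothetical, the split is assumed to be into two equal halves along the last dimension, and sparsemax uses the standard sort-based projection onto the simplex.

```python
import torch
import torch.nn.functional as F


def reglu(x):
    # [a, b] = split(x); ReGLU(x) = a * ReLU(b)
    a, b = x.chunk(2, dim=-1)
    return a * F.relu(b)


def geglu(x):
    # [a, b] = split(x); GeGLU(x) = a * GELU(b)
    a, b = x.chunk(2, dim=-1)
    return a * F.gelu(b)


def sparsemax(z):
    """Euclidean projection of z onto the probability simplex (last dimension)."""
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    k = torch.arange(1, z.shape[-1] + 1, dtype=z.dtype, device=z.device)
    cumsum = z_sorted.cumsum(dim=-1)
    # Size of the support: number of sorted entries with 1 + k * z_(k) > cumsum_k.
    support_size = (1 + k * z_sorted > cumsum).sum(dim=-1, keepdim=True)
    # Threshold tau chosen so that the clipped outputs sum to one.
    tau = (cumsum.gather(-1, support_size - 1) - 1) / support_size.to(z.dtype)
    return torch.clamp(z - tau, min=0)


gated = reglu(torch.tensor([1.0, 2.0, -1.0, 3.0]))
p_tie = sparsemax(torch.tensor([0.5, 0.5]))
p_sparse = sparsemax(torch.tensor([2.0, 0.0]))
```

Unlike softmax, sparsemax can assign exactly zero probability to low-scoring entries, as in the last example.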

Model Usage Examples

Basic MLP Usage:

import torch

from TALENT.model.models.mlp import MLP

# Initialize MLP
model = MLP(
    d_in=10,           # Input dimension
    d_out=3,           # Output dimension (3 classes)
    d_layers=[64, 32], # Hidden layer sizes
    dropout=0.1        # Dropout probability
)

# Forward pass
x = torch.randn(32, 10)  # Batch of 32 samples, 10 features
output = model(x)        # Shape: (32, 3)

ResNet with Advanced Activations:

from TALENT.model.models.resnet import ResNet

# Initialize ResNet with GeGLU activation
model = ResNet(
    d_in=15,
    d_out=1,                    # Regression task
    d=128,                      # Hidden dimension
    d_hidden_factor=2.0,        # Hidden expansion factor
    n_layers=4,                 # Number of residual blocks
    activation='geglu',         # GeGLU activation
    normalization='layernorm',  # Layer normalization
    hidden_dropout=0.1,
    residual_dropout=0.1
)

FT-Transformer with Mixed Features:

from TALENT.model.models.ftt import Transformer

# Initialize FT-Transformer
model = Transformer(
    d_numerical=8,          # 8 numerical features
    categories=[5, 10, 3],  # 3 categorical features with cardinalities
    d_token=64,             # Token dimension
    n_layers=3,             # Number of transformer layers
    n_heads=8,              # Attention heads
    d_ffn_factor=2.0,       # FFN expansion factor
    attention_dropout=0.1,
    ffn_dropout=0.1,
    residual_dropout=0.1,
    activation='reglu',
    prenormalization=True,
    d_out=5                 # 5 classes
)

TabNet for Interpretable Classification:

from TALENT.model.models.tabnet import TabNetClassifier

# Initialize TabNet
model = TabNetClassifier(
    n_steps=3,              # Decision steps
    gamma=1.3,              # Relaxation parameter
    n_independent=2,        # Independent GLU layers
    n_shared=2,             # Shared GLU layers
    momentum=0.02,          # Batch norm momentum
    lambda_sparse=1e-3      # Sparsity regularization
)

# Training
model.fit(X_train, y_train,
          eval_set=[(X_val, y_val)],
          max_epochs=100)

# Get predictions and explanations
predictions = model.predict_proba(X_test)
explanations = model.explain(X_test, normalize=True)

GRANDE for Tree-like Neural Networks:

from TALENT.model.models.grande import GRANDE

# Initialize GRANDE
model = GRANDE(
    batch_size=64,
    task_type='classification',
    depth=4,              # Tree depth
    n_estimators=10,      # Number of trees
    dropout=0.1
)

ModernNCA with Distance-Based Learning:

from TALENT.model.models.modernNCA import ModernNCA

# Initialize ModernNCA
model = ModernNCA(
    d_in=15,
    d_out=4,              # 4 classes
    k=32,                 # Number of neighbors
    dropout=0.1,
    d_embedding=64        # Embedding dimension
)

# Training requires candidate examples
output = model(x, y, candidate_x, candidate_y, is_train=True)

ExcelFormer with Mixup Training:

from TALENT.model.models.excelformer import ExcelFormer

# Initialize ExcelFormer
model = ExcelFormer(
    d_numerical=10,
    d_token=64,
    n_blocks=3,
    attention_dropout=0.1,
    ffn_dropout=0.1,
    residual_dropout=0.1,
    d_out=3
)

# Forward pass with feature mixup
output, masks, shuffled_ids = model(
    x_num,
    mix_up=True,
    beta=0.5,
    mtype='feat_mix'
)

TabPFN for Zero-Shot Learning:

from TALENT.model.models.tabpfn import TabPFNClassifier

# Initialize pre-trained TabPFN
model = TabPFNClassifier(device='cuda')

# No training required - just fit context
model.fit(X_train, y_train)

# Immediate predictions
predictions = model.predict_proba(X_test)

Model Selection Guidelines

For Beginners:

MLP: Simple, fast, good baseline
ResNet: Better than MLP for deeper networks

For Best Performance:

FT-Transformer: State-of-the-art on many datasets
TabNet: Excellent performance with interpretability
ModernNCA: Strong embedding-based performance

For Interpretability:

TabNet: Attention-based feature importance
GRANDE: Tree-like decision process
ProtoGate: Prototype-based explanations

For Speed:

MLP: Fastest training and inference
SNN: Lightweight with self-normalization
TabPFN: No training required

For Specific Scenarios:

TabR: Retrieval-based learning
ExcelFormer: Complex feature interactions with mixup
TANGOS: When regularization is critical