Models
Deep learning models for tabular data, implementing various state-of-the-art architectures.
This section documents the neural network architectures implemented in TALENT, from simple MLPs to transformer-based models designed specifically for tabular data. Each entry lists the constructor parameters, describes the forward pass, and gives the underlying mathematical formulation.
Basic Neural Networks
Multi-Layer Perceptron (MLP)
- class TALENT.model.models.mlp.MLP
Simple feedforward neural network with multiple fully connected layers and ReLU activations.
Mathematical Formulation:
For input \(x \in \mathbb{R}^{d_{in}}\), the MLP computes:
\[\begin{split}h_0 &= x \\ h_i &= \text{ReLU}(\text{Linear}(h_{i-1})) = \text{ReLU}(W_i h_{i-1} + b_i) \\ \text{output} &= W_{\text{head}} h_L + b_{\text{head}}\end{split}\]
where \(L\) is the number of hidden layers.
- __init__(d_in, d_out, d_layers, dropout)
Initialize the MLP architecture.
Parameters:
d_in (int) – Input feature dimension
d_out (int) – Output dimension (number of classes for classification, 1 for regression)
d_layers (List[int]) – Hidden layer dimensions, e.g., [64, 32] for two hidden layers
dropout (float) – Dropout probability applied after each hidden layer
Architecture Construction:
Hidden Layers: Creates nn.Linear layers with dimensions specified in d_layers
Output Head: Final linear layer mapping to output dimension
Dropout Setup: Configures dropout for regularization during training
- forward(x, x_cat=None)
Forward pass through the MLP network.
Parameters:
x (torch.Tensor) – Input numerical features of shape (batch_size, d_in)
x_cat (torch.Tensor, optional) – Categorical features (not used in MLP, maintained for interface consistency)
Returns:
torch.Tensor – Output predictions of shape (batch_size, d_out), or (batch_size,) when d_out == 1 (e.g., regression)
Forward Pass Implementation:
    for layer in self.layers:
        x = layer(x)        # Linear: x = W @ x + b
        x = F.relu(x)       # ReLU: x = max(0, x)
        if self.dropout:
            x = F.dropout(x, self.dropout, self.training)
    logit = self.head(x)    # Final output layer
    if self.d_out == 1:
        logit = logit.squeeze(-1)  # For regression
ReLU Activation:
\[\text{ReLU}(x) = \max(0, x)\]
Dropout Regularization:
During training, each element is set to zero with probability \(p\) (the dropout argument), and the surviving elements are scaled by \(1/(1-p)\) so that the expected value is unchanged:
\[\begin{split}\text{Dropout}(x) = \begin{cases} \frac{x}{1-p} & \text{with probability } 1-p \\ 0 & \text{with probability } p \end{cases}\end{split}\]
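The MLP forward pass and the dropout rule above can be traced by hand with a small standard-library sketch. It is only an illustration, not TALENT's implementation: the helper names (`linear`, `dropout`, `mlp_forward`) and the toy weights are assumptions made for this example, and a single sample is processed as a plain Python list instead of a batched tensor.

```python
import random

def linear(x, weight, bias):
    """Affine map W x + b for one sample; weight has shape (out, in)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def dropout(x, p, training, rng=random):
    """Inverted dropout: zero with probability p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(x)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in x]

def mlp_forward(x, layers, head, p=0.0, training=False):
    """layers: list of (weight, bias) pairs; head: one (weight, bias) pair."""
    h = x
    for weight, bias in layers:
        h = dropout(relu(linear(h, weight, bias)), p, training)
    out = linear(h, *head)
    # Mimic squeeze(-1): return a scalar when d_out == 1.
    return out[0] if len(out) == 1 else out

# Hypothetical toy weights: d_in = 2, one hidden layer of width 2, d_out = 1.
layers = [([[1.0, -1.0], [0.5, 0.5]], [0.0, -2.0])]
head = ([[2.0, 1.0]], [0.5])
y = mlp_forward([3.0, 1.0], layers, head)
print(y)  # 4.5
```

With dropout disabled (evaluation mode), the hidden activations are ReLU([2, 0]) = [2, 0], so the head returns 2 * 2 + 1 * 0 + 0.5 = 4.5.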
Residual Network (ResNet)
- class TALENT.model.models.resnet.ResNet(*args: Any, **kwargs: Any)
Bases: Module
- forward(x: torch.Tensor, x_cat: torch.Tensor) → torch.Tensor
- TALENT.model.models.resnet.geglu(x)
- TALENT.model.models.resnet.get_activation_fn(name)
- TALENT.model.models.resnet.get_nonglu_activation_fn(name)
- TALENT.model.models.resnet.reglu(x)
- class TALENT.model.models.resnet.ResNet
Deep residual network for tabular data. Its skip connections mitigate vanishing gradients in deep architectures.
Mathematical Formulation:
ResNet uses residual blocks with skip connections:
\[h_{i+1} = h_i + F(h_i, W_i)\]
where \(F(h_i, W_i)\) is the residual function.
- __init__(d_in, d, d_hidden_factor, n_layers, activation, normalization, hidden_dropout, residual_dropout, d_out)
Initialize the ResNet architecture with configurable components.
Parameters:
d_in (int) – Input feature dimension
d (int) – Hidden dimension for residual blocks
d_hidden_factor (float) – Factor to scale hidden layer width within blocks
n_layers (int) – Number of residual blocks
activation (str) – Activation function (‘relu’, ‘gelu’, ‘reglu’, ‘geglu’)
normalization (str) – Normalization type (‘batchnorm’, ‘layernorm’)
hidden_dropout (float) – Dropout probability within residual blocks
residual_dropout (float) – Dropout probability for residual connections
d_out (int) – Output dimension
- forward(x, x_cat=None)
Forward pass through the ResNet architecture.
Parameters:
x (torch.Tensor) – Input numerical features
x_cat (torch.Tensor, optional) – Categorical features (not used)
Returns:
torch.Tensor – Output predictions
Residual Block Mathematical Implementation:
For each residual block, the computation follows:
\[\begin{split}\text{residual} &= \text{Norm}(h_i) \\ \text{residual} &= \text{Linear}(\text{residual}) \\ \text{residual} &= \text{Activation}(\text{residual}) \\ \text{residual} &= \text{Dropout}(\text{residual}) \\ \text{residual} &= \text{Linear}(\text{residual}) \\ \text{residual} &= \text{Dropout}(\text{residual}) \\ h_{i+1} &= h_i + \text{residual}\end{split}\]
Activation Functions:
ReLU: \(\text{ReLU}(x) = \max(0, x)\)
GELU: \(\text{GELU}(x) = x \cdot \Phi(x)\)
ReGLU: \(\text{ReGLU}(x) = a \cdot \text{ReLU}(b)\) where \(a, b = \text{split}(x)\)
GeGLU: \(\text{GeGLU}(x) = a \cdot \text{GELU}(b)\) where \(a, b = \text{split}(x)\)
- reglu(x)
ReGLU activation function for gated linear units.
Mathematical Definition:
\[\text{ReGLU}(x) = a \cdot \text{ReLU}(b)\]
where \(a\) and \(b\) are obtained by splitting \(x\) along the last dimension.
- geglu(x)
GeGLU activation function combining gating with GELU.
Mathematical Definition:
\[\text{GeGLU}(x) = a \cdot \text{GELU}(b)\]
where \(a\) and \(b\) are obtained by splitting \(x\) along the last dimension.
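To make the gated activations and the skip connection concrete, here is a minimal standard-library sketch that follows the definitions above for a single feature vector. The function bodies are assumptions written for illustration (plain lists instead of tensors, an exact erf-based GELU), not the code in TALENT.model.models.resnet.

```python
import math

def gelu(v):
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return v * 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def reglu(x):
    """Split x into halves (a, b) along its last axis; return a * ReLU(b)."""
    half = len(x) // 2
    a, b = x[:half], x[half:]
    return [ai * max(0.0, bi) for ai, bi in zip(a, b)]

def geglu(x):
    """Split x into halves (a, b) along its last axis; return a * GELU(b)."""
    half = len(x) // 2
    a, b = x[:half], x[half:]
    return [ai * gelu(bi) for ai, bi in zip(a, b)]

def residual_block(h, inner):
    """h_{i+1} = h_i + F(h_i): add the block output back onto its input."""
    return [hi + ri for hi, ri in zip(h, inner(h))]

x = [1.0, 2.0, 3.0, -4.0]   # a = [1, 2], b = [3, -4]
print(reglu(x))             # [3.0, 0.0]
print(residual_block([1.0, 2.0], lambda h: [0.5 * v for v in h]))  # [1.5, 3.0]
```

Because the gated variants halve the last dimension, a layer feeding ReGLU or GeGLU has to produce twice as many units as the activation should return.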
Self-Normalizing Network (SNN)
- class TALENT.model.models.snn.SNN(*args: Any, **kwargs: Any)
Bases: Module
- calculate_output(x: torch.Tensor) → torch.Tensor
- encode(x_num, x_cat)
- forward(x_num: torch.Tensor, x_cat) → torch.Tensor
- class TALENT.model.models.snn.SNN
Lightweight neural network with self-normalizing properties using SELU activation.
- __init__(d_in, d_out, d_layers, dropout)
Initialize SNN with SELU activations for self-normalization.
Parameters:
d_in (int) – Input dimension
d_out (int) – Output dimension
d_layers (List[int]) – Hidden layer dimensions
dropout (float) – Dropout probability
- forward(x, x_cat=None)
Forward pass with SELU activation for self-normalization.
SELU Activation Mathematical Definition:
\[\begin{split}\text{SELU}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases}\end{split}\]
where \(\lambda \approx 1.0507\) and \(\alpha \approx 1.6733\).
Self-Normalization Property:
For standardized inputs and suitably initialized weights, SELU drives the activations toward a fixed point:
- Mean converges to 0
- Variance converges to 1
This enables training of very deep networks without explicit normalization layers.
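The fixed-point claim can be checked numerically. The sketch below is an illustration using only the standard library: it applies SELU to samples from a standard normal distribution and measures the mean and variance of the result. The sample size and seed are arbitrary choices for this example.

```python
import math
import random

LAMBDA = 1.0507009873554805   # lambda in the definition above
ALPHA = 1.6732632423543772    # alpha in the definition above

def selu(v):
    """SELU(v) = lambda * v for v > 0, lambda * alpha * (exp(v) - 1) otherwise."""
    return LAMBDA * (v if v > 0 else ALPHA * (math.exp(v) - 1.0))

rng = random.Random(0)
sample = [selu(rng.gauss(0.0, 1.0)) for _ in range(100_000)]
mean = sum(sample) / len(sample)
var = sum((s - mean) ** 2 for s in sample) / len(sample)
print(round(mean, 2), round(var, 2))  # both should be close to 0 and 1
```

The constants are chosen so that a standard normal pre-activation maps to an output with zero mean and unit variance, which is what the empirical estimates approximate.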
Transformer-Based Models
Feature Tokenizer Transformer (FT-Transformer)
- class TALENT.model.models.ftt.Transformer(*args: Any, **kwargs: Any)
Bases: Module
Transformer.
References:
- https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html
- https://github.com/facebookresearch/pytext/tree/master/pytext/models/representations/transformer
- https://github.com/pytorch/fairseq/blob/1bba712622b8ae4efb3eb793a8a40da386fe11d0/examples/linformer/linformer_src/modules/multihead_linear_attention.py#L19
- TALENT.model.models.ftt.geglu(x)
- TALENT.model.models.ftt.get_activation_fn(name)
- TALENT.model.models.ftt.get_nonglu_activation_fn(name)
- TALENT.model.models.ftt.reglu(x)
- class TALENT.model.models.ftt.Transformer
Advanced transformer architecture specifically designed for tabular data with feature tokenization.
Mathematical Formulation:
Feature Tokenization:
For numerical features: \(t_i^{\text{num}} = W_{\text{num}} x_i + b_{\text{num}}\)
For categorical features: \(t_i^{\text{cat}} = \text{Embedding}(x_i^{\text{cat}})\)
- __init__(d_numerical, categories, d_token, n_layers, n_heads, d_ffn_factor, attention_dropout, ffn_dropout, residual_dropout, activation, prenormalization, d_out)
Initialize the FT-Transformer architecture.
Parameters:
d_numerical (int) – Number of numerical features
categories (List[int], optional) – Cardinalities for categorical features
d_token (int) – Token embedding dimension
n_layers (int) – Number of transformer layers
n_heads (int) – Number of attention heads
d_ffn_factor (float) – Factor for feed-forward network dimension
attention_dropout (float) – Dropout for attention weights
ffn_dropout (float) – Dropout for feed-forward network
residual_dropout (float) – Dropout for residual connections
activation (str) – Activation function for FFN
prenormalization (bool) – Whether to use pre-normalization
d_out (int) – Output dimension
- forward(x_num, x_cat)
Forward pass through the transformer.
Parameters:
x_num (torch.Tensor, optional) – Numerical features of shape (batch_size, d_numerical)
x_cat (torch.Tensor, optional) – Categorical features of shape (batch_size, n_categorical)
Returns:
torch.Tensor – Output predictions
Transformer Processing Pipeline:
Tokenization: Convert features to tokens using Tokenizer
CLS Token Addition: Prepend classification token
Transformer Layers: Apply multi-head attention and feed-forward networks
Output Generation: Use CLS token representation for final prediction
Transformer Layer Mathematical Implementation:
For each transformer layer:
\[\begin{split}\text{attn_out} &= \text{MultiHeadAttention}(x, x, x) \\ x &= \text{LayerNorm}(x + \text{attn_out}) \\ \text{ffn_out} &= \text{FFN}(x) \\ x &= \text{LayerNorm}(x + \text{ffn_out})\end{split}\]
- class TALENT.model.models.ftt.Tokenizer
Converts numerical and categorical features into token embeddings for transformer processing.
- __init__(d_numerical, categories, d_token, bias)
Initialize the feature tokenizer.
Parameters:
d_numerical (int) – Number of numerical features
categories (List[int], optional) – Cardinalities of categorical features
d_token (int) – Token embedding dimension
bias (bool) – Whether to use bias in tokenization
- forward(x_num, x_cat)
Convert features to token embeddings.
Tokenization Process:
Numerical Features:
\[\text{tokens}_{\text{num}} = x_{\text{num}} W_{\text{num}} + b_{\text{num}}\]
Categorical Features:
\[\text{tokens}_{\text{cat}} = \text{Embedding}(x_{\text{cat}} + \text{offsets})\]
CLS Token:
\[\text{tokens}_{\text{cls}} = W_{\text{cls}}\]
- property n_tokens
Total number of tokens (numerical + categorical + CLS).
Returns:
int – Total token count
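The tokenization steps can be sketched for a single sample with plain lists. This is a hedged illustration, not the Tokenizer source: it assumes each numerical feature has its own weight and bias vector of length d_token, that all categorical features share one embedding table indexed through cumulative offsets, and that the CLS token is prepended. All names and values below are made up for the example.

```python
from itertools import accumulate

def category_offsets(categories):
    """Start index of each categorical feature inside one shared embedding table."""
    return [0] + list(accumulate(categories))[:-1]

def n_tokens(d_numerical, categories):
    """CLS token + one token per numerical and per categorical feature."""
    return 1 + d_numerical + len(categories)

def tokenize(x_num, x_cat, weight, bias, cat_table, categories, cls_token):
    """Return [CLS] + numerical tokens + categorical tokens for one sample."""
    num_tokens = [[x * w + b for w, b in zip(w_row, b_row)]
                  for x, w_row, b_row in zip(x_num, weight, bias)]
    offsets = category_offsets(categories)
    cat_tokens = [cat_table[c + off] for c, off in zip(x_cat, offsets)]
    return [cls_token] + num_tokens + cat_tokens

categories = [3, 2]                                     # two categorical features
weight = [[1.0, 0.0], [0.0, 2.0]]                       # d_numerical = 2, d_token = 2
bias = [[0.0, 0.0], [1.0, 1.0]]
cat_table = [[float(i), float(-i)] for i in range(5)]   # 3 + 2 rows
tokens = tokenize([2.0, 3.0], [1, 1], weight, bias, cat_table, categories, [9.0, 9.0])
print(len(tokens), category_offsets(categories))        # 5 [0, 3]
```

The offsets let category index 1 of the second feature address row 3 + 1 = 4 of the shared table, so two features with overlapping raw indices never collide.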
- class TALENT.model.models.ftt.MultiheadAttention
Multi-head attention mechanism optimized for tabular data.
- __init__(d, n_heads, dropout, bias)
Initialize multi-head attention.
Parameters:
d (int) – Input dimension
n_heads (int) – Number of attention heads
dropout (float) – Attention dropout probability
bias (bool) – Whether to use bias in projections
- forward(x_q, x_kv, key_compression, value_compression)
Compute multi-head attention.
Parameters:
x_q (torch.Tensor) – Query input
x_kv (torch.Tensor) – Key and value input
key_compression (nn.Linear, optional) – Key compression layer
value_compression (nn.Linear, optional) – Value compression layer
Returns:
torch.Tensor – Attention output
Multi-Head Attention Mathematical Implementation:
Linear Projections:
\[Q = x_q W^Q, \quad K = x_{kv} W^K, \quad V = x_{kv} W^V\]
Scaled Dot-Product Attention:
\[\text{attention} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)\]
Output Computation:
\[\text{output} = \text{attention} \cdot V\]
Multi-Head Combination:
\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O\]
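For a single head, the scaled dot-product step above reduces to a few lines. The sketch below is a standard-library illustration under simplifying assumptions (one head, no dropout, no key or value compression, projections already applied). It is not the MultiheadAttention implementation itself.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one head; inputs are lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Identical keys give uniform weights, so the output is the mean of the values.
print(attention([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [4.0, 6.0]]))
# [[2.0, 4.0]]
```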
Advanced Tabular Models
TabNet
- class TALENT.model.models.tabnet.TabNetClassifier(*args: Any, **kwargs: Any)
Bases: TabModel
- compute_loss(y_pred, y_true)
Compute the loss.
- Parameters
y_pred (torch.Tensor) – Score matrix
y_true (torch.Tensor) – Target matrix
- Returns
Loss value
- Return type
- predict_func(outputs)
- predict_proba(X)
Make predictions for classification on a batch (valid)
- Parameters
X (torch.Tensor or scipy.sparse.csr_matrix) – Input data
- Returns
res
- Return type
np.ndarray
- prepare_target(y)
Prepare target before training.
- Parameters
y (torch.Tensor) – Target matrix.
- Returns
Converted target matrix.
- Return type
torch.Tensor
- stack_batches(list_y_true, list_y_score)
- update_fit_params(X_train, y_train, eval_set, weights)
Set attributes relative to fit function.
- Parameters
X_train (np.ndarray) – Train set
y_train (np.array) – Train targets
eval_set (list of tuple) – List of eval tuple set (X, y).
weights (bool or dictionary) – 0 for no balancing, 1 for automated balancing
- class TALENT.model.models.tabnet.TabNetRegressor(*args: Any, **kwargs: Any)
Bases: TabModel
- compute_loss(y_pred, y_true)
Compute the loss.
- Parameters
y_pred (torch.Tensor) – Score matrix
y_true (torch.Tensor) – Target matrix
- Returns
Loss value
- Return type
- predict_func(outputs)
- prepare_target(y)
Prepare target before training.
- Parameters
y (torch.Tensor) – Target matrix.
- Returns
Converted target matrix.
- Return type
torch.Tensor
- stack_batches(list_y_true, list_y_score)
- update_fit_params(X_train, y_train, eval_set, weights)
Set attributes relative to fit function.
- Parameters
X_train (np.ndarray) – Train set
y_train (np.array) – Train targets
eval_set (list of tuple) – List of eval tuple set (X, y).
weights (bool or dictionary) – 0 for no balancing, 1 for automated balancing
- class TALENT.model.models.tabnet.TabNetClassifier
Interpretable deep learning model with sequential attention mechanism for classification.
Mathematical Formulation:
TabNet uses sequential feature selection through sparsemax attention:
Feature Selection at Step i:
\[M^{[i]} = \text{sparsemax}(\text{AttentionTransformer}(f^{[i-1]}))\]
Feature Processing:
\[f^{[i]} = \gamma \odot M^{[i]} \odot h + (1-\gamma) \odot f^{[i-1]}\]
where \(\gamma\) is the relaxation parameter.
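The sparsemax operator used for the masks projects a score vector onto the probability simplex and, unlike softmax, can return exact zeros. The following standard-library sketch uses the usual sorting-based algorithm from Martins and Astudillo (2016) for a single vector. It is an illustration of the operator only and does not reproduce the batched implementation in the TabNet code.

```python
def sparsemax(z):
    """Euclidean projection of the score vector z onto the probability simplex."""
    z_sorted = sorted(z, reverse=True)
    cumsum, tau = 0.0, 0.0
    for i, v in enumerate(z_sorted, start=1):
        cumsum += v
        # The support is the largest prefix with 1 + i * z_(i) > sum of the prefix.
        if 1.0 + i * v > cumsum:
            tau = (cumsum - 1.0) / i
    return [max(0.0, v - tau) for v in z]

print(sparsemax([2.0, 0.0]))   # [1.0, 0.0]  (all mass on one feature)
print(sparsemax([0.5, 0.5]))   # [0.5, 0.5]
```

Entries that fall below the threshold tau receive exactly zero weight, which is what makes the resulting feature masks sparse and easy to read as feature selections.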
- __init__(n_steps, gamma, n_independent, n_shared, momentum, optimizer_params, scheduler_params, mask_type, lambda_sparse, seed)
Initialize TabNet classifier.
Parameters:
n_steps (int) – Number of decision steps
gamma (float) – Relaxation parameter for feature selection
n_independent (int) – Number of independent GLU layers per step
n_shared (int) – Number of shared GLU layers
momentum (float) – Momentum for batch normalization
optimizer_params (dict) – Optimizer configuration
scheduler_params (dict) – Learning rate scheduler parameters
mask_type (str) – Type of attention mask (‘sparsemax’ or ‘entmax’)
lambda_sparse (float) – Sparsity regularization coefficient
seed (int) – Random seed
- fit(X_train, y_train, eval_set, eval_name, eval_metric, max_epochs, patience, batch_size, virtual_batch_size, num_workers, drop_last, callbacks)
Train the TabNet model.
Training Process:
Data Preprocessing: Handle categorical encoding and normalization
Sequential Training: Train each decision step sequentially
Attention Regularization: Apply sparsity constraints on attention masks
Early Stopping: Monitor validation metrics for convergence
- predict_proba(X)
Make probability predictions for classification.
Parameters:
X (torch.Tensor or scipy.sparse matrix) – Input features
Returns:
np.ndarray – Class probabilities of shape (n_samples, n_classes)
Prediction Process:
Forward Pass: Process through all decision steps
Attention Aggregation: Combine attention from all steps
Softmax Application: Convert logits to probabilities
\[P(y=k|x) = \frac{\exp(o_k)}{\sum_{j=1}^K \exp(o_j)}\]
where \(o_k\) is the raw output for class \(k\).
- explain(X, normalize)
Generate feature importance explanations using attention masks.
Parameters:
X (torch.Tensor) – Input features
normalize (bool) – Whether to normalize importance scores
Returns:
np.ndarray – Feature importance matrix
Explanation Generation:
Attention masks from each decision step provide interpretable feature importance:
\[\text{importance}_{ij} = \frac{M^{[i]}_j}{\sum_{k=1}^{n_{\text{features}}} M^{[i]}_k}\]
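The normalization in the importance formula is a row-wise division by the row sum. A minimal sketch, assuming the masks are given as a list of rows and that rows summing to zero should stay zero (an assumption added here to avoid division by zero):

```python
def normalize_importance(masks):
    """Divide each row of mask values by its sum so that every row sums to 1."""
    out = []
    for row in masks:
        total = sum(row)
        out.append([m / total if total > 0 else 0.0 for m in row])
    return out

print(normalize_importance([[1.0, 3.0], [0.0, 0.0]]))
# [[0.25, 0.75], [0.0, 0.0]]
```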
- class TALENT.model.models.tabnet.TabNetRegressor
TabNet for regression tasks with mean squared error optimization.
- compute_loss(y_pred, y_true)
Compute mean squared error loss for regression.
MSE Loss Mathematical Definition:
\[\mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2\]
Tree-Based Neural Models
GRANDE (Gradient-Based Decision Tree Ensembles)
- class TALENT.model.models.grande.GRANDE(*args: Any, **kwargs: Any)
Bases: Module
- apply_preprocessing(X)
- build_model()
- entmax15(inputs, axis=-1)
- entmax_threshold_and_support(inputs, axis=-1)
- forward(inputs)
- preprocess_data(X_train, y_train, X_val, y_val)
- set_params(**kwargs)
- class TALENT.model.models.grande.GRANDE
Tree-mimic neural network using gradient descent for decision tree simulation.
Mathematical Formulation:
GRANDE simulates decision trees using neural operations with entmax for sparse selection.
- __init__(batch_size, task_type, depth, n_estimators, dropout)
Initialize GRANDE model.
Parameters:
batch_size (int) – Training batch size
task_type (str) – ‘classification’ or ‘regression’
depth (int) – Maximum tree depth
n_estimators (int) – Number of tree estimators
dropout (float) – Dropout probability
- forward(inputs)
Forward pass through the GRANDE ensemble.
Parameters:
inputs (torch.Tensor) – Input features
Returns:
torch.Tensor – Ensemble predictions
Tree Simulation Mathematical Implementation:
Split Decision Computation:
\[\text{node_result} = \frac{\text{softsign}(s_1 - s_2) + 1}{2}\]
where \(s_1\) are learned split thresholds and \(s_2\) are feature values.
Path Probability Calculation:
\[p = \prod_{j} ((1-\text{path_id}_j) \cdot \text{node_result}_j + \text{path_id}_j \cdot (1-\text{node_result}_j))\]
Ensemble Output for Regression:
\[\text{output} = \sum_{e,l} w_e \cdot p_{e,l} \cdot v_{e,l}\]
where \(w_e\) are estimator weights, \(p_{e,l}\) are leaf probabilities, and \(v_{e,l}\) are leaf values.
Ensemble Output for Classification:
\[\text{output} = \sum_{e,l} w_e \cdot p_{e,l} \cdot \text{softmax}(v_{e,l})\]
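The split and path formulas can be checked on a single depth-2 tree. The sketch below is an illustration with made-up split inputs, not the GRANDE forward pass: it evaluates the soft split at three internal nodes and multiplies the branch probabilities along each of the four root-to-leaf paths.

```python
def softsign(v):
    return v / (1.0 + abs(v))

def node_result(s1, s2):
    """Soft split decision in (0, 1): (softsign(s1 - s2) + 1) / 2."""
    return (softsign(s1 - s2) + 1.0) / 2.0

def path_probability(node_results, path_ids):
    """Product over the nodes on one path; path_id picks the branch taken at each node."""
    p = 1.0
    for r, pid in zip(node_results, path_ids):
        p *= (1 - pid) * r + pid * (1 - r)
    return p

# Hypothetical depth-2 tree: root r0, then child r1 (branch 0) or r2 (branch 1).
r0, r1, r2 = node_result(2.0, 1.0), node_result(0.0, 0.0), node_result(-3.0, 0.0)
leaves = [
    path_probability([r0, r1], [0, 0]),
    path_probability([r0, r1], [0, 1]),
    path_probability([r0, r2], [1, 0]),
    path_probability([r0, r2], [1, 1]),
]
print(leaves)       # [0.375, 0.375, 0.03125, 0.21875]
print(sum(leaves))  # 1.0
```

Because each node splits its incoming probability into r and 1 - r, the leaf probabilities of a tree always sum to one, so the ensemble output is a weighted average of leaf values.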
- get_representation(inputs)
Extract intermediate tree representations for analysis.
Returns:
torch.Tensor – Tree path representations
Neural Oblivious Decision Ensembles (NODE)
- class TALENT.model.models.node.NODE(*, d_in: int, num_layers: int, layer_dim: int, depth: int, tree_dim: int, choice_function: str, bin_function: str, d_out: int)
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(x_num: Tensor, x_cat: Tensor) → Tensor
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class TALENT.model.models.node.NODE
Neural implementation of oblivious decision trees with differentiable splits.
- __init__(d_in, num_layers, layer_dim, depth, tree_dim, choice_function, bin_function, d_out)
Initialize NODE architecture. All arguments are keyword-only.
Parameters:
d_in (int) – Input feature dimension
num_layers (int) – Number of NODE layers
layer_dim (int) – Hidden layer dimension
depth (int) – Tree depth
tree_dim (int) – Number of trees per layer
choice_function (str) – Function for feature selection (‘entmax15’)
bin_function (str) – Function for threshold selection (‘entmoid15’)
d_out (int) – Output dimension
- forward(x)
Forward pass through oblivious decision trees.
Decision Tree Mathematical Process:
Feature Selection: Use entmax for sparse feature selection
Threshold Comparison: Compare features with learned thresholds
Path Aggregation: Aggregate predictions along tree paths
Ensemble Combination: Combine outputs from multiple trees
GrowNet (Gradient Boosting with Neural Networks)
- class TALENT.model.models.grownet.DynamicNet(lr, categories: Optional[List[int]], d_embedding: Optional[int])
Bases: object
- add(model)
- embed_input(x_num, x_cat)
- forward(x_num, x_cat)
- forward_grad(x_num, x_cat)
- classmethod from_file(path, builder)
- parameters()
- to_cuda()
- to_double()
- to_eval()
- to_file(path)
- to_train()
- zero_grad()
- class TALENT.model.models.grownet.ForwardType(value)
Bases: Enum
An enumeration.
- CASCADE = 2
- GRADIENT = 3
- SIMPLE = 0
- STACKED = 1
- class TALENT.model.models.grownet.MLP_2HL(dim_in, dim_hidden1, dim_hidden2, dim_out, sparse=False, bn=True)
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(x, lower_f)
- classmethod get_model(stage, opt)
- class TALENT.model.models.grownet.SpLinear(input_features, output_features, bias=True)
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(input)
- class TALENT.model.models.grownet.SpLinearFunc(*args, **kwargs)
Bases: Function
- static backward(ctx, grad_output)
Defines a formula for differentiating the operation with backward mode automatic differentiation (alias to the vjp function).
This function is to be overridden by all subclasses.
It must accept a context ctx as the first argument, followed by as many outputs as the forward() returned (None will be passed in for non tensor outputs of the forward function), and it should return as many tensors as there were inputs to forward(). Each argument is the gradient w.r.t. the given output, and each returned value should be the gradient w.r.t. the corresponding input. If an input is not a Tensor or is a Tensor not requiring grads, you can just pass None as a gradient for that input.
The context can be used to retrieve tensors saved during the forward pass. It also has an attribute ctx.needs_input_grad as a tuple of booleans representing whether each input needs gradient. E.g., backward() will have ctx.needs_input_grad[0] = True if the first input to forward() needs gradient computed w.r.t. the output.
- static forward(ctx, input, weight, bias=None)
This function is to be overridden by all subclasses. There are two ways to define forward:
Usage 1 (Combined forward and ctx):
    @staticmethod
    def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:
        pass
It must accept a context ctx as the first argument, followed by any number of arguments (tensors or other types).
See combining-forward-context for more details
Usage 2 (Separate forward and ctx):
    @staticmethod
    def forward(*args: Any, **kwargs: Any) -> Any:
        pass

    @staticmethod
    def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:
        pass
The forward no longer accepts a ctx argument.
Instead, you must also override the torch.autograd.Function.setup_context() staticmethod to handle setting up the ctx object. output is the output of the forward, inputs are a Tuple of inputs to the forward.
See extending-autograd for more details
The context can be used to store arbitrary data that can be then retrieved during the backward pass. Tensors should not be stored directly on ctx (though this is not currently enforced for backward compatibility). Instead, tensors should be saved either with ctx.save_for_backward() if they are intended to be used in backward (equivalently, vjp) or ctx.save_for_forward() if they are intended to be used for in jvp.
- class TALENT.model.models.grownet.GrowNet
Gradient boosting framework with neural network weak learners.
- __init__(input_dim, output_dim, boost_rate, layers_per_net, layer_dims, dropout)
Initialize GrowNet with neural weak learners.
Gradient Boosting Process:
Weak Learner Training: Train neural networks on residuals
Boosting Update: Add weak learners with adaptive weights
Gradient Computation: Compute gradients for next weak learner
- forward(x)
Forward pass through the boosted ensemble.
Boosting Mathematical Formulation:
\[F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)\]
where \(h_m\) is the m-th weak learner and \(\gamma_m\) is the boosting rate.
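The additive update can be illustrated with a toy loop in which each stage fits the current residuals. This is only a sketch of the boosting recurrence: the weak learner here is a constant predictor standing in for a small network, the boosting rate is fixed at 0.5, and the passing of hidden features between stages (the lower_f argument of MLP_2HL.forward) is not modeled.

```python
def boosted_prediction(x, learners, rates, f0=0.0):
    """F_m(x) = F_{m-1}(x) + gamma_m * h_m(x), accumulated over all stages."""
    f = f0
    for h, gamma in zip(learners, rates):
        f += gamma * h(x)
    return f

def fit_constant_learner(residuals):
    """Toy weak learner: always predicts the mean of the residuals it was fit on."""
    mean = sum(residuals) / len(residuals)
    return lambda x: mean

xs, ys = [0.0, 1.0], [1.0, 3.0]
learners, rates = [], []
for _ in range(3):
    residuals = [y - boosted_prediction(x, learners, rates) for x, y in zip(xs, ys)]
    learners.append(fit_constant_learner(residuals))
    rates.append(0.5)

print(boosted_prediction(0.0, learners, rates))  # 1.75
```

Each stage halves the remaining gap to the target mean of 2.0 (1.0, then 1.5, then 1.75), which is the shrinkage effect of a boosting rate below 1.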
Distance-Based Models
Modern Neighborhood Components Analysis (ModernNCA)
- class TALENT.model.models.modernNCA.MLP_Block(*args: Any, **kwargs: Any)
Bases: Module
- forward(x: torch.Tensor) → torch.Tensor
- class TALENT.model.models.modernNCA.ModernNCA(*args: Any, **kwargs: Any)
Bases: Module
- forward(x, y, candidate_x, candidate_y, is_train)
- make_layer()
- class TALENT.model.models.modernNCA.ModernNCA
Model inspired by Neighborhood Components Analysis (NCA) that makes embedding-based predictions.
Mathematical Formulation:
ModernNCA learns embeddings for distance-based classification.
- __init__(d_in, d_out, k, dropout, d_embedding)
Initialize ModernNCA model.
Parameters:
d_in (int) – Input feature dimension
d_out (int) – Output dimension (number of classes)
k (int) – Number of nearest neighbors to consider
dropout (float) – Dropout probability
d_embedding (int) – Embedding dimension
- forward(x, y, candidate_x, candidate_y, is_train)
Forward pass with neighborhood analysis.
Parameters:
x (torch.Tensor) – Query features
y (torch.Tensor) – Query labels
candidate_x (torch.Tensor) – Candidate features for nearest neighbor search
candidate_y (torch.Tensor) – Candidate labels
is_train (bool) – Training mode flag
Returns:
torch.Tensor – Distance-based predictions
Distance-Based Prediction Mathematical Implementation:
Embedding Computation:
\[e_i = f(x_i), \quad e_j = f(x_j)\]
where \(f\) is the learned embedding function.
Distance Computation:
\[d(x_i, x_j) = ||e_i - e_j||_2\]
Neighbor Weighting:
\[p_{ij} = \frac{\exp(-d(x_i, x_j))}{\sum_{k \neq i} \exp(-d(x_i, x_k))}\]
Final Prediction:
\[\hat{y}_i = \sum_j p_{ij} y_j\]
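The weighting and prediction formulas amount to a softmax over negative distances followed by a weighted average of candidate labels. The sketch below assumes the embeddings have already been computed and that the candidate set excludes the query itself (the \(k \neq i\) condition). Names and data are illustrative.

```python
import math

def soft_neighbor_prediction(query, candidates, labels):
    """y_hat = sum_j p_j * y_j with p_j = softmax(-||e_q - e_j||_2) over candidates."""
    dists = [math.dist(query, c) for c in candidates]
    m = min(dists)                                  # shift for numerical stability
    weights = [math.exp(-(d - m)) for d in dists]
    total = sum(weights)
    n_out = len(labels[0])
    return [sum(w * y[k] for w, y in zip(weights, labels)) / total
            for k in range(n_out)]

# Two equidistant candidates with one-hot labels give an even split.
print(soft_neighbor_prediction([0.0, 0.0],
                               [[1.0, 0.0], [-1.0, 0.0]],
                               [[1.0, 0.0], [0.0, 1.0]]))  # [0.5, 0.5]
```

For classification the labels are one-hot vectors, so the result is a distribution over classes. For regression the same average applies to scalar targets wrapped in one-element lists.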
- knn_prediction(x, candidate_x, candidate_y, k)
Make predictions using k-nearest neighbors in embedding space.
K-NN Process:
Distance Calculation: Compute distances in embedding space
Neighbor Selection: Find k nearest neighbors
Prediction Aggregation: Aggregate neighbor labels with distance weighting
Specialized Architectures
ExcelFormer (Semi-Permeable Attention)
- class TALENT.model.models.excelformer.ExcelFormer(*, d_numerical: int, token_bias: bool, n_layers: int, d_token: int, n_heads: int, attention_dropout: float, ffn_dropout: float, residual_dropout: float, prenormalization: bool, kv_compression: Optional[float], kv_compression_sharing: Optional[str], d_out: int, init_scale: float = 0.1)
Bases: Module
ExcelFormer with all parameters initialized to small values.
initial function: v4
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(x_num: Tensor, x_cat=None, mix_up: bool = False, beta=0.5, mtype='feat_mix') → Tensor
- class TALENT.model.models.excelformer.MultiheadAttention(d: int, n_heads: int, dropout: float, init_scale: float = 0.01)
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(x_q: Tensor, x_kv: Tensor, key_compression: Optional[Linear], value_compression: Optional[Linear]) → Tensor
- get_attention_mask(input_shape, device)
- class TALENT.model.models.excelformer.Tokenizer(d_numerical: int, categories: Optional[List[int]], d_token: int, bias: bool)
Bases: Module
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(x_num: Tensor) → Tensor
- TALENT.model.models.excelformer.attenuated_kaiming_uniform_(tensor, a=2.23606797749979, scale=1.0, mode='fan_in', nonlinearity='leaky_relu')
- class TALENT.model.models.excelformer.ExcelFormer
Transformer with semi-permeable attention and mixup training capabilities.
- __init__(d_numerical, d_token, n_layers, attention_dropout, ffn_dropout, residual_dropout, d_out)
Initialize ExcelFormer architecture. Only a subset of the keyword arguments in the full signature is described here.
Parameters:
d_numerical (int) – Number of numerical features
d_token (int) – Token embedding dimension
n_layers (int) – Number of transformer blocks
attention_dropout (float) – Attention dropout probability
ffn_dropout (float) – Feed-forward dropout probability
residual_dropout (float) – Residual connection dropout
d_out (int) – Output dimension
- forward(x_num, x_cat, mix_up, beta, mtype)
Forward pass with optional mixup augmentation.
Parameters:
x_num (torch.Tensor) – Numerical features
x_cat (torch.Tensor, optional) – Categorical features
mix_up (bool) – Whether to apply mixup
beta (float) – Mixup parameter (default: 0.5)
mtype (str) – Mixup type (‘feat_mix’, ‘hidden_mix’, ‘naive_mix’)
Returns:
tuple – (output, feat_masks, shuffled_ids) for mixup training
Mixup Mathematical Implementation:
Feature Mixup:
\[\tilde{x} = \lambda x_i + (1-\lambda) x_j\]
Semi-Permeable Attention:
\[\text{Attention}_{\text{perm}}(Q, K, V) = \text{mask} \odot \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
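The interpolation \(\tilde{x} = \lambda x_i + (1-\lambda) x_j\) can be sketched for a small batch as follows. This illustrates only the simplest variant (linear interpolation with a randomly permuted partner) and assumes \(\lambda\) is supplied by the caller. In standard mixup it is usually drawn from a Beta distribution, for example with random.betavariate(beta, beta). The function name and the returned permutation are assumptions for this example.

```python
import random

def naive_mixup(batch, lam, rng=random):
    """Mix each row with a randomly chosen partner row: lam * x_i + (1 - lam) * x_j."""
    perm = list(range(len(batch)))
    rng.shuffle(perm)                       # partner index j for every row i
    mixed = [[lam * a + (1.0 - lam) * b for a, b in zip(batch[i], batch[j])]
             for i, j in enumerate(perm)]
    return mixed, perm

batch = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
mixed, shuffled_ids = naive_mixup(batch, lam=0.25, rng=random.Random(1))
print(len(mixed), sorted(shuffled_ids))     # 3 [0, 1, 2]
```

Returning the permutation lets the training loop mix the targets with the same partners and the same \(\lambda\).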
- mixup_process(x, beta, mtype)
Apply mixup augmentation to input features.
Mixup Types:
feat_mix: Feature-level mixing with learnable weights
hidden_mix: Hidden representation mixing
naive_mix: Simple linear interpolation
ProtoGate (Prototype-Based Gating)
- class TALENT.model.models.protogate.GatingNet(*args: Any, **kwargs: Any)
Bases: Module
Gating network for feature selection
- Parameters
input_dim (int) – input dimension of the gating network
a (float) – coefficient in hard relu activation function
sigma (float) – std of the Gaussian reparameterization noise
activation (str) – activation function of the gating net: ‘relu’, ‘l_relu’, ‘sigmoid’, ‘tanh’, or ‘none’
hidden_layer_list (list) – number of nodes for each hidden layer of the gating net, example: [200,200]
- forward(x)
- get_stochastic_gate(alpha)
This function replaced the feature_selector function in order to save Z
- hard_sigmoid(x)
Segment-wise linear approximation of sigmoid. Faster than sigmoid. Returns 0. if x < -2.5, 1. if x > 2.5. In -2.5 <= x <= 2.5, returns 0.2 * x + 0.5.
- Arguments
x: A tensor or variable.
- Returns
A tensor.
- class TALENT.model.models.protogate.HybridSort(*args: Any, **kwargs: Any)
Bases: Module
- forward(scores: torch.Tensor)
scores: elements to be sorted. Typical shape: batch_size x n x 1
- class TALENT.model.models.protogate.KNNNet(*args: Any, **kwargs: Any)
Bases: Module
- forward(query, neighbors, tau=1.0)
- class TALENT.model.models.protogate.PL(*args: Any, **kwargs: Any)
Bases: Distribution
scores: Shape: (batch_size x) n
tau: temperature for the relaxation. Scalar.
hard: use straight-through estimation if True
- arg_constraints = {'scores': torch.distributions.constraints.positive, 'tau': torch.distributions.constraints.positive}
- has_rsample = True
- log_prob(value)
value: permutation matrix. shape: batch_size x n x n
- property mean
- relaxed_sort(inp)
inp: elements to be sorted. Typical shape: batch_size x n x 1
- rsample(sample_shape, log_score=True)
sample_shape: number of samples from the PL distribution. Scalar.
- TALENT.model.models.protogate.get_activation(value)
- class TALENT.model.models.protogate.ProtoGate
Prototype-based model with gating mechanisms for interpretable feature selection.
- __init__(input_dim, output_dim, n_prototypes, n_components, dropout)
Initialize ProtoGate architecture.
Parameters:
input_dim (int) – Input feature dimension
output_dim (int) – Output dimension
n_prototypes (int) – Number of learned prototypes
n_components (int) – Number of components per prototype
dropout (float) – Dropout probability
- forward(x)
Forward pass with prototype-based gating.
Prototype-Based Processing:
Prototype Computation: Learn representative prototypes from data
Distance Calculation: Compute distances to prototypes
Gate Generation: Use distances to generate feature gates
Feature Selection: Apply gates for adaptive feature selection
- class TALENT.model.models.protogate.GatingNet
Gating network for prototype-based feature selection.
- hard_sigmoid(x)
Hard sigmoid activation for efficient gating.
Hard Sigmoid Mathematical Definition:
\[\text{hard\_sigmoid}(x) = \max(0, \min(1, 0.2x + 0.5))\]
This matches the segment-wise definition in the method docstring (0 for x < -2.5, 1 for x > 2.5) and provides a piecewise linear approximation to the sigmoid function for computational efficiency.
- forward(x)
Generate gating weights for feature selection.
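A minimal sketch of how a stochastic gate with hard-sigmoid clipping could be applied for feature selection. It assumes the gate value is `clamp(a * alpha + 0.5, 0, 1)` and that Gaussian noise with standard deviation `sigma` is added only in training mode. `ToyGate` and its single linear layer are hypothetical stand-ins, not the GatingNet implementation; `a` and `sigma` only mirror the GatingNet parameters of the same names.

```python
import torch
import torch.nn as nn

def hard_sigmoid(x, a=0.2):
    # Piecewise-linear gate: 0 below -0.5/a, 1 above 0.5/a, linear in between.
    return torch.clamp(a * x + 0.5, 0.0, 1.0)

class ToyGate(nn.Module):
    def __init__(self, input_dim, a=0.2, sigma=0.5):
        super().__init__()
        self.net = nn.Linear(input_dim, input_dim)    # hypothetical gating net
        self.a, self.sigma = a, sigma

    def forward(self, x):
        alpha = self.net(x)                           # per-sample gate logits
        # Noise is injected during training only; evaluation is deterministic.
        noise = torch.randn_like(alpha) if self.training else torch.zeros_like(alpha)
        z = hard_sigmoid(alpha + self.sigma * noise, self.a)
        return x * z, z                               # gated features, gate values

gate = ToyGate(input_dim=4).eval()
x = torch.randn(8, 4)
gated_x, z = gate(x)                                  # both have shape (8, 4)
```

Because the gate is clipped to [0, 1], features whose gate saturates at 0 are removed exactly rather than merely down-weighted.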
Retrieval-Based Models
TabR (Tabular Retrieval)
- class TALENT.model.models.tabr.TabR(*args: Any, **kwargs: Any)
Bases: Module
- forward(*, x_num: torch.Tensor, x_cat: Optional[torch.Tensor], y: Optional[torch.Tensor], candidate_x_num: Optional[torch.Tensor], candidate_x_cat: Optional[torch.Tensor], candidate_y: torch.Tensor, context_size: int, is_train: bool) -> torch.Tensor
- reset_parameters()
- class TALENT.model.models.tabr.TabR
KNN-attention hybrid model with retrieval-based predictions.
- __init__(n_num_features, n_cat_features, n_classes, context_size, normalization, num_embeddings, d_main, d_multiplier, encoder_n_blocks, predictor_n_blocks, mixer_normalization, dropout0, dropout1, activation)
Initialize TabR architecture.
Parameters:
n_num_features (int) – Number of numerical features
n_cat_features (int) – Number of categorical features
n_classes (int) – Number of output classes
context_size (int) – Maximum context size for retrieval
normalization (str) – Normalization type
num_embeddings (dict) – Embedding configurations
d_main (int) – Main hidden dimension
d_multiplier (int) – Dimension multiplier
encoder_n_blocks (int) – Number of encoder blocks
predictor_n_blocks (int) – Number of predictor blocks
mixer_normalization (str) – Mixer normalization type
dropout0 (float) – Input dropout
dropout1 (float) – Hidden dropout
activation (str) – Activation function
- forward(x_num, x_cat, y, candidate_x_num, candidate_x_cat, candidate_y, context_size, is_train)
Forward pass with retrieval-based attention.
Retrieval Process:
Context Selection: Select relevant examples from training set
Attention Computation: Apply attention over retrieved candidates
Feature Processing: Process query and candidate features
Prediction Generation: Combine retrieval and learned representations
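The retrieval steps above can be sketched as follows. This is a simplified illustration under assumptions, not TabR's implementation: similarity is taken to be the negative squared L2 distance between already-encoded rows, the retrieved values are plain label embeddings, and the names `retrieve_and_attend` and `candidate_y_emb` are hypothetical.

```python
import torch
import torch.nn.functional as F

def retrieve_and_attend(query, candidates, candidate_y_emb, context_size):
    """query: (B, D); candidates: (N, D); candidate_y_emb: (N, D)."""
    # Context selection: keep the context_size most similar candidates.
    sim = -torch.cdist(query, candidates).pow(2)            # (B, N)
    top_sim, top_idx = sim.topk(context_size, dim=-1)       # (B, K)
    # Attention computation: softmax over the retrieved neighbors only.
    weights = F.softmax(top_sim, dim=-1)                    # (B, K)
    values = candidate_y_emb[top_idx]                       # (B, K, D)
    context = (weights.unsqueeze(-1) * values).sum(dim=1)   # (B, D)
    # Prediction input: query representation plus retrieved context.
    return query + context, weights

query = torch.randn(3, 6)
candidates = torch.randn(20, 6)
candidate_y_emb = torch.randn(20, 6)
out, weights = retrieve_and_attend(query, candidates, candidate_y_emb, context_size=5)
```

Restricting the softmax to the top `context_size` candidates keeps the cost of the attention step independent of the total number of candidates once the neighbors are found.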
Foundation Models
TabPFN (Tabular Prior-Fitting Networks)
- class TALENT.model.models.tabpfn.TabPFNClassifier(*args: Any, **kwargs: Any)
Bases: BaseEstimator, ClassifierMixin
Initializes the classifier and loads the model. Depending on the arguments, the model is either loaded from memory, from a file, or downloaded from the repository if no model is found.
Can also be used to compute gradients with respect to the inputs X_train and X_test. To do so, no_grad must be set to False and no_preprocess_mode must be True. In addition, X_train and X_test must be given as torch.Tensors with requires_grad set to True.
- Parameters
device – If the model should run on cuda or cpu.
base_path – Base path of the directory, from which the folders like models_diff can be accessed.
model_string – Name of the model. Used first to check if the model is already in memory, and if not, tries to load a model with that name from the models_diff directory. It looks for files named as follows: “prior_diff_real_checkpoint” + model_string + “_n_0_epoch_e.cpkt”, where e can be a number between 100 and 0, and is checked in a descending order.
N_ensemble_configurations – The number of ensemble configurations used for the prediction. Both accuracy and running time increase with this number.
no_preprocess_mode – Specifies whether preprocessing is to be performed.
multiclass_decoder – If set to permutation, randomly shifts the classes for each ensemble configuration.
feature_shift_decoder – If set to true shifts the features for each ensemble configuration according to a random permutation.
only_inference – Indicates if the model should be loaded to only restore inference capabilities or also training capabilities. Note that the training capabilities are currently not being fully restored.
seed – Seed that is used for the prediction. Allows for a deterministic behavior of the predictions.
batch_size_inference – This parameter is a trade-off between performance and memory consumption. The computation done with different values for batch_size_inference is the same, but it is split into smaller/larger batches.
no_grad – If set to false, allows for the computation of gradients with respect to X_train and X_test. For this to work correctly, no_preprocess_mode must be set to true.
subsample_features – If set to true and the number of features in the dataset exceeds self.max_features (100), the features are subsampled to self.max_features.
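The checkpoint lookup described for model_string can be sketched as a descending search over epoch numbers. This is a minimal sketch under assumptions: `find_checkpoint` and `models_dir` are hypothetical names, and only the file-name pattern and the 100-to-0 search order are taken from the parameter description.

```python
import os
import tempfile

def find_checkpoint(models_dir, model_string, max_epoch=100):
    """Return (path, epoch) of the highest-epoch checkpoint, or (None, None)."""
    for e in range(max_epoch, -1, -1):                # 100, 99, ..., 0
        name = f"prior_diff_real_checkpoint{model_string}_n_0_epoch_{e}.cpkt"
        path = os.path.join(models_dir, name)
        if os.path.exists(path):
            return path, e
    return None, None

# Usage with two empty placeholder files standing in for checkpoints.
with tempfile.TemporaryDirectory() as d:
    for e in (7, 42):
        open(os.path.join(d, f"prior_diff_real_checkpoint_n_0_epoch_{e}.cpkt"), "w").close()
    found_path, found_epoch = find_checkpoint(d, model_string="")
    missing_path, missing_epoch = find_checkpoint(d, model_string="_other")
```

Searching from the highest epoch downward means the most recently saved checkpoint wins when several are present.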
- fit(X, y, overwrite_warning=False)
Validates the training set and stores it.
If clf.no_grad is True (the default), X and y should be of type np.ndarray; otherwise X should be a torch.Tensor (y can be np.ndarray or torch.Tensor).
- load_result_minimal(path, i, e)
- models_in_memory = {}
- predict(X, return_winning_probability=False, normalize_with_test=False)
- predict_proba(X, normalize_with_test=False, return_logits=False)
Predict the probabilities for the input X depending on the training set previously passed in the method fit.
If no_grad is true in the classifier the function takes X as a numpy.ndarray. If no_grad is false X must be a torch tensor and is not fully checked.
- remove_models_from_memory()
- class TALENT.model.models.tabpfn.TabPFNClassifier
Prior-fitting network for zero-shot tabular classification.
- __init__(device, base_path)
Initialize TabPFN with pre-trained weights.
Foundation Model Features:
Pre-trained on diverse tabular datasets
No gradient-based training required
Immediate deployment capability
Context-based learning from examples
- fit(X, y)
Fit the model using in-context learning (no parameter updates).
In-Context Learning Process:
Context Setup: Store training examples as context
No Weight Updates: Model weights remain frozen
Context Encoding: Encode training data for reference
- predict_proba(X)
Make predictions using in-context learning.
Zero-Shot Prediction:
Context Retrieval: Use stored training context
Attention Mechanism: Apply attention over training examples
Prediction Generation: Generate predictions without fine-tuning
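To make the fit/predict_proba contract concrete, here is a toy stand-in that follows the same pattern: fit only stores the context, and prediction weights the stored examples by similarity to the query. It is a hypothetical illustration, not TabPFN; the real model runs a pre-trained transformer over the context, whereas `ToyInContextClassifier` uses a simple softmax over negative squared distances.

```python
import math

class ToyInContextClassifier:
    """Same fit/predict_proba shape as above; no weights are learned."""

    def fit(self, X, y):
        # "Fitting" only stores the context; nothing is optimized.
        self.X_ = [list(row) for row in X]
        self.y_ = list(y)
        self.classes_ = sorted(set(self.y_))
        return self

    def predict_proba(self, X):
        probs = []
        for q in X:
            # Attention-like weights over all stored training examples.
            logits = [-sum((a - b) ** 2 for a, b in zip(q, r)) for r in self.X_]
            m = max(logits)                            # for numerical stability
            w = [math.exp(l - m) for l in logits]
            total = sum(w)
            probs.append([
                sum(wi for wi, yi in zip(w, self.y_) if yi == c) / total
                for c in self.classes_
            ])
        return probs

clf = ToyInContextClassifier().fit([[0, 0], [0, 1], [5, 5], [5, 6]], [0, 0, 1, 1])
proba = clf.predict_proba([[0, 0.5], [5, 5.5]])
```

Calling fit again with different data changes the predictions immediately, since the stored context is the only state.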
Regularization Methods
TANGOS Regularization
- class TALENT.model.models.tangos.Tangos(*args: Any, **kwargs: Any)
Bases: Module
- cal_representation(x)
- cal_tangos_loss(x)
- forward(x, x_cat)
- class TALENT.model.models.tangos.Tangos
MLP with TANGOS regularization for neuron specialization.
Mathematical Formulation:
TANGOS applies spatial and spectral regularization to encourage neuron specialization:
\[\mathcal{L}_{\text{TANGOS}} = \mathcal{L}_{\text{task}} + \lambda_1 \mathcal{L}_{\text{spatial}} + \lambda_2 \mathcal{L}_{\text{spectral}}\]
- __init__(d_in, d_out, d_layers, dropout, lambda1, lambda2)
Initialize TANGOS-regularized MLP.
Parameters:
d_in (int) – Input dimension
d_out (int) – Output dimension
d_layers (List[int]) – Hidden layer dimensions
dropout (float) – Dropout probability
lambda1 (float) – Spatial regularization weight
lambda2 (float) – Spectral regularization weight
- forward(x, x_cat=None)
Forward pass with standard MLP architecture.
- cal_representation(x)
Calculate intermediate representations for regularization.
Parameters:
x (torch.Tensor) – Input features
Returns:
torch.Tensor – Hidden representations before final layer
Representation Extraction Process:
The method extracts intermediate representations by stopping before the final layer:
for i, layer in enumerate(self.layers):
    x = layer(x)
    x = F.relu(x)
    if self.dropout and i != len(self.layers) - 1:
        x = F.dropout(x, self.dropout, self.training)
return x  # Return before final head layer
Regularization Applications:
Spatial Regularization: Encourages spatial locality in neuron activations
Spectral Regularization: Promotes spectral diversity in learned representations
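A minimal sketch of how a task loss and two weighted penalties on the hidden representation can be combined, following the formula above. It rests on assumptions: the penalties are computed from the gradients of each hidden unit with respect to the inputs (an L1 size penalty and a pairwise cosine-similarity penalty, as described in the TANGOS paper), and how they map onto the two regularization terms named above is not confirmed here. The helper `attribution_penalties` and the toy network are hypothetical; this is not the code behind cal_tangos_loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attribution_penalties(represent, x):
    """Return (size, overlap) penalties from per-sample input gradients."""
    # jac[b, h, d] = d represent(x_b)[h] / d x_b[d]
    jac = torch.stack([
        torch.autograd.functional.jacobian(represent, xi, create_graph=True)
        for xi in x
    ])
    size = jac.abs().mean()                            # L1 size of attributions
    unit = F.normalize(jac, dim=-1)                    # unit-norm attribution per neuron
    cos = unit @ unit.transpose(1, 2)                  # (B, H, H) pairwise cosine
    h = cos.shape[-1]
    diag = torch.diagonal(cos, dim1=1, dim2=2).abs().sum(-1)
    overlap = ((cos.abs().sum(dim=(1, 2)) - diag) / (h * (h - 1))).mean()
    return size, overlap

torch.manual_seed(0)
body = nn.Sequential(nn.Linear(5, 8), nn.ReLU())      # toy hidden representation
head = nn.Linear(8, 1)
x, y = torch.randn(4, 5), torch.randn(4)

task = F.mse_loss(head(body(x)).squeeze(-1), y)
reg1, reg2 = attribution_penalties(body, x)
lambda1, lambda2 = 1.0, 0.1                            # illustrative weights only
loss = task + lambda1 * reg1 + lambda2 * reg2
loss.backward()                                        # gradients flow through both penalties
```

The per-sample Jacobian loop is written for clarity; it scales poorly with batch size and would normally be batched.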
Activation Functions Reference
Standard Activations:
Gated Activations:
Probability Functions:
where \(\Delta^{K-1}\) is the probability simplex.
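For orientation, the standard textbook definitions of the activations named in this section can be written as follows. Which exact variants TALENT implements is an assumption here. \(\Phi\) is the standard normal CDF, \(W, V, b, c\) denote the weights and biases of the two linear branches of a gated unit, and \(\odot\) is the element-wise product.
\[\begin{split}\text{ReLU}(x) &= \max(0, x) \\ \text{GELU}(x) &= x \, \Phi(x) \\ \text{ReGLU}(x) &= \text{ReLU}(xW + b) \odot (xV + c) \\ \text{GeGLU}(x) &= \text{GELU}(xW + b) \odot (xV + c) \\ \text{softmax}(z)_k &= \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \quad \text{softmax}: \mathbb{R}^K \to \Delta^{K-1}\end{split}\]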
Model Usage Examples
Basic MLP Usage:
import torch
from TALENT.model.models.mlp import MLP
# Initialize MLP
model = MLP(
    d_in=10,            # Input dimension
    d_out=3,            # Output dimension (3 classes)
    d_layers=[64, 32],  # Hidden layer sizes
    dropout=0.1         # Dropout probability
)
# Forward pass
x = torch.randn(32, 10)  # Batch of 32 samples, 10 features
output = model(x)        # Shape: (32, 3)
ResNet with Advanced Activations:
from TALENT.model.models.resnet import ResNet
# Initialize ResNet with GeGLU activation
model = ResNet(
d_in=15,
d_out=1, # Regression task
d=128, # Hidden dimension
d_hidden_factor=2.0, # Hidden expansion factor
n_layers=4, # Number of residual blocks
activation='geglu', # GeGLU activation
normalization='layernorm', # Layer normalization
hidden_dropout=0.1,
residual_dropout=0.1
)
FT-Transformer with Mixed Features:
from TALENT.model.models.ftt import Transformer
# Initialize FT-Transformer
model = Transformer(
d_numerical=8, # 8 numerical features
categories=[5, 10, 3], # 3 categorical features with cardinalities
d_token=64, # Token dimension
n_layers=3, # Number of transformer layers
n_heads=8, # Attention heads
d_ffn_factor=2.0, # FFN expansion factor
attention_dropout=0.1,
ffn_dropout=0.1,
residual_dropout=0.1,
activation='reglu',
prenormalization=True,
d_out=5 # 5 classes
)
TabNet for Interpretable Classification:
from TALENT.model.models.tabnet import TabNetClassifier
# Initialize TabNet
model = TabNetClassifier(
n_steps=3, # Decision steps
gamma=1.3, # Relaxation parameter
n_independent=2, # Independent GLU layers
n_shared=2, # Shared GLU layers
momentum=0.02, # Batch norm momentum
lambda_sparse=1e-3 # Sparsity regularization
)
# Training
model.fit(X_train, y_train,
eval_set=[(X_val, y_val)],
max_epochs=100)
# Get predictions and explanations
predictions = model.predict_proba(X_test)
# explain returns per-feature importances and the per-step masks
explanations, masks = model.explain(X_test, normalize=True)
GRANDE for Tree-like Neural Networks:
from TALENT.model.models.grande import GRANDE
# Initialize GRANDE
model = GRANDE(
batch_size=64,
task_type='classification',
depth=4, # Tree depth
n_estimators=10, # Number of trees
dropout=0.1
)
ModernNCA with Distance-Based Learning:
from TALENT.model.models.modernNCA import ModernNCA
# Initialize ModernNCA
model = ModernNCA(
d_in=15,
d_out=4, # 4 classes
k=32, # Number of neighbors
dropout=0.1,
d_embedding=64 # Embedding dimension
)
# Training requires candidate examples
output = model(x, y, candidate_x, candidate_y, is_train=True)
ExcelFormer with Mixup Training:
from TALENT.model.models.excelformer import ExcelFormer
# Initialize ExcelFormer
model = ExcelFormer(
d_numerical=10,
d_token=64,
n_blocks=3,
attention_dropout=0.1,
ffn_dropout=0.1,
d_out=3
)
# Forward pass with feature mixup
output, masks, shuffled_ids = model(
x_num,
mix_up=True,
beta=0.5,
mtype='feat_mix'
)
TabPFN for Zero-Shot Learning:
from TALENT.model.models.tabpfn import TabPFNClassifier
# Initialize pre-trained TabPFN
model = TabPFNClassifier(device='cuda')
# No training required - just fit context
model.fit(X_train, y_train)
# Immediate predictions
predictions = model.predict_proba(X_test)
Model Selection Guidelines
For Beginners:
- MLP: Simple, fast, good baseline
- ResNet: Better than MLP for deeper networks
For Best Performance:
- FT-Transformer: State-of-the-art on many datasets
- TabNet: Excellent performance with interpretability
- ModernNCA: Strong embedding-based performance
For Interpretability:
- TabNet: Attention-based feature importance
- GRANDE: Tree-like decision process
- ProtoGate: Prototype-based explanations
For Speed:
- MLP: Fastest training and inference
- SNN: Lightweight with self-normalization
- TabPFN: No training required
For Specific Scenarios:
- TabR: Retrieval-based learning
- ExcelFormer: Complex feature interactions with mixup
- TANGOS: When regularization is critical