====================================
Models
====================================

Deep learning models for tabular data, implementing various state-of-the-art architectures.

This section contains all the neural network architectures implemented in TALENT, ranging from simple MLPs to advanced transformer-based models specifically designed for tabular data. Each model implements specific forward pass computations, mathematical operations, and architectural innovations.

Basic Neural Networks
=====================

Multi-Layer Perceptron (MLP)
----------------------------

.. automodule:: TALENT.model.models.mlp
   :members:
   :undoc-members:
   :show-inheritance:

.. class:: MLP
   :noindex:

   Simple feedforward neural network with multiple fully connected layers and ReLU activations.
   
   **Mathematical Formulation:**
   
   For input :math:`x \in \mathbb{R}^{d_{in}}`, the MLP computes:
   
   .. math::
      
      h_0 &= x \\
      h_i &= \text{ReLU}(\text{Linear}(h_{i-1})) = \text{ReLU}(W_i h_{i-1} + b_i) \\
      \text{output} &= W_{\text{head}} h_L + b_{\text{head}}
   
   where :math:`L` is the number of hidden layers.

   .. method:: __init__(d_in, d_out, d_layers, dropout)
      :noindex:
      
      Initialize the MLP architecture.
      
      **Parameters:**
      
      * **d_in** (*int*) -- Input feature dimension
      * **d_out** (*int*) -- Output dimension (number of classes for classification, 1 for regression)
      * **d_layers** (*List[int]*) -- Hidden layer dimensions, e.g., [64, 32] for two hidden layers
      * **dropout** (*float*) -- Dropout probability applied after each hidden layer
      
      **Architecture Construction:**
      
      1. **Hidden Layers:** Creates `nn.Linear` layers with dimensions specified in `d_layers`
      2. **Output Head:** Final linear layer mapping to output dimension
      3. **Dropout Setup:** Configures dropout for regularization during training

   .. method:: forward(x, x_cat=None)
      :noindex:
      
      Forward pass through the MLP network.
      
      **Parameters:**
      
      * **x** (*torch.Tensor*) -- Input numerical features of shape (batch_size, d_in)
      * **x_cat** (*torch.Tensor, optional*) -- Categorical features (not used in MLP, maintained for interface consistency)
      
      **Returns:**
      
      * **torch.Tensor** -- Output predictions of shape (batch_size, d_out) or (batch_size,) for regression
      
      **Forward Pass Implementation:**
      
      .. code-block:: python
         
         for layer in self.layers:
             x = layer(x)  # Linear: x = W @ x + b
             x = F.relu(x)  # ReLU: x = max(0, x)
             if self.dropout:
                 x = F.dropout(x, self.dropout, self.training)
         
         logit = self.head(x)  # Final output layer
         if self.d_out == 1:
             logit = logit.squeeze(-1)  # For regression
      
      **ReLU Activation:**
      
      .. math::
         
         \text{ReLU}(x) = \max(0, x)
      
      **Dropout Regularization:**
      
      During training, randomly sets elements to zero with probability `dropout`:
      
      .. math::
         
         \text{Dropout}(x) = \begin{cases}
         \frac{x}{1-p} & \text{with probability } 1-p \\
         0 & \text{with probability } p
         \end{cases}

Residual Network (ResNet)
-------------------------

.. automodule:: TALENT.model.models.resnet
   :members:
   :undoc-members:
   :show-inheritance:

.. class:: ResNet
   :noindex:

   Deep residual network with skip connections for tabular data, preventing gradient vanishing in deep architectures.
   
   **Mathematical Formulation:**
   
   ResNet uses residual blocks with skip connections:
   
   .. math::
      
      h_{i+1} = h_i + F(h_i, W_i)
   
   where :math:`F(h_i, W_i)` is the residual function.

   .. method:: __init__(d_in, d, d_hidden_factor, n_layers, activation, normalization, hidden_dropout, residual_dropout, d_out)
      :noindex:
      
      Initialize the ResNet architecture with configurable components.
      
      **Parameters:**
      
      * **d_in** (*int*) -- Input feature dimension
      * **d** (*int*) -- Hidden dimension for residual blocks
      * **d_hidden_factor** (*float*) -- Factor to scale hidden layer width within blocks
      * **n_layers** (*int*) -- Number of residual blocks
      * **activation** (*str*) -- Activation function ('relu', 'gelu', 'reglu', 'geglu')
      * **normalization** (*str*) -- Normalization type ('batchnorm', 'layernorm')
      * **hidden_dropout** (*float*) -- Dropout probability within residual blocks
      * **residual_dropout** (*float*) -- Dropout probability for residual connections
      * **d_out** (*int*) -- Output dimension

   .. method:: forward(x, x_cat=None)
      :noindex:
      
      Forward pass through the ResNet architecture.
      
      **Parameters:**
      
      * **x** (*torch.Tensor*) -- Input numerical features
      * **x_cat** (*torch.Tensor, optional*) -- Categorical features (not used)
      
      **Returns:**
      
      * **torch.Tensor** -- Output predictions
      
      **Residual Block Mathematical Implementation:**
      
      For each residual block, the computation follows:
      
      .. math::
         
         \text{residual} &= \text{Norm}(h_i) \\
         \text{residual} &= \text{Linear}(\text{residual}) \\
         \text{residual} &= \text{Activation}(\text{residual}) \\
         \text{residual} &= \text{Dropout}(\text{residual}) \\
         \text{residual} &= \text{Linear}(\text{residual}) \\
         \text{residual} &= \text{Dropout}(\text{residual}) \\
         h_{i+1} &= h_i + \text{residual}
      
      **Activation Functions:**
      
      * **ReLU:** :math:`\text{ReLU}(x) = \max(0, x)`
      * **GELU:** :math:`\text{GELU}(x) = x \cdot \Phi(x)`
      * **ReGLU:** :math:`\text{ReGLU}(x) = a \cdot \text{ReLU}(b)` where :math:`a, b = \text{split}(x)`
      * **GeGLU:** :math:`\text{GeGLU}(x) = a \cdot \text{GELU}(b)` where :math:`a, b = \text{split}(x)`

   .. function:: reglu(x)
      :noindex:
      
      ReGLU activation function for gated linear units.
      
      **Mathematical Definition:**
      
      .. math::
         
         \text{ReGLU}(x) = a \cdot \text{ReLU}(b)
      
      where :math:`a` and :math:`b` are obtained by splitting :math:`x` along the last dimension.

   .. function:: geglu(x)
      :noindex:
      
      GeGLU activation function combining gating with GELU.
      
      **Mathematical Definition:**
      
      .. math::
         
         \text{GeGLU}(x) = a \cdot \text{GELU}(b)
      
      where :math:`a` and :math:`b` are obtained by splitting :math:`x` along the last dimension.

Self-Normalizing Network (SNN)
------------------------------

.. automodule:: TALENT.model.models.snn
   :members:
   :undoc-members:
   :show-inheritance:

.. class:: SNN
   :noindex:

   Lightweight neural network with self-normalizing properties using SELU activation.

   .. method:: __init__(d_in, d_out, d_layers, dropout)
      :noindex:
      
      Initialize SNN with SELU activations for self-normalization.
      
      **Parameters:**
      
      * **d_in** (*int*) -- Input dimension
      * **d_out** (*int*) -- Output dimension  
      * **d_layers** (*List[int]*) -- Hidden layer dimensions
      * **dropout** (*float*) -- Dropout probability

   .. method:: forward(x, x_cat=None)
      :noindex:
      
      Forward pass with SELU activation for self-normalization.
      
      **SELU Activation Mathematical Definition:**
      
      .. math::
         
         \text{SELU}(x) = \lambda \begin{cases}
         x & \text{if } x > 0 \\
         \alpha(e^x - 1) & \text{if } x \leq 0
         \end{cases}
      
      where :math:`\lambda \approx 1.0507` and :math:`\alpha \approx 1.6733`.
      
      **Self-Normalization Property:**
      
      SELU ensures that for normalized inputs, activations maintain:
      - Mean converges to 0
      - Variance converges to 1
      - Enables training of very deep networks without explicit normalization

Transformer-Based Models
========================

Feature Tokenizer Transformer (FT-Transformer)
-----------------------------------------------

.. automodule:: TALENT.model.models.ftt
   :members:
   :undoc-members:
   :show-inheritance:

.. class:: Transformer
   :noindex:

   Advanced transformer architecture specifically designed for tabular data with feature tokenization.
   
   **Mathematical Formulation:**
   
   **Feature Tokenization:**
   
   For numerical features: :math:`t_i^{\text{num}} = W_{\text{num}} x_i + b_{\text{num}}`
   
   For categorical features: :math:`t_i^{\text{cat}} = \text{Embedding}(x_i^{\text{cat}})`

   .. method:: __init__(d_numerical, categories, d_token, n_layers, n_heads, d_ffn_factor, attention_dropout, ffn_dropout, residual_dropout, activation, prenormalization, d_out)
      :noindex:
      
      Initialize the FT-Transformer architecture.
      
      **Parameters:**
      
      * **d_numerical** (*int*) -- Number of numerical features
      * **categories** (*List[int], optional*) -- Cardinalities for categorical features
      * **d_token** (*int*) -- Token embedding dimension
      * **n_layers** (*int*) -- Number of transformer layers
      * **n_heads** (*int*) -- Number of attention heads
      * **d_ffn_factor** (*float*) -- Factor for feed-forward network dimension
      * **attention_dropout** (*float*) -- Dropout for attention weights
      * **ffn_dropout** (*float*) -- Dropout for feed-forward network
      * **residual_dropout** (*float*) -- Dropout for residual connections
      * **activation** (*str*) -- Activation function for FFN
      * **prenormalization** (*bool*) -- Whether to use pre-normalization
      * **d_out** (*int*) -- Output dimension

   .. method:: forward(x_num, x_cat)
      :noindex:
      
      Forward pass through the transformer.
      
      **Parameters:**
      
      * **x_num** (*torch.Tensor, optional*) -- Numerical features of shape (batch_size, d_numerical)
      * **x_cat** (*torch.Tensor, optional*) -- Categorical features of shape (batch_size, n_categorical)
      
      **Returns:**
      
      * **torch.Tensor** -- Output predictions
      
      **Transformer Processing Pipeline:**
      
      1. **Tokenization:** Convert features to tokens using `Tokenizer`
      2. **CLS Token Addition:** Prepend classification token
      3. **Transformer Layers:** Apply multi-head attention and feed-forward networks
      4. **Output Generation:** Use CLS token representation for final prediction
      
      **Transformer Layer Mathematical Implementation:**
      
      For each transformer layer:
      
      .. math::
         
         \text{attn_out} &= \text{MultiHeadAttention}(x, x, x) \\
         x &= \text{LayerNorm}(x + \text{attn_out}) \\
         \text{ffn_out} &= \text{FFN}(x) \\
         x &= \text{LayerNorm}(x + \text{ffn_out})

.. class:: Tokenizer
   :noindex:

   Converts numerical and categorical features into token embeddings for transformer processing.

   .. method:: __init__(d_numerical, categories, d_token, bias)
      :noindex:
      
      Initialize the feature tokenizer.
      
      **Parameters:**
      
      * **d_numerical** (*int*) -- Number of numerical features
      * **categories** (*List[int], optional*) -- Cardinalities of categorical features
      * **d_token** (*int*) -- Token embedding dimension
      * **bias** (*bool*) -- Whether to use bias in tokenization

   .. method:: forward(x_num, x_cat)
      :noindex:
      
      Convert features to token embeddings.
      
      **Tokenization Process:**
      
      **Numerical Features:**
      
      .. math::
         
         \text{tokens}_{\text{num}} = x_{\text{num}} W_{\text{num}} + b_{\text{num}}
      
      **Categorical Features:**
      
      .. math::
         
         \text{tokens}_{\text{cat}} = \text{Embedding}(x_{\text{cat}} + \text{offsets})
      
      **CLS Token:**
      
      .. math::
         
         \text{tokens}_{\text{cls}} = W_{\text{cls}}

   .. property:: n_tokens
      :noindex:
      
      Total number of tokens (numerical + categorical + CLS).
      
      **Returns:**
      
      * **int** -- Total token count

.. class:: MultiheadAttention
   :noindex:

   Multi-head attention mechanism optimized for tabular data.

   .. method:: __init__(d, n_heads, dropout, bias)
      :noindex:
      
      Initialize multi-head attention.
      
      **Parameters:**
      
      * **d** (*int*) -- Input dimension
      * **n_heads** (*int*) -- Number of attention heads
      * **dropout** (*float*) -- Attention dropout probability
      * **bias** (*bool*) -- Whether to use bias in projections

   .. method:: forward(x_q, x_kv, key_compression, value_compression)
      :noindex:
      
      Compute multi-head attention.
      
      **Parameters:**
      
      * **x_q** (*torch.Tensor*) -- Query input
      * **x_kv** (*torch.Tensor*) -- Key and value input
      * **key_compression** (*nn.Linear, optional*) -- Key compression layer
      * **value_compression** (*nn.Linear, optional*) -- Value compression layer
      
      **Returns:**
      
      * **torch.Tensor** -- Attention output
      
      **Multi-Head Attention Mathematical Implementation:**
      
      1. **Linear Projections:**
         
         .. math::
            
            Q = x_q W^Q, \quad K = x_{kv} W^K, \quad V = x_{kv} W^V
      
      2. **Scaled Dot-Product Attention:**
         
         .. math::
            
            \text{attention} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)
      
      3. **Output Computation:**
         
         .. math::
            
            \text{output} = \text{attention} \cdot V
      
      4. **Multi-Head Combination:**
         
         .. math::
            
            \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O

Advanced Tabular Models
=======================

TabNet
------

.. automodule:: TALENT.model.models.tabnet
   :members:
   :undoc-members:
   :show-inheritance:

.. class:: TabNetClassifier
   :noindex:

   Interpretable deep learning model with sequential attention mechanism for classification.
   
   **Mathematical Formulation:**
   
   TabNet uses sequential feature selection through sparsemax attention:
   
   **Feature Selection at Step i:**
   
   .. math::
      
      M^{[i]} = \text{sparsemax}(\text{AttentionTransformer}(f^{[i-1]}))
   
   **Feature Processing:**
   
   .. math::
      
      f^{[i]} = \gamma \odot M^{[i]} \odot h + (1-\gamma) \odot f^{[i-1]}
   
   where :math:`\gamma` is the relaxation parameter.

   .. method:: __init__(n_steps, gamma, n_independent, n_shared, momentum, optimizer_params, scheduler_params, mask_type, lambda_sparse, seed)
      :noindex:
      
      Initialize TabNet classifier.
      
      **Parameters:**
      
      * **n_steps** (*int*) -- Number of decision steps
      * **gamma** (*float*) -- Relaxation parameter for feature selection
      * **n_independent** (*int*) -- Number of independent GLU layers per step
      * **n_shared** (*int*) -- Number of shared GLU layers
      * **momentum** (*float*) -- Momentum for batch normalization
      * **optimizer_params** (*dict*) -- Optimizer configuration
      * **scheduler_params** (*dict*) -- Learning rate scheduler parameters
      * **mask_type** (*str*) -- Type of attention mask ('sparsemax' or 'entmax')
      * **lambda_sparse** (*float*) -- Sparsity regularization coefficient
      * **seed** (*int*) -- Random seed

   .. method:: fit(X_train, y_train, eval_set, eval_name, eval_metric, max_epochs, patience, batch_size, virtual_batch_size, num_workers, drop_last, callbacks)
      :noindex:
      
      Train the TabNet model.
      
      **Training Process:**
      
      1. **Data Preprocessing:** Handle categorical encoding and normalization
      2. **Sequential Training:** Train each decision step sequentially
      3. **Attention Regularization:** Apply sparsity constraints on attention masks
      4. **Early Stopping:** Monitor validation metrics for convergence

   .. method:: predict_proba(X)
      :noindex:
      
      Make probability predictions for classification.
      
      **Parameters:**
      
      * **X** (*torch.Tensor or scipy.sparse matrix*) -- Input features
      
      **Returns:**
      
      * **np.ndarray** -- Class probabilities of shape (n_samples, n_classes)
      
      **Prediction Process:**
      
      1. **Forward Pass:** Process through all decision steps
      2. **Attention Aggregation:** Combine attention from all steps
      3. **Softmax Application:** Convert logits to probabilities
      
      .. math::
         
         P(y=k|x) = \frac{\exp(o_k)}{\sum_{j=1}^K \exp(o_j)}
      
      where :math:`o_k` is the raw output for class :math:`k`.

   .. method:: explain(X, normalize)
      :noindex:
      
      Generate feature importance explanations using attention masks.
      
      **Parameters:**
      
      * **X** (*torch.Tensor*) -- Input features
      * **normalize** (*bool*) -- Whether to normalize importance scores
      
      **Returns:**
      
      * **np.ndarray** -- Feature importance matrix
      
      **Explanation Generation:**
      
      Attention masks from each decision step provide interpretable feature importance:
      
      .. math::
         
         \text{importance}_{ij} = \frac{M^{[i]}_j}{\sum_{k=1}^{n_features} M^{[i]}_k}

.. class:: TabNetRegressor
   :noindex:

   TabNet for regression tasks with mean squared error optimization.

   .. method:: compute_loss(y_pred, y_true)
      :noindex:
      
      Compute mean squared error loss for regression.
      
      **MSE Loss Mathematical Definition:**
      
      .. math::
         
         \mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2

Tree-Based Neural Models
========================

GRANDE (Gradient-Boosted Neural Decision Ensembles)
---------------------------------------------------

.. automodule:: TALENT.model.models.grande
   :members:
   :undoc-members:
   :show-inheritance:

.. class:: GRANDE
   :noindex:

   Tree-mimic neural network using gradient descent for decision tree simulation.
   
   **Mathematical Formulation:**
   
   GRANDE simulates decision trees using neural operations with entmax for sparse selection.

   .. method:: __init__(batch_size, task_type, depth, n_estimators, dropout)
      :noindex:
      
      Initialize GRANDE model.
      
      **Parameters:**
      
      * **batch_size** (*int*) -- Training batch size
      * **task_type** (*str*) -- 'classification' or 'regression'
      * **depth** (*int*) -- Maximum tree depth
      * **n_estimators** (*int*) -- Number of tree estimators
      * **dropout** (*float*) -- Dropout probability

   .. method:: forward(inputs)
      :noindex:
      
      Forward pass through the GRANDE ensemble.
      
      **Parameters:**
      
      * **inputs** (*torch.Tensor*) -- Input features
      
      **Returns:**
      
      * **torch.Tensor** -- Ensemble predictions
      
      **Tree Simulation Mathematical Implementation:**
      
      1. **Split Decision Computation:**
         
         .. math::
            
            \text{node_result} = \frac{\text{softsign}(s_1 - s_2) + 1}{2}
         
         where :math:`s_1` are learned split thresholds and :math:`s_2` are feature values.
      
      2. **Path Probability Calculation:**
         
         .. math::
            
            p = \prod_{j} ((1-\text{path_id}_j) \cdot \text{node_result}_j + \text{path_id}_j \cdot (1-\text{node_result}_j))
      
      3. **Ensemble Output for Regression:**
         
         .. math::
            
            \text{output} = \sum_{e,l} w_e \cdot p_{e,l} \cdot v_{e,l}
         
         where :math:`w_e` are estimator weights, :math:`p_{e,l}` are leaf probabilities, and :math:`v_{e,l}` are leaf values.
      
      4. **Ensemble Output for Classification:**
         
         .. math::
            
            \text{output} = \sum_{e,l} w_e \cdot p_{e,l} \cdot \text{softmax}(v_{e,l})

   .. method:: get_representation(inputs)
      :noindex:
      
      Extract intermediate tree representations for analysis.
      
      **Returns:**
      
      * **torch.Tensor** -- Tree path representations

Neural Oblivious Decision Ensembles (NODE)
------------------------------------------

.. automodule:: TALENT.model.models.node
   :members:
   :undoc-members:
   :show-inheritance:

.. class:: Node
   :noindex:

   Neural implementation of oblivious decision trees with differentiable splits.

   .. method:: __init__(input_dim, layer_dim, output_dim, num_layers, tree_dim, depth, choice_function, bin_function)
      :noindex:
      
      Initialize NODE architecture.
      
      **Parameters:**
      
      * **input_dim** (*int*) -- Input feature dimension
      * **layer_dim** (*int*) -- Hidden layer dimension
      * **output_dim** (*int*) -- Output dimension
      * **num_layers** (*int*) -- Number of NODE layers
      * **tree_dim** (*int*) -- Number of trees per layer
      * **depth** (*int*) -- Tree depth
      * **choice_function** (*str*) -- Function for feature selection ('entmax15')
      * **bin_function** (*str*) -- Function for threshold selection ('entmoid15')

   .. method:: forward(x)
      :noindex:
      
      Forward pass through oblivious decision trees.
      
      **Decision Tree Mathematical Process:**
      
      1. **Feature Selection:** Use entmax for sparse feature selection
      2. **Threshold Comparison:** Compare features with learned thresholds
      3. **Path Aggregation:** Aggregate predictions along tree paths
      4. **Ensemble Combination:** Combine outputs from multiple trees

GrowNet (Gradient Boosting with Neural Networks)
------------------------------------------------

.. automodule:: TALENT.model.models.grownet
   :members:
   :undoc-members:
   :show-inheritance:

.. class:: GrowNet
   :noindex:

   Gradient boosting framework with neural network weak learners.

   .. method:: __init__(input_dim, output_dim, boost_rate, layers_per_net, layer_dims, dropout)
      :noindex:
      
      Initialize GrowNet with neural weak learners.
      
      **Gradient Boosting Process:**
      
      1. **Weak Learner Training:** Train neural networks on residuals
      2. **Boosting Update:** Add weak learners with adaptive weights
      3. **Gradient Computation:** Compute gradients for next weak learner

   .. method:: forward(x)
      :noindex:
      
      Forward pass through the boosted ensemble.
      
      **Boosting Mathematical Formulation:**
      
      .. math::
         
         F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)
      
      where :math:`h_m` is the m-th weak learner and :math:`\gamma_m` is the boosting rate.

Distance-Based Models
====================

Modern Neighborhood Component Analysis (ModernNCA)
-----------------------------------------

.. automodule:: TALENT.model.models.modernNCA
   :members:
   :undoc-members:
   :show-inheritance:

.. class:: ModernNCA
   :noindex:

   Neighborhood Component Analysis-inspired model for embedding-based predictions.
   
   **Mathematical Formulation:**
   
   ModernNCA learns embeddings for distance-based classification.

   .. method:: __init__(d_in, d_out, k, dropout, d_embedding)
      :noindex:
      
      Initialize ModernNCA model.
      
      **Parameters:**
      
      * **d_in** (*int*) -- Input feature dimension
      * **d_out** (*int*) -- Output dimension (number of classes)
      * **k** (*int*) -- Number of nearest neighbors to consider
      * **dropout** (*float*) -- Dropout probability
      * **d_embedding** (*int*) -- Embedding dimension

   .. method:: forward(x, y, candidate_x, candidate_y, is_train)
      :noindex:
      
      Forward pass with neighborhood analysis.
      
      **Parameters:**
      
      * **x** (*torch.Tensor*) -- Query features
      * **y** (*torch.Tensor*) -- Query labels
      * **candidate_x** (*torch.Tensor*) -- Candidate features for nearest neighbor search
      * **candidate_y** (*torch.Tensor*) -- Candidate labels
      * **is_train** (*bool*) -- Training mode flag
      
      **Returns:**
      
      * **torch.Tensor** -- Distance-based predictions
      
      **Distance-Based Prediction Mathematical Implementation:**
      
      1. **Embedding Computation:**
         
         .. math::
            
            e_i = f(x_i), \quad e_j = f(x_j)
         
         where :math:`f` is the learned embedding function.
      
      2. **Distance Computation:**
         
         .. math::
            
            d(x_i, x_j) = ||e_i - e_j||_2
      
      3. **Neighbor Weighting:**
         
         .. math::
            
            p_{ij} = \frac{\exp(-d(x_i, x_j))}{\sum_{k \neq i} \exp(-d(x_i, x_k))}
      
      4. **Final Prediction:**
         
         .. math::
            
            \hat{y}_i = \sum_j p_{ij} y_j

   .. method:: knn_prediction(x, candidate_x, candidate_y, k)
      :noindex:
      
      Make predictions using k-nearest neighbors in embedding space.
      
      **K-NN Process:**
      
      1. **Distance Calculation:** Compute distances in embedding space
      2. **Neighbor Selection:** Find k nearest neighbors
      3. **Prediction Aggregation:** Aggregate neighbor labels with distance weighting

Specialized Architectures
=========================

ExcelFormer (Semi-Permeable Attention)
--------------------------------------

.. automodule:: TALENT.model.models.excelformer
   :members:
   :undoc-members:
   :show-inheritance:

.. class:: ExcelFormer
   :noindex:

   Transformer with semi-permeable attention and mixup training capabilities.

   .. method:: __init__(d_numerical, d_token, n_blocks, attention_dropout, ffn_dropout, residual_dropout, d_out)
      :noindex:
      
      Initialize ExcelFormer architecture.
      
      **Parameters:**
      
      * **d_numerical** (*int*) -- Number of numerical features
      * **d_token** (*int*) -- Token embedding dimension
      * **n_blocks** (*int*) -- Number of transformer blocks
      * **attention_dropout** (*float*) -- Attention dropout probability
      * **ffn_dropout** (*float*) -- Feed-forward dropout probability
      * **residual_dropout** (*float*) -- Residual connection dropout
      * **d_out** (*int*) -- Output dimension

   .. method:: forward(x_num, x_cat, mix_up, beta, mtype)
      :noindex:
      
      Forward pass with optional mixup augmentation.
      
      **Parameters:**
      
      * **x_num** (*torch.Tensor*) -- Numerical features
      * **x_cat** (*torch.Tensor, optional*) -- Categorical features
      * **mix_up** (*bool*) -- Whether to apply mixup
      * **beta** (*float*) -- Mixup parameter (default: 0.5)
      * **mtype** (*str*) -- Mixup type ('feat_mix', 'hidden_mix', 'naive_mix')
      
      **Returns:**
      
      * **tuple** -- (output, feat_masks, shuffled_ids) for mixup training
      
      **Mixup Mathematical Implementation:**
      
      **Feature Mixup:**
      
      .. math::
         
         \tilde{x} = \lambda x_i + (1-\lambda) x_j
      
      **Semi-Permeable Attention:**
      
      .. math::
         
         \text{Attention}_{\text{perm}}(Q, K, V) = \text{mask} \odot \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

   .. method:: mixup_process(x, beta, mtype)
      :noindex:
      
      Apply mixup augmentation to input features.
      
      **Mixup Types:**
      
      * **feat_mix:** Feature-level mixing with learnable weights
      * **hidden_mix:** Hidden representation mixing
      * **naive_mix:** Simple linear interpolation

ProtoGate (Prototype-Based Gating)
----------------------------------

.. automodule:: TALENT.model.models.protogate
   :members:
   :undoc-members:
   :show-inheritance:

.. class:: ProtoGate
   :noindex:

   Prototype-based model with gating mechanisms for interpretable feature selection.

   .. method:: __init__(input_dim, output_dim, n_prototypes, n_components, dropout)
      :noindex:
      
      Initialize ProtoGate architecture.
      
      **Parameters:**
      
      * **input_dim** (*int*) -- Input feature dimension
      * **output_dim** (*int*) -- Output dimension
      * **n_prototypes** (*int*) -- Number of learned prototypes
      * **n_components** (*int*) -- Number of components per prototype
      * **dropout** (*float*) -- Dropout probability

   .. method:: forward(x)
      :noindex:
      
      Forward pass with prototype-based gating.
      
      **Prototype-Based Processing:**
      
      1. **Prototype Computation:** Learn representative prototypes from data
      2. **Distance Calculation:** Compute distances to prototypes
      3. **Gate Generation:** Use distances to generate feature gates
      4. **Feature Selection:** Apply gates for adaptive feature selection

.. class:: GatingNet
   :noindex:

   Gating network for prototype-based feature selection.

   .. method:: hard_sigmoid(x)
      :noindex:
      
      Hard sigmoid activation for efficient gating.
      
      **Hard Sigmoid Mathematical Definition:**
      
      .. math::
         
         \text{hard_sigmoid}(x) = \max(0, \min(1, \frac{x + 1}{2}))
      
      This provides a piecewise linear approximation to the sigmoid function for computational efficiency.

   .. method:: forward(x)
      :noindex:
      
      Generate gating weights for feature selection.

Retrieval-Based Models
=====================

TabR (Tabular Retrieval)
------------------------

.. automodule:: TALENT.model.models.tabr
   :members:
   :undoc-members:
   :show-inheritance:

.. class:: TabR
   :noindex:

   KNN-attention hybrid model with retrieval-based predictions.

   .. method:: __init__(n_num_features, n_cat_features, n_classes, context_size, normalization, num_embeddings, d_main, d_multiplier, encoder_n_blocks, predictor_n_blocks, mixer_normalization, dropout0, dropout1, normalization, activation)
      :noindex:
      
      Initialize TabR architecture.
      
      **Parameters:**
      
      * **n_num_features** (*int*) -- Number of numerical features
      * **n_cat_features** (*int*) -- Number of categorical features  
      * **n_classes** (*int*) -- Number of output classes
      * **context_size** (*int*) -- Maximum context size for retrieval
      * **normalization** (*str*) -- Normalization type
      * **num_embeddings** (*dict*) -- Embedding configurations
      * **d_main** (*int*) -- Main hidden dimension
      * **d_multiplier** (*int*) -- Dimension multiplier
      * **encoder_n_blocks** (*int*) -- Number of encoder blocks
      * **predictor_n_blocks** (*int*) -- Number of predictor blocks
      * **mixer_normalization** (*str*) -- Mixer normalization type
      * **dropout0** (*float*) -- Input dropout
      * **dropout1** (*float*) -- Hidden dropout
      * **activation** (*str*) -- Activation function

   .. method:: forward(x_num, x_cat, candidate_x_num, candidate_x_cat, candidate_y, context_size, is_train)
      :noindex:
      
      Forward pass with retrieval-based attention.
      
      **Retrieval Process:**
      
      1. **Context Selection:** Select relevant examples from training set
      2. **Attention Computation:** Apply attention over retrieved candidates
      3. **Feature Processing:** Process query and candidate features
      4. **Prediction Generation:** Combine retrieval and learned representations

Foundation Models
================

TabPFN (Tabular Prior-Fitting Networks)
---------------------------------------

.. automodule:: TALENT.model.models.tabpfn
   :members:
   :undoc-members:
   :show-inheritance:

.. class:: TabPFNClassifier
   :noindex:

   Prior-fitting network for zero-shot tabular classification.

   .. method:: __init__(device, base_path)
      :noindex:
      
      Initialize TabPFN with pre-trained weights.
      
      **Foundation Model Features:**
      
      * Pre-trained on diverse tabular datasets
      * No gradient-based training required
      * Immediate deployment capability
      * Context-based learning from examples

   .. method:: fit(X, y)
      :noindex:
      
      Fit the model using in-context learning (no parameter updates).
      
      **In-Context Learning Process:**
      
      1. **Context Setup:** Store training examples as context
      2. **No Weight Updates:** Model weights remain frozen
      3. **Context Encoding:** Encode training data for reference

   .. method:: predict_proba(X)
      :noindex:
      
      Make predictions using in-context learning.
      
      **Zero-Shot Prediction:**
      
      1. **Context Retrieval:** Use stored training context
      2. **Attention Mechanism:** Apply attention over training examples
      3. **Prediction Generation:** Generate predictions without fine-tuning

Regularization Methods
=====================

TANGOS Regularization
--------------------

.. automodule:: TALENT.model.models.tangos
   :members:
   :undoc-members:
   :show-inheritance:

.. class:: Tangos
   :noindex:

   MLP with TANGOS regularization for neuron specialization.
   
   **Mathematical Formulation:**
   
   TANGOS applies spatial and spectral regularization to encourage neuron specialization:
   
   .. math::
      
      \mathcal{L}_{\text{TANGOS}} = \mathcal{L}_{\text{task}} + \lambda_1 \mathcal{L}_{\text{spatial}} + \lambda_2 \mathcal{L}_{\text{spectral}}

   .. method:: __init__(d_in, d_out, d_layers, dropout, lambda1, lambda2)
      :noindex:
      
      Initialize TANGOS-regularized MLP.
      
      **Parameters:**
      
      * **d_in** (*int*) -- Input dimension
      * **d_out** (*int*) -- Output dimension
      * **d_layers** (*List[int]*) -- Hidden layer dimensions
      * **dropout** (*float*) -- Dropout probability
      * **lambda1** (*float*) -- Spatial regularization weight
      * **lambda2** (*float*) -- Spectral regularization weight

   .. method:: forward(x, x_cat=None)
      :noindex:
      
      Forward pass with standard MLP architecture.

   .. method:: cal_representation(x)
      :noindex:
      
      Calculate intermediate representations for regularization.
      
      **Parameters:**
      
      * **x** (*torch.Tensor*) -- Input features
      
      **Returns:**
      
      * **torch.Tensor** -- Hidden representations before final layer
      
      **Representation Extraction Process:**
      
      The method extracts intermediate representations by stopping before the final layer:
      
      .. code-block:: python
         
         for i, layer in enumerate(self.layers):
             x = layer(x)
             x = F.relu(x)
             if self.dropout and i != len(self.layers) - 1:
                 x = F.dropout(x, self.dropout, self.training)
         return x  # Return before final head layer
      
      **Regularization Applications:**
      
      * **Spatial Regularization:** Encourages spatial locality in neuron activations
      * **Spectral Regularization:** Promotes spectral diversity in learned representations

Activation Functions Reference
==============================

**Standard Activations:**

.. math::
   
   \text{ReLU}(x) = \max(0, x)

.. math::
   
   \text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]

.. math::
   
   \text{SELU}(x) = \lambda \begin{cases}
   x & \text{if } x > 0 \\
   \alpha(e^x - 1) & \text{if } x \leq 0
   \end{cases}

**Gated Activations:**

.. math::
   
   \text{ReGLU}(x) = a \cdot \text{ReLU}(b) \text{ where } [a, b] = \text{split}(x)

.. math::
   
   \text{GeGLU}(x) = a \cdot \text{GELU}(b) \text{ where } [a, b] = \text{split}(x)

**Probability Functions:**

.. math::
   
   \text{Softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^K \exp(x_j)}

.. math::
   
   \text{Sparsemax}(z) = \arg\min_{p \in \Delta^{K-1}} ||p - z||_2^2

where :math:`\Delta^{K-1}` is the probability simplex.

Model Usage Examples
===================

**Basic MLP Usage:**

.. code-block:: python
   
   from TALENT.model.models.mlp import MLP
   
   # Initialize MLP
   model = MLP(
       d_in=10,           # Input dimension
       d_out=3,           # Output dimension (3 classes)
       d_layers=[64, 32], # Hidden layer sizes
       dropout=0.1        # Dropout probability
   )
   
   # Forward pass
   x = torch.randn(32, 10)  # Batch of 32 samples, 10 features
   output = model(x)        # Shape: (32, 3)

**ResNet with Advanced Activations:**

.. code-block:: python
   
   from TALENT.model.models.resnet import ResNet
   
   # Initialize ResNet with GeGLU activation
   model = ResNet(
       d_in=15,
       d_out=1,                    # Regression task
       d=128,                      # Hidden dimension
       d_hidden_factor=2.0,        # Hidden expansion factor
       n_layers=4,                 # Number of residual blocks
       activation='geglu',         # GeGLU activation
       normalization='layernorm',  # Layer normalization
       hidden_dropout=0.1,
       residual_dropout=0.1
   )

**FT-Transformer with Mixed Features:**

.. code-block:: python
   
   from TALENT.model.models.ftt import Transformer
   
   # Initialize FT-Transformer
   model = Transformer(
       d_numerical=8,          # 8 numerical features
       categories=[5, 10, 3],  # 3 categorical features with cardinalities
       d_token=64,             # Token dimension
       n_layers=3,             # Number of transformer layers
       n_heads=8,              # Attention heads
       d_ffn_factor=2.0,       # FFN expansion factor
       attention_dropout=0.1,
       ffn_dropout=0.1,
       residual_dropout=0.1,
       activation='reglu',
       prenormalization=True,
       d_out=5                 # 5 classes
   )

**TabNet for Interpretable Classification:**

.. code-block:: python
   
   from TALENT.model.models.tabnet import TabNetClassifier
   
   # Initialize TabNet
   model = TabNetClassifier(
       n_steps=3,              # Decision steps
       gamma=1.3,              # Relaxation parameter
       n_independent=2,        # Independent GLU layers
       n_shared=2,             # Shared GLU layers
       momentum=0.02,          # Batch norm momentum
       lambda_sparse=1e-3      # Sparsity regularization
   )
   
   # Training
   model.fit(X_train, y_train, 
             eval_set=[(X_val, y_val)],
             max_epochs=100)
   
   # Get predictions and explanations
   predictions = model.predict_proba(X_test)
   explanations = model.explain(X_test, normalize=True)

**GRANDE for Tree-like Neural Networks:**

.. code-block:: python
   
   from TALENT.model.models.grande import GRANDE
   
   # Initialize GRANDE
   model = GRANDE(
       batch_size=64,
       task_type='classification',
       depth=4,              # Tree depth
       n_estimators=10,      # Number of trees
       dropout=0.1
   )

**ModernNCA with Distance-Based Learning:**

.. code-block:: python
   
   from TALENT.model.models.modernNCA import ModernNCA
   
   # Initialize ModernNCA
   model = ModernNCA(
       d_in=15,
       d_out=4,              # 4 classes
       k=32,                 # Number of neighbors
       dropout=0.1,
       d_embedding=64        # Embedding dimension
   )
   
   # Training requires candidate examples
   output = model(x, y, candidate_x, candidate_y, is_train=True)

**ExcelFormer with Mixup Training:**

.. code-block:: python
   
   from TALENT.model.models.excelformer import ExcelFormer
   
   # Initialize ExcelFormer
   model = ExcelFormer(
       d_numerical=10,
       d_token=64,
       n_blocks=3,
       attention_dropout=0.1,
       ffn_dropout=0.1,
       d_out=3
   )
   
   # Forward pass with feature mixup
   output, masks, shuffled_ids = model(
       x_num, 
       mix_up=True, 
       beta=0.5, 
       mtype='feat_mix'
   )

**TabPFN for Zero-Shot Learning:**

.. code-block:: python
   
   from TALENT.model.models.tabpfn import TabPFNClassifier
   
   # Initialize pre-trained TabPFN
   model = TabPFNClassifier(device='cuda')
   
   # No training required - just fit context
   model.fit(X_train, y_train)
   
   # Immediate predictions
   predictions = model.predict_proba(X_test)

Model Selection Guidelines
=========================

**For Beginners:**
- **MLP:** Simple, fast, good baseline
- **ResNet:** Better than MLP for deeper networks

**For Best Performance:**
- **FT-Transformer:** State-of-the-art on many datasets
- **TabNet:** Excellent performance with interpretability
- **ModernNCA:** Strong embedding-based performance

**For Interpretability:**
- **TabNet:** Attention-based feature importance
- **GRANDE:** Tree-like decision process
- **ProtoGate:** Prototype-based explanations

**For Speed:**
- **MLP:** Fastest training and inference
- **SNN:** Lightweight with self-normalization
- **TabPFN:** No training required

**For Specific Scenarios:**
- **TabR:** Retrieval-based learning
- **ExcelFormer:** Complex feature interactions with mixup
- **TANGOS:** When regularization is critical