Optimizers API
Torchium provides 65+ advanced optimizers organized into several categories, extending PyTorch’s native optimizer collection with state-of-the-art algorithms from recent research.
Second-Order Optimizers
These optimizers use curvature (second-order) information, which can converge in fewer iterations than first-order methods at a higher per-step cost.
LBFGS
- class torchium.optimizers.LBFGS(params, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-07, tolerance_change=1e-09, history_size=100, line_search_fn=None)[source]
Bases: Optimizer
Limited-memory Broyden-Fletcher-Goldfarb-Shanno optimizer.
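A minimal closure-based sketch, assuming torchium's LBFGS follows the same step(closure) contract as torch.optim.LBFGS (which the signature above mirrors):
import torch
import torch.nn as nn
import torchium
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)
optimizer = torchium.optimizers.LBFGS(model.parameters(), lr=1.0, max_iter=20)
def closure():
    # L-BFGS re-evaluates the objective several times per step,
    # so the loss/gradient computation lives in a closure.
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    return loss
optimizer.step(closure)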
Shampoo
AdaHessian
- class torchium.optimizers.AdaHessian(params, lr=0.15, betas=(0.9, 0.999), eps=0.0001, weight_decay=0, hessian_power=1, update_each=1, n_samples=1, avg_conv_kernel=False)[source]
Bases: Optimizer
AdaHessian optimizer using second-order information.
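A hedged sketch of one update step. The create_graph=True backward pass is an assumption carried over from other AdaHessian implementations, which need the autograd graph to sample Hessian-vector products; check the torchium implementation for its exact contract:
import torch
import torch.nn as nn
import torchium
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)
optimizer = torchium.optimizers.AdaHessian(model.parameters(), lr=0.15)
optimizer.zero_grad()
loss = criterion(model(x), y)
# Keep the graph so the optimizer can estimate Hessian information
# (assumed requirement, as in other AdaHessian implementations).
loss.backward(create_graph=True)
optimizer.step()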
KFAC
- class torchium.optimizers.KFAC(params, lr=0.001, momentum=0.9, weight_decay=0, damping=0.001, TCov=10, TInv=100, batch_averaged=True)[source]
Bases: Optimizer
K-FAC (Kronecker-Factored Approximate Curvature) optimizer.
NaturalGradient
Meta-Optimizers
Sharpness-Aware Minimization (SAM) family and gradient manipulation methods.
SAM
GSAM
ASAM
LookSAM
WSAM
GradientCentralization
PCGrad
GradNorm
Experimental Optimizers
Evolutionary and nature-inspired optimization algorithms.
CMA-ES
DifferentialEvolution
ParticleSwarmOptimization
QuantumAnnealing
GeneticAlgorithm
- class torchium.optimizers.GeneticAlgorithm(params, popsize=50, mutation_rate=0.1, crossover_rate=0.8, elite_ratio=0.1, seed=None, bounds=None)[source]
Bases: Optimizer
Genetic Algorithm optimizer.
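Evolutionary optimizers are gradient-free, so there is no backward pass. The sketch below assumes a closure-based step() that scores candidate parameter vectors by their loss; this interface is an assumption, so consult the torchium API for the exact contract:
import torch
import torch.nn as nn
import torchium
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)
optimizer = torchium.optimizers.GeneticAlgorithm(model.parameters(), popsize=50, mutation_rate=0.1)
def fitness():
    # No backward() call: candidates are scored by the loss alone.
    with torch.no_grad():
        return criterion(model(x), y)
for _ in range(10):
    optimizer.step(fitness)  # assumed closure-based interface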
Adaptive Optimizers
Adam Variants
- class torchium.optimizers.Adam(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, **kwargs)[source]
Bases: Adam
Adam optimizer with enhanced features.
- class torchium.optimizers.AdamW(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.01, amsgrad: bool = False, **kwargs)[source]
Bases: AdamW
AdamW optimizer with decoupled weight decay.
- class torchium.optimizers.RAdam(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, **kwargs)[source]
Bases: Optimizer
RAdam: Rectified Adam optimizer.
Reference: https://arxiv.org/abs/1908.03265
- class torchium.optimizers.AdaBelief(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, **kwargs)[source]
Bases: Optimizer
AdaBelief: Adapting Step-sizes by the Belief in Observed Gradients.
Reference: https://arxiv.org/abs/2010.07468
- class torchium.optimizers.AdaBound(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), final_lr: float = 0.1, gamma: float = 0.001, eps: float = 1e-08, weight_decay: float = 0, amsbound: bool = False, **kwargs)[source]
Bases: Optimizer
AdaBound: Adaptive Gradient Methods with Dynamic Bound of Learning Rate.
Reference: https://openreview.net/forum?id=Bkg3g2R9FX
- class torchium.optimizers.AdaHessian(params, lr=0.15, betas=(0.9, 0.999), eps=0.0001, weight_decay=0, hessian_power=1, update_each=1, n_samples=1, avg_conv_kernel=False)[source]
Bases: Optimizer
AdaHessian optimizer using second-order information.
- class torchium.optimizers.AdamP(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, delta: float = 0.1, wd_ratio: float = 0.1, nesterov: bool = False, **kwargs)[source]
Bases: Optimizer
AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights.
Reference: https://arxiv.org/abs/2006.08217
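All of the Adam variants above are intended as drop-in replacements for torch.optim.Adam; for example, a standard loop with AdaBelief (illustrative sketch, swap in RAdam, AdaBound, AdamP, etc. unchanged):
import torch
import torch.nn as nn
import torchium
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)
optimizer = torchium.optimizers.AdaBelief(model.parameters(), lr=1e-3, weight_decay=1e-4)
for _ in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()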
Adagrad Variants
- class torchium.optimizers.Adagrad(params: List[Tensor] | Dict[str, Any], lr: float = 0.01, lr_decay: float = 0, weight_decay: float = 0, initial_accumulator_value: float = 0, eps: float = 1e-10, **kwargs)[source]
Bases: Adagrad
Adagrad optimizer with enhanced features.
- class torchium.optimizers.Adadelta(params: List[Tensor] | Dict[str, Any], lr: float = 1.0, rho: float = 0.9, eps: float = 1e-06, weight_decay: float = 0, **kwargs)[source]
Bases: Adadelta
Adadelta optimizer with enhanced features.
- class torchium.optimizers.AdaFactor(params: List[Tensor] | Dict[str, Any], lr: float | None = None, eps2: float = 1e-30, cliping_threshold: float = 1.0, decay_rate: float = -0.8, beta1: float | None = None, weight_decay: float = 0.0, scale_parameter: bool = True, relative_step: bool = True, **kwargs)[source]
Bases: Optimizer
AdaFactor: Adaptive Learning Rates with Sublinear Memory Cost.
Reference: https://arxiv.org/abs/1804.04235
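AdaFactor is typically run without an explicit learning rate, letting relative_step derive the step size as in the paper; a hedged sketch using the defaults shown above:
import torch
import torch.nn as nn
import torchium
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)
# lr=None with relative_step=True lets AdaFactor schedule its own step size,
# which is the memory-frugal configuration from the paper.
optimizer = torchium.optimizers.AdaFactor(model.parameters(), lr=None, relative_step=True)
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()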
RMSprop Variants
- class torchium.optimizers.RMSprop(params: List[Tensor] | Dict[str, Any], lr: float = 0.01, alpha: float = 0.99, eps: float = 1e-08, weight_decay: float = 0, momentum: float = 0, centered: bool = False, **kwargs)[source]
Bases: RMSprop
RMSprop optimizer with enhanced features.
- class torchium.optimizers.Yogi(params: List[Tensor] | Dict[str, Any], lr: float = 0.01, betas: tuple = (0.9, 0.999), eps: float = 0.001, initial_accumulator: float = 1e-06, weight_decay: float = 0, **kwargs)[source]
Bases: Optimizer
Yogi: Adaptive Methods for Nonconvex Optimization.
Reference: https://papers.nips.cc/paper/8186-adaptive-methods-for-nonconvex-optimization
Momentum-Based Optimizers
SGD Variants
- class torchium.optimizers.SGD(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, momentum: float = 0, dampening: float = 0, weight_decay: float = 0, nesterov: bool = False, **kwargs)[source]
Bases: SGD
SGD optimizer with enhanced features.
- class torchium.optimizers.HeavyBall(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, momentum: float = 0.9, weight_decay: float = 0, **kwargs)[source]
Bases: Optimizer
Heavy Ball momentum optimizer.
Reference: Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods.
Specialized Optimizers
Computer Vision
- class torchium.optimizers.Ranger(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, alpha: float = 0.5, k: int = 6, n_sma_threshhold: int = 5, betas: tuple = (0.95, 0.999), eps: float = 1e-05, weight_decay: float = 0, **kwargs)[source]
Bases: Optimizer
Ranger: A synergistic optimizer combining RAdam and LookAhead.
Reference: https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer
NLP Optimizers
- class torchium.optimizers.LAMB(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-06, weight_decay: float = 0.01, clamp_value: float = 10.0, **kwargs)[source]
Bases: Optimizer
LAMB: Large Batch Optimization for Deep Learning.
Reference: https://arxiv.org/abs/1904.00962
- class torchium.optimizers.NovoGrad(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.95, 0.98), eps: float = 1e-08, weight_decay: float = 0, grad_averaging: bool = True, amsgrad: bool = False, **kwargs)[source]
Bases: Optimizer
NovoGrad: Stochastic Gradient Methods with Layer-wise Adaptive Moments.
Reference: https://arxiv.org/abs/1905.11286
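A common large-batch recipe keeps weight decay off biases and normalization parameters; the grouping below is illustrative (not part of the torchium API) and works with LAMB or NovoGrad alike:
import torch.nn as nn
import torchium
model = nn.TransformerEncoderLayer(d_model=64, nhead=4)
# Illustrative split: 1-D parameters (biases, norm scales) get no weight decay.
decay = [p for p in model.parameters() if p.ndim >= 2]
no_decay = [p for p in model.parameters() if p.ndim < 2]
optimizer = torchium.optimizers.LAMB(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)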
Sparse Data
- class torchium.optimizers.SparseAdam(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, **kwargs)[source]
Bases: SparseAdam
SparseAdam optimizer with enhanced features.
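SparseAdam expects sparse gradients, so it pairs naturally with embedding layers created with sparse=True; a short sketch:
import torch
import torch.nn as nn
import torchium
embedding = nn.Embedding(10000, 64, sparse=True)  # emits sparse gradients
optimizer = torchium.optimizers.SparseAdam(embedding.parameters(), lr=1e-3)
ids = torch.randint(0, 10000, (32, 20))
optimizer.zero_grad()
loss = embedding(ids).sum()
loss.backward()
optimizer.step()  # only the rows that received gradients are updated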
- class torchium.optimizers.SM3(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, momentum: float = 0.0, eps: float = 1e-08, **kwargs)[source]
Bases: Optimizer
SM3: Memory-Efficient Adaptive Optimization.
Reference: https://arxiv.org/abs/1901.11150
- class torchium.optimizers.FTRL(params: List[Tensor] | Dict[str, Any], lr: float = 1.0, lr_power: float = -0.5, l1_regularization_strength: float = 0.0, l2_regularization_strength: float = 0.0, initial_accumulator_value: float = 0.1, **kwargs)[source]
Bases: Optimizer
FTRL: Follow The Regularized Leader optimizer.
Reference: https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf
Distributed Training
- class torchium.optimizers.LARS(params: List[Tensor] | Dict[str, Any], lr: float = 1.0, momentum: float = 0.9, weight_decay: float = 0.0001, trust_coefficient: float = 0.001, eps: float = 1e-08, **kwargs)[source]
Bases: Optimizer
LARS: Layer-wise Adaptive Rate Scaling.
Reference: https://arxiv.org/abs/1708.03888
General Purpose
- class torchium.optimizers.Lion(params: List[Tensor] | Dict[str, Any], lr: float = 0.0001, betas: tuple = (0.9, 0.99), weight_decay: float = 0.0, **kwargs)[source]
Bases: Optimizer
Lion: Symbolic Discovery of Optimization Algorithms.
Reference: https://arxiv.org/abs/2302.06675
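As a rule of thumb from the Lion paper (not a torchium requirement), use a learning rate several times smaller than you would for AdamW and a correspondingly larger weight decay:
import torch
import torch.nn as nn
import torchium
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)
# Smaller lr, larger weight decay than AdamW (heuristic from the paper).
optimizer = torchium.optimizers.Lion(model.parameters(), lr=1e-4, weight_decay=0.1)
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()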
- class torchium.optimizers.MADGRAD(params: List[Tensor] | Dict[str, Any], lr: float = 0.01, momentum: float = 0.9, weight_decay: float = 0, eps: float = 1e-06, **kwargs)[source]
Bases: Optimizer
MADGRAD: A Momentumized, Adaptive, Dual Averaged Gradient Method.
Reference: https://arxiv.org/abs/2101.11075
PyTorch Native Optimizers
For completeness, Torchium also includes all PyTorch native optimizers:
- class torchium.optimizers.NAdam(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.002, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, momentum_decay: float = 0.004, decoupled_weight_decay: bool = False, *, foreach: bool | None = None, maximize: bool = False, capturable: bool = False, differentiable: bool = False)[source]
Bases: Optimizer
Implements the NAdam algorithm.
\[\begin{split}\begin{aligned} &\rule{110mm}{0.4pt} \\ &\textbf{input} : \gamma_t \text{ (lr)}, \: \beta_1,\beta_2 \text{ (betas)}, \: \theta_0 \text{ (params)}, \: f(\theta) \text{ (objective)} \\ &\hspace{13mm} \: \lambda \text{ (weight decay)}, \:\psi \text{ (momentum decay)} \\ &\hspace{13mm} \: \textit{decoupled\_weight\_decay}, \:\textit{maximize} \\ &\textbf{initialize} : m_0 \leftarrow 0 \text{ (first moment)}, v_0 \leftarrow 0 \text{ (second moment)} \\[-1.ex] &\rule{110mm}{0.4pt} \\ &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\ &\hspace{5mm}\textbf{if} \: \textit{maximize}: \\ &\hspace{10mm}g_t \leftarrow -\nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm}\textbf{else} \\ &\hspace{10mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm} \theta_t \leftarrow \theta_{t-1} \\ &\hspace{5mm} \textbf{if} \: \lambda \neq 0 \\ &\hspace{10mm}\textbf{if} \: \textit{decoupled\_weight\_decay} \\ &\hspace{15mm} \theta_t \leftarrow \theta_{t-1} - \gamma \lambda \theta_{t-1} \\ &\hspace{10mm}\textbf{else} \\ &\hspace{15mm} g_t \leftarrow g_t + \lambda \theta_{t-1} \\ &\hspace{5mm} \mu_t \leftarrow \beta_1 \big(1 - \frac{1}{2} 0.96^{t \psi} \big) \\ &\hspace{5mm} \mu_{t+1} \leftarrow \beta_1 \big(1 - \frac{1}{2} 0.96^{(t+1)\psi}\big)\\ &\hspace{5mm}m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ &\hspace{5mm}v_t \leftarrow \beta_2 v_{t-1} + (1-\beta_2) g^2_t \\ &\hspace{5mm}\widehat{m_t} \leftarrow \mu_{t+1} m_t/(1-\prod_{i=1}^{t+1}\mu_i)\\[-1.ex] & \hspace{11mm} + (1-\mu_t) g_t /(1-\prod_{i=1}^{t} \mu_{i}) \\ &\hspace{5mm}\widehat{v_t} \leftarrow v_t/\big(1-\beta_2^t \big) \\ &\hspace{5mm}\theta_t \leftarrow \theta_t - \gamma \widehat{m_t}/ \big(\sqrt{\widehat{v_t}} + \epsilon \big) \\ &\rule{110mm}{0.4pt} \\[-1.ex] &\bf{return} \: \theta_t \\[-1.ex] &\rule{110mm}{0.4pt} \\[-1.ex] \end{aligned}\end{split}\]
For further details regarding the algorithm we refer to Incorporating Nesterov Momentum into Adam.
- Parameters:
params (iterable) – iterable of parameters or named_parameters to optimize or iterable of dicts defining parameter groups. When using named_parameters, all parameters in all groups should be named
lr (float, Tensor, optional) – learning rate (default: 2e-3)
betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
momentum_decay (float, optional) – momentum decay coefficient (default: 4e-3)
decoupled_weight_decay (bool, optional) – whether to decouple the weight decay as in AdamW to obtain NAdamW. If True, the algorithm does not accumulate weight decay in the momentum nor variance. (default: False)
foreach (bool, optional) – whether foreach implementation of optimizer is used. If unspecified by the user (so foreach is None), we will try to use foreach over the for-loop implementation on CUDA, since it is usually significantly more performant. Note that the foreach implementation uses ~ sizeof(params) more peak memory than the for-loop version due to the intermediates being a tensorlist vs just one tensor. If memory is prohibitive, batch fewer parameters through the optimizer at a time or switch this flag to False (default: None)
maximize (bool, optional) – maximize the objective with respect to the params, instead of minimizing (default: False)
capturable (bool, optional) – whether this instance is safe to capture in a graph, whether for CUDA graphs or for torch.compile support. Tensors are only capturable when on supported accelerators. Passing True can impair ungraphed performance, so if you don’t intend to graph capture this instance, leave it False (default: False)
differentiable (bool, optional) – whether autograd should occur through the optimizer step in training. Otherwise, the step() function runs in a torch.no_grad() context. Setting to True can impair performance, so leave it False if you don’t intend to run autograd through this instance (default: False)
- __init__(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.002, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, momentum_decay: float = 0.004, decoupled_weight_decay: bool = False, *, foreach: bool | None = None, maximize: bool = False, capturable: bool = False, differentiable: bool = False)[source]
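Since this class mirrors torch.optim.NAdam, the usual loop applies; setting decoupled_weight_decay=True gives the NAdamW-style behaviour described above:
import torch
import torch.nn as nn
import torchium
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)
optimizer = torchium.optimizers.NAdam(model.parameters(), lr=2e-3, weight_decay=0.01, decoupled_weight_decay=True)
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()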
- class torchium.optimizers.Rprop(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.01, etas: tuple[float, float] = (0.5, 1.2), step_sizes: tuple[float, float] = (1e-06, 50), *, capturable: bool = False, foreach: bool | None = None, maximize: bool = False, differentiable: bool = False)[source]
Bases: Optimizer
Implements the resilient backpropagation (Rprop) algorithm.
\[\begin{split}\begin{aligned} &\rule{110mm}{0.4pt} \\ &\textbf{input} : \theta_0 \in \mathbf{R}^d \text{ (params)},f(\theta) \text{ (objective)}, \\ &\hspace{13mm} \eta_{+/-} \text{ (etaplus, etaminus)}, \Gamma_{max/min} \text{ (step sizes)} \\ &\textbf{initialize} : g^0_{prev} \leftarrow 0, \: \eta_0 \leftarrow \text{lr (learning rate)} \\ &\rule{110mm}{0.4pt} \\ &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\ &\hspace{5mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm} \textbf{for} \text{ } i = 0, 1, \ldots, d-1 \: \mathbf{do} \\ &\hspace{10mm} \textbf{if} \: g^i_{prev} g^i_t > 0 \\ &\hspace{15mm} \eta^i_t \leftarrow \mathrm{min}(\eta^i_{t-1} \eta_{+}, \Gamma_{max}) \\ &\hspace{10mm} \textbf{else if} \: g^i_{prev} g^i_t < 0 \\ &\hspace{15mm} \eta^i_t \leftarrow \mathrm{max}(\eta^i_{t-1} \eta_{-}, \Gamma_{min}) \\ &\hspace{15mm} g^i_t \leftarrow 0 \\ &\hspace{10mm} \textbf{else} \: \\ &\hspace{15mm} \eta^i_t \leftarrow \eta^i_{t-1} \\ &\hspace{5mm}\theta_t \leftarrow \theta_{t-1}- \eta_t \mathrm{sign}(g_t) \\ &\hspace{5mm}g_{prev} \leftarrow g_t \\ &\rule{110mm}{0.4pt} \\[-1.ex] &\bf{return} \: \theta_t \\[-1.ex] &\rule{110mm}{0.4pt} \\[-1.ex] \end{aligned}\end{split}\]
For further details regarding the algorithm we refer to the paper A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm.
- Parameters:
params (iterable) – iterable of parameters or named_parameters to optimize or iterable of dicts defining parameter groups. When using named_parameters, all parameters in all groups should be named
lr (float, optional) – learning rate (default: 1e-2)
etas (Tuple[float, float], optional) – pair of (etaminus, etaplus), the multiplicative decrease and increase factors (default: (0.5, 1.2))
step_sizes (Tuple[float, float], optional) – a pair of minimal and maximal allowed step sizes (default: (1e-6, 50))
capturable (bool, optional) – whether this instance is safe to capture in a graph, whether for CUDA graphs or for torch.compile support. Tensors are only capturable when on supported accelerators. Passing True can impair ungraphed performance, so if you don’t intend to graph capture this instance, leave it False (default: False)
foreach (bool, optional) – whether foreach implementation of optimizer is used. If unspecified by the user (so foreach is None), we will try to use foreach over the for-loop implementation on CUDA, since it is usually significantly more performant. Note that the foreach implementation uses ~ sizeof(params) more peak memory than the for-loop version due to the intermediates being a tensorlist vs just one tensor. If memory is prohibitive, batch fewer parameters through the optimizer at a time or switch this flag to False (default: None)
maximize (bool, optional) – maximize the objective with respect to the params, instead of minimizing (default: False)
differentiable (bool, optional) – whether autograd should occur through the optimizer step in training. Otherwise, the step() function runs in a torch.no_grad() context. Setting to True can impair performance, so leave it False if you don’t intend to run autograd through this instance (default: False)
- __init__(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.01, etas: tuple[float, float] = (0.5, 1.2), step_sizes: tuple[float, float] = (1e-06, 50), *, capturable: bool = False, foreach: bool | None = None, maximize: bool = False, differentiable: bool = False)[source]
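Rprop adapts per-parameter step sizes from the sign of successive gradients, so it is meant for full-batch gradients rather than noisy minibatches; a small full-batch sketch:
import torch
import torch.nn as nn
import torchium
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
x, y = torch.randn(256, 10), torch.randn(256, 1)  # the full dataset, not a minibatch
optimizer = torchium.optimizers.Rprop(model.parameters(), lr=0.01, etas=(0.5, 1.2))
for _ in range(50):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()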
Usage Examples
Basic Usage
import torch
import torch.nn as nn
import torchium
model = nn.Linear(10, 1)
# Use SAM optimizer for better generalization
optimizer = torchium.optimizers.SAM(
model.parameters(),
lr=1e-3,
rho=0.05
)
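SAM takes two forward/backward passes per update: one to perturb the weights toward a local worst case, one to descend from there. Continuing the example above, and assuming the widely used first_step/second_step interface (adapt if torchium's SAM exposes a closure-based step() instead):
criterion = nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)
# First pass: perturb the weights toward the locally worst case.
criterion(model(x), y).backward()
optimizer.first_step(zero_grad=True)  # assumed API, common across SAM implementations
# Second pass: take the actual descent step from the perturbed point.
criterion(model(x), y).backward()
optimizer.second_step(zero_grad=True)  # assumed API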
Advanced Usage
# Different learning rates for different layers
param_groups = [
{'params': model.features.parameters(), 'lr': 1e-4},
{'params': model.classifier.parameters(), 'lr': 1e-3}
]
optimizer = torchium.optimizers.Lion(param_groups)
Factory Functions
# Create optimizer by name
optimizer = torchium.create_optimizer(
'sam',
model.parameters(),
lr=1e-3
)
# List all available optimizers
available = torchium.get_available_optimizers()
print(f"Available optimizers: {len(available)}")
Optimizer Selection Guide
The recommendations below are based on our comprehensive benchmarks:
- For General Purpose Training:
SAM: Best generalization, flatter minima
AdaBelief: Stable, good for most tasks
Lion: Memory efficient, good performance
- For Computer Vision:
Ranger: Excellent for vision tasks
Lookahead: Good for large models
SAM: Better generalization
- For Natural Language Processing:
LAMB: Excellent for large batch training
NovoGrad: Good for transformer models
AdamW: Reliable baseline
- For Memory-Constrained Environments:
Lion: Lowest memory usage
SGD: Classic, minimal memory
HeavyBall: Good momentum alternative
- For Second-Order Optimization:
LBFGS: Fast convergence for well-conditioned problems
Shampoo: Excellent for large models
AdaHessian: Adaptive second-order method
- For Experimental/Research:
CMA-ES: Global optimization
DifferentialEvolution: Robust optimization
ParticleSwarmOptimization: Nature-inspired