Optimizers API
Torchium provides 65+ advanced optimizers organized into several categories, extending PyTorch’s native optimizer collection with state-of-the-art algorithms from recent research.
Second-Order Optimizers
These optimizers exploit second-order (curvature) information, which can reduce the number of iterations needed to converge at the cost of extra computation and memory per step.
LBFGS
- class torchium.optimizers.LBFGS(params, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-07, tolerance_change=1e-09, history_size=100, line_search_fn=None)[source]
Bases: Optimizer
Limited-memory Broyden-Fletcher-Goldfarb-Shanno optimizer.
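As with torch.optim.LBFGS, L-BFGS-style optimizers re-evaluate the objective several times per update, so step() is normally given a closure. The following is a minimal sketch assuming Torchium's LBFGS keeps the same closure-based contract as the PyTorch version; the model and data are illustrative:
import torch
import torch.nn as nn
import torchium

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)

optimizer = torchium.optimizers.LBFGS(model.parameters(), lr=1.0, history_size=100)

def closure():
    # assumption: step() calls this closure to re-evaluate loss and gradients
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    return loss

optimizer.step(closure)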
Shampoo
AdaHessian
- class torchium.optimizers.AdaHessian(params, lr=0.15, betas=(0.9, 0.999), eps=0.0001, weight_decay=0, hessian_power=1, update_each=1, n_samples=1, avg_conv_kernel=False)[source]
Bases: Optimizer
AdaHessian optimizer using second-order information.
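AdaHessian estimates the Hessian diagonal from the gradients, so most implementations require the backward pass to retain the graph before step() is called. A sketch under that assumption, with an illustrative model and data:
import torch
import torch.nn as nn
import torchium

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)

optimizer = torchium.optimizers.AdaHessian(model.parameters(), lr=0.15)

optimizer.zero_grad()
loss = criterion(model(x), y)
# keep the graph so the optimizer can take Hessian-vector products
# for its Hutchinson diagonal estimate (assumed requirement)
loss.backward(create_graph=True)
optimizer.step()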
KFAC
- class torchium.optimizers.KFAC(params, lr=0.001, momentum=0.9, weight_decay=0, damping=0.001, TCov=10, TInv=100, batch_averaged=True, eps=1e-08)[source]
Bases: Optimizer
K-FAC (Kronecker-Factored Approximate Curvature) optimizer.
NaturalGradient
Meta-Optimizers
Sharpness-Aware Minimization (SAM) family and gradient manipulation methods.
SAM
GSAM
ASAM
LookSAM
WSAM
GradientCentralization
PCGrad
GradNorm
Experimental Optimizers
Evolutionary and nature-inspired optimization algorithms.
CMA-ES
DifferentialEvolution
ParticleSwarmOptimization
QuantumAnnealing
GeneticAlgorithm
- class torchium.optimizers.GeneticAlgorithm(params, popsize=50, mutation_rate=0.1, crossover_rate=0.8, elite_ratio=0.1, seed=None, bounds=None)[source]
Bases: Optimizer
Genetic Algorithm optimizer.
Adaptive Optimizers
Adam Variants
- class torchium.optimizers.Adam(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, **kwargs)[source]
Bases: Adam
Adam optimizer with enhanced features.
- class torchium.optimizers.AdamW(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.01, amsgrad: bool = False, **kwargs)[source]
Bases: AdamW
AdamW optimizer with decoupled weight decay.
- class torchium.optimizers.RAdam(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, **kwargs)[source]
Bases: Optimizer
RAdam: Rectified Adam optimizer.
Reference: https://arxiv.org/abs/1908.03265
- class torchium.optimizers.AdaBelief(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, **kwargs)[source]
Bases: Optimizer
AdaBelief: Adapting Step-sizes by the Belief in Observed Gradients.
Reference: https://arxiv.org/abs/2010.07468
- class torchium.optimizers.AdaBound(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), final_lr: float = 0.1, gamma: float = 0.001, eps: float = 1e-08, weight_decay: float = 0, amsbound: bool = False, **kwargs)[source]
Bases: Optimizer
AdaBound: Adaptive Gradient Methods with Dynamic Bound of Learning Rate.
Reference: https://openreview.net/forum?id=Bkg3g2R9FX
- class torchium.optimizers.AdamP(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, delta: float = 0.1, wd_ratio: float = 0.1, nesterov: bool = False, **kwargs)[source]
Bases: Optimizer
AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights.
Reference: https://arxiv.org/abs/2006.08217
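All of the Adam variants above accept the familiar Adam constructor arguments, so they work as drop-in replacements in an ordinary training loop. A short sketch with an illustrative model and data:
import torch
import torch.nn as nn
import torchium

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)

# swap torch.optim.Adam for any variant without touching the loop
optimizer = torchium.optimizers.AdaBelief(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

for _ in range(10):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()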
Adagrad Variants
- class torchium.optimizers.Adagrad(params: List[Tensor] | Dict[str, Any], lr: float = 0.01, lr_decay: float = 0, weight_decay: float = 0, initial_accumulator_value: float = 0, eps: float = 1e-10, **kwargs)[source]
Bases: Adagrad
Adagrad optimizer with enhanced features.
- class torchium.optimizers.Adadelta(params: List[Tensor] | Dict[str, Any], lr: float = 1.0, rho: float = 0.9, eps: float = 1e-06, weight_decay: float = 0, **kwargs)[source]
Bases: Adadelta
Adadelta optimizer with enhanced features.
- class torchium.optimizers.AdaFactor(params: List[Tensor] | Dict[str, Any], lr: float | None = None, eps2: float = 1e-30, cliping_threshold: float = 1.0, decay_rate: float = -0.8, beta1: float | None = None, weight_decay: float = 0.0, scale_parameter: bool = True, relative_step: bool = True, **kwargs)[source]
Bases: Optimizer
AdaFactor: Adaptive Learning Rates with Sublinear Memory Cost.
Reference: https://arxiv.org/abs/1804.04235
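AdaFactor's sublinear memory cost comes from factoring the second-moment accumulator; with relative_step=True the step size is derived from the step count, so lr can stay at its default of None. A sketch of that configuration (model is illustrative):
import torch.nn as nn
import torchium

model = nn.Linear(1024, 1024)

# memory-efficient defaults: factored second moments, relative step size
optimizer = torchium.optimizers.AdaFactor(
    model.parameters(),
    lr=None,
    scale_parameter=True,
    relative_step=True,
)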
RMSprop Variants
- class torchium.optimizers.RMSprop(params: List[Tensor] | Dict[str, Any], lr: float = 0.01, alpha: float = 0.99, eps: float = 1e-08, weight_decay: float = 0, momentum: float = 0, centered: bool = False, **kwargs)[source]
Bases: RMSprop
RMSprop optimizer with enhanced features.
- class torchium.optimizers.Yogi(params: List[Tensor] | Dict[str, Any], lr: float = 0.01, betas: tuple = (0.9, 0.999), eps: float = 0.001, initial_accumulator: float = 1e-06, weight_decay: float = 0, **kwargs)[source]
Bases: Optimizer
Yogi: Adaptive Methods for Nonconvex Optimization.
Reference: https://papers.nips.cc/paper/8186-adaptive-methods-for-nonconvex-optimization
Momentum-Based Optimizers
SGD Variants
- class torchium.optimizers.SGD(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, momentum: float = 0, dampening: float = 0, weight_decay: float = 0, nesterov: bool = False, **kwargs)[source]
Bases: SGD
SGD optimizer with enhanced features.
- class torchium.optimizers.HeavyBall(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, momentum: float = 0.9, weight_decay: float = 0, **kwargs)[source]
Bases: Optimizer
Heavy Ball momentum optimizer.
Reference: Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods.
Specialized Optimizers
Computer Vision
- class torchium.optimizers.Ranger(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, alpha: float = 0.5, k: int = 6, n_sma_threshhold: int = 5, betas: tuple = (0.95, 0.999), eps: float = 1e-05, weight_decay: float = 0, **kwargs)[source]
Bases: Optimizer
Ranger: A synergistic optimizer combining RAdam and LookAhead.
Reference: https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer
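Ranger wraps RAdam in a LookAhead-style update: k sets how often the slow weights are synchronized and alpha the interpolation factor. A small vision-flavored sketch with an illustrative model:
import torch.nn as nn
import torchium

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
)

optimizer = torchium.optimizers.Ranger(model.parameters(), lr=1e-3, k=6, alpha=0.5)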
NLP Optimizers
- class torchium.optimizers.LAMB(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-06, weight_decay: float = 0.01, clamp_value: float = 10.0, **kwargs)[source]
Bases: Optimizer
LAMB: Large Batch Optimization for Deep Learning.
Reference: https://arxiv.org/abs/1904.00962
- class torchium.optimizers.NovoGrad(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.95, 0.98), eps: float = 1e-08, weight_decay: float = 0, grad_averaging: bool = True, amsgrad: bool = False, **kwargs)[source]
Bases: Optimizer
NovoGrad: Stochastic Gradient Methods with Layer-wise Adaptive Moments.
Reference: https://arxiv.org/abs/1905.11286
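LAMB rescales each layer's update by a trust ratio, which is what lets large-batch transformer training keep high learning rates. The sketch below pairs it with per-group weight decay so biases and normalization parameters are excluded, a common transformer convention; the model and naming heuristic are illustrative and assume LAMB accepts standard parameter groups:
import torch.nn as nn
import torchium

model = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=256, nhead=4), num_layers=2)

decay, no_decay = [], []
for name, param in model.named_parameters():
    # convention: skip weight decay for biases and normalization parameters
    (no_decay if name.endswith("bias") or "norm" in name else decay).append(param)

optimizer = torchium.optimizers.LAMB(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)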
Sparse Data
- class torchium.optimizers.SparseAdam(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, **kwargs)[source]
Bases: SparseAdam
SparseAdam optimizer with enhanced features.
- class torchium.optimizers.SM3(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, momentum: float = 0.0, eps: float = 1e-08, **kwargs)[source]
Bases: Optimizer
SM3: Memory-Efficient Adaptive Optimization.
Reference: https://arxiv.org/abs/1901.11150
- class torchium.optimizers.FTRL(params: List[Tensor] | Dict[str, Any], lr: float = 1.0, lr_power: float = -0.5, l1_regularization_strength: float = 0.0, l2_regularization_strength: float = 0.0, initial_accumulator_value: float = 0.1, **kwargs)[source]
Bases: Optimizer
FTRL: Follow The Regularized Leader optimizer.
Reference: https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf
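These optimizers target models whose gradients are sparse, most commonly large embedding tables. A sketch with SparseAdam, assuming it follows torch.optim.SparseAdam in expecting sparse gradients; the embedding size and loss are illustrative:
import torch
import torch.nn as nn
import torchium

# sparse=True makes the embedding emit sparse gradients
embedding = nn.Embedding(num_embeddings=100_000, embedding_dim=64, sparse=True)
optimizer = torchium.optimizers.SparseAdam(embedding.parameters(), lr=1e-3)

indices = torch.randint(0, 100_000, (32,))
loss = embedding(indices).pow(2).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()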
Distributed Training
- class torchium.optimizers.LARS(params: List[Tensor] | Dict[str, Any], lr: float = 1.0, momentum: float = 0.9, weight_decay: float = 0.0001, trust_coefficient: float = 0.001, eps: float = 1e-08, **kwargs)[source]
Bases: Optimizer
LARS: Layer-wise Adaptive Rate Scaling.
Reference: https://arxiv.org/abs/1708.03888
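LARS scales each layer's learning rate by a trust ratio between the parameter norm and the gradient norm, which is what keeps very large batch training stable. A sketch of a typical setup; the model is illustrative, and since LARS subclasses Optimizer a standard PyTorch scheduler can usually be layered on top:
import torch
import torch.nn as nn
import torchium

model = nn.Linear(2048, 1000)

optimizer = torchium.optimizers.LARS(
    model.parameters(),
    lr=1.0,                    # a large base lr is typical once trust ratios are applied
    momentum=0.9,
    weight_decay=1e-4,
    trust_coefficient=0.001,
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)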
General Purpose
- class torchium.optimizers.Lion(params: List[Tensor] | Dict[str, Any], lr: float = 0.0001, betas: tuple = (0.9, 0.99), weight_decay: float = 0.0, **kwargs)[source]
Bases: Optimizer
Lion: Symbolic Discovery of Optimization Algorithms.
Reference: https://arxiv.org/abs/2302.06675
- class torchium.optimizers.MADGRAD(params: List[Tensor] | Dict[str, Any], lr: float = 0.01, momentum: float = 0.9, weight_decay: float = 0, eps: float = 1e-06, **kwargs)[source]
Bases: Optimizer
MADGRAD: A Momentumized, Adaptive, Dual Averaged Gradient Method.
Reference: https://arxiv.org/abs/2101.11075
PyTorch Native Optimizers
For completeness, Torchium also includes all PyTorch native optimizers:
- class torchium.optimizers.NAdam(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.002, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, momentum_decay: float = 0.004, decoupled_weight_decay: bool = False, *, foreach: bool | None = None, maximize: bool = False, capturable: bool = False, differentiable: bool = False)[source]
Bases: Optimizer
Implements the NAdam algorithm.
\[\begin{split}\begin{aligned} &\rule{110mm}{0.4pt} \\ &\textbf{input} : \gamma_t \text{ (lr)}, \: \beta_1,\beta_2 \text{ (betas)}, \: \theta_0 \text{ (params)}, \: f(\theta) \text{ (objective)} \\ &\hspace{13mm} \: \lambda \text{ (weight decay)}, \:\psi \text{ (momentum decay)} \\ &\hspace{13mm} \: \textit{decoupled\_weight\_decay}, \:\textit{maximize} \\ &\textbf{initialize} : m_0 \leftarrow 0 \text{ ( first moment)}, v_0 \leftarrow 0 \text{ ( second moment)} \\[-1.ex] &\rule{110mm}{0.4pt} \\ &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\ &\hspace{5mm}\textbf{if} \: \textit{maximize}: \\ &\hspace{10mm}g_t \leftarrow -\nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm}\textbf{else} \\ &\hspace{10mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm} \theta_t \leftarrow \theta_{t-1} \\ &\hspace{5mm} \textbf{if} \: \lambda \neq 0 \\ &\hspace{10mm}\textbf{if} \: \textit{decoupled\_weight\_decay} \\ &\hspace{15mm} \theta_t \leftarrow \theta_{t-1} - \gamma \lambda \theta_{t-1} \\ &\hspace{10mm}\textbf{else} \\ &\hspace{15mm} g_t \leftarrow g_t + \lambda \theta_{t-1} \\ &\hspace{5mm} \mu_t \leftarrow \beta_1 \big(1 - \frac{1}{2} 0.96^{t \psi} \big) \\ &\hspace{5mm} \mu_{t+1} \leftarrow \beta_1 \big(1 - \frac{1}{2} 0.96^{(t+1)\psi}\big)\\ &\hspace{5mm}m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ &\hspace{5mm}v_t \leftarrow \beta_2 v_{t-1} + (1-\beta_2) g^2_t \\ &\hspace{5mm}\widehat{m_t} \leftarrow \mu_{t+1} m_t/(1-\prod_{i=1}^{t+1}\mu_i)\\[-1.ex] & \hspace{11mm} + (1-\mu_t) g_t /(1-\prod_{i=1}^{t} \mu_{i}) \\ &\hspace{5mm}\widehat{v_t} \leftarrow v_t/\big(1-\beta_2^t \big) \\ &\hspace{5mm}\theta_t \leftarrow \theta_t - \gamma \widehat{m_t}/ \big(\sqrt{\widehat{v_t}} + \epsilon \big) \\ &\rule{110mm}{0.4pt} \\[-1.ex] &\bf{return} \: \theta_t \\[-1.ex] &\rule{110mm}{0.4pt} \\[-1.ex] \end{aligned}\end{split}\]
For further details regarding the algorithm we refer to Incorporating Nesterov Momentum into Adam.
- Parameters:
params (iterable) – iterable of parameters or named_parameters to optimize or iterable of dicts defining parameter groups. When using named_parameters, all parameters in all groups should be named
lr (float, Tensor, optional) – learning rate (default: 2e-3)
betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
momentum_decay (float, optional) – momentum decay coefficient used in the NAdam update (default: 4e-3)
decoupled_weight_decay (bool, optional) – whether to decouple the weight decay as in AdamW to obtain NAdamW. If True, the algorithm does not accumulate weight decay in the momentum nor variance. (default: False)
foreach (bool, optional) – whether foreach implementation of optimizer is used. If unspecified by the user (so foreach is None), we will try to use foreach over the for-loop implementation on CUDA, since it is usually significantly more performant. Note that the foreach implementation uses ~ sizeof(params) more peak memory than the for-loop version due to the intermediates being a tensorlist vs just one tensor. If memory is prohibitive, batch fewer parameters through the optimizer at a time or switch this flag to False (default: None)
maximize (bool, optional) – maximize the objective with respect to the params, instead of minimizing (default: False)
capturable (bool, optional) – whether this instance is safe to capture in a graph, whether for CUDA graphs or for torch.compile support. Tensors are only capturable when on supported accelerators. Passing True can impair ungraphed performance, so if you don’t intend to graph capture this instance, leave it False (default: False)
differentiable (bool, optional) – whether autograd should occur through the optimizer step in training. Otherwise, the step() function runs in a torch.no_grad() context. Setting to True can impair performance, so leave it False if you don’t intend to run autograd through this instance (default: False)
- __init__(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.002, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, momentum_decay: float = 0.004, decoupled_weight_decay: bool = False, *, foreach: bool | None = None, maximize: bool = False, capturable: bool = False, differentiable: bool = False)[source]
- class torchium.optimizers.Rprop(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.01, etas: tuple[float, float] = (0.5, 1.2), step_sizes: tuple[float, float] = (1e-06, 50), *, capturable: bool = False, foreach: bool | None = None, maximize: bool = False, differentiable: bool = False)[source]
Bases: Optimizer
Implements the resilient backpropagation algorithm.
\[\begin{split}\begin{aligned} &\rule{110mm}{0.4pt} \\ &\textbf{input} : \theta_0 \in \mathbf{R}^d \text{ (params)},f(\theta) \text{ (objective)}, \\ &\hspace{13mm} \eta_{+/-} \text{ (etaplus, etaminus)}, \Gamma_{max/min} \text{ (step sizes)} \\ &\textbf{initialize} : g^0_{prev} \leftarrow 0, \: \eta_0 \leftarrow \text{lr (learning rate)} \\ &\rule{110mm}{0.4pt} \\ &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\ &\hspace{5mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm} \textbf{for} \text{ } i = 0, 1, \ldots, d-1 \: \mathbf{do} \\ &\hspace{10mm} \textbf{if} \: g^i_{prev} g^i_t > 0 \\ &\hspace{15mm} \eta^i_t \leftarrow \mathrm{min}(\eta^i_{t-1} \eta_{+}, \Gamma_{max}) \\ &\hspace{10mm} \textbf{else if} \: g^i_{prev} g^i_t < 0 \\ &\hspace{15mm} \eta^i_t \leftarrow \mathrm{max}(\eta^i_{t-1} \eta_{-}, \Gamma_{min}) \\ &\hspace{15mm} g^i_t \leftarrow 0 \\ &\hspace{10mm} \textbf{else} \: \\ &\hspace{15mm} \eta^i_t \leftarrow \eta^i_{t-1} \\ &\hspace{5mm}\theta_t \leftarrow \theta_{t-1}- \eta_t \mathrm{sign}(g_t) \\ &\hspace{5mm}g_{prev} \leftarrow g_t \\ &\rule{110mm}{0.4pt} \\[-1.ex] &\bf{return} \: \theta_t \\[-1.ex] &\rule{110mm}{0.4pt} \\[-1.ex] \end{aligned}\end{split}\]
For further details regarding the algorithm we refer to the paper A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm.
- Parameters:
params (iterable) – iterable of parameters or named_parameters to optimize or iterable of dicts defining parameter groups. When using named_parameters, all parameters in all groups should be named
lr (float, optional) – learning rate (default: 1e-2)
etas (Tuple[float, float], optional) – pair of (etaminus, etaplus), the multiplicative decrease and increase factors for the step size (default: (0.5, 1.2))
step_sizes (Tuple[float, float], optional) – a pair of minimal and maximal allowed step sizes (default: (1e-6, 50))
capturable (bool, optional) – whether this instance is safe to capture in a graph, whether for CUDA graphs or for torch.compile support. Tensors are only capturable when on supported accelerators. Passing True can impair ungraphed performance, so if you don’t intend to graph capture this instance, leave it False (default: False)
foreach (bool, optional) – whether foreach implementation of optimizer is used. If unspecified by the user (so foreach is None), we will try to use foreach over the for-loop implementation on CUDA, since it is usually significantly more performant. Note that the foreach implementation uses ~ sizeof(params) more peak memory than the for-loop version due to the intermediates being a tensorlist vs just one tensor. If memory is prohibitive, batch fewer parameters through the optimizer at a time or switch this flag to False (default: None)
maximize (bool, optional) – maximize the objective with respect to the params, instead of minimizing (default: False)
differentiable (bool, optional) – whether autograd should occur through the optimizer step in training. Otherwise, the step() function runs in a torch.no_grad() context. Setting to True can impair performance, so leave it False if you don’t intend to run autograd through this instance (default: False)
- __init__(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.01, etas: tuple[float, float] = (0.5, 1.2), step_sizes: tuple[float, float] = (1e-06, 50), *, capturable: bool = False, foreach: bool | None = None, maximize: bool = False, differentiable: bool = False)[source]
Usage Examples
Basic Usage
import torch
import torch.nn as nn
import torchium
model = nn.Linear(10, 1)
# Use SAM optimizer for better generalization
optimizer = torchium.optimizers.SAM(
model.parameters(),
lr=1e-3,
rho=0.05
)
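Note that SAM-family optimizers perform two forward/backward passes per update: an ascent step to the perturbed weights, then the actual descent step. The exact method names vary between implementations; the sketch below assumes the widely used first_step/second_step interface and should be adapted to Torchium's actual API:
criterion = nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)

# first pass: climb to the worst-case nearby weights
criterion(model(x), y).backward()
optimizer.first_step(zero_grad=True)   # assumed method name

# second pass: descend using gradients at the perturbed point
criterion(model(x), y).backward()
optimizer.second_step(zero_grad=True)  # assumed method name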
Advanced Usage
# Different learning rates for different layers
param_groups = [
{'params': model.features.parameters(), 'lr': 1e-4},
{'params': model.classifier.parameters(), 'lr': 1e-3}
]
optimizer = torchium.optimizers.Lion(param_groups)
Factory Functions
# Create optimizer by name
optimizer = torchium.create_optimizer(
'sam',
model.parameters(),
lr=1e-3
)
# List all available optimizers
available = torchium.get_available_optimizers()
print(f"Available optimizers: {len(available)}")
Performance Comparison
The recommendations in the selection guide below are based on our comprehensive benchmarks.
Optimizer Selection Guide
- For General Purpose Training:
SAM: Best generalization, flatter minima
AdaBelief: Stable, good for most tasks
Lion: Memory efficient, good performance
- For Computer Vision:
Ranger: Excellent for vision tasks
Lookahead: Good for large models
SAM: Better generalization
- For Natural Language Processing:
LAMB: Excellent for large batch training
NovoGrad: Good for transformer models
AdamW: Reliable baseline
- For Memory-Constrained Environments:
Lion: Lowest memory usage
SGD: Classic, minimal memory
HeavyBall: Good momentum alternative
- For Second-Order Optimization:
LBFGS: Fast convergence for well-conditioned problems
Shampoo: Excellent for large models
AdaHessian: Adaptive second-order method
- For Experimental/Research:
CMA-ES: Global optimization
DifferentialEvolution: Robust optimization
ParticleSwarmOptimization: Nature-inspired