Optimizers API

Torchium provides 65+ advanced optimizers organized into several categories, extending PyTorch’s native optimizer collection with state-of-the-art algorithms from recent research.

Second-Order Optimizers

These optimizers exploit curvature (second-order) information to converge in fewer iterations, typically at a higher per-step cost.

LBFGS

class torchium.optimizers.LBFGS(params, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-07, tolerance_change=1e-09, history_size=100, line_search_fn=None)[source]

Bases: Optimizer

Limited-memory Broyden-Fletcher-Goldfarb-Shanno optimizer

__init__(params, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-07, tolerance_change=1e-09, history_size=100, line_search_fn=None)[source]
step(closure)[source]

Perform a single optimization step
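
Because step takes a mandatory closure (L-BFGS may evaluate the objective several times per update), a minimal usage sketch looks like the following; the data tensors and MSE objective are placeholders:

import torch
import torch.nn as nn
import torchium

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)

optimizer = torchium.optimizers.LBFGS(model.parameters(), lr=1.0, history_size=100)

def closure():
    # The closure recomputes the forward/backward pass; L-BFGS may call it
    # more than once per step (e.g. during line search).
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    return loss

loss = optimizer.step(closure)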

Shampoo

class torchium.optimizers.Shampoo(params, lr=0.03, eps=0.0001, update_freq=100, weight_decay=0)[source]

Bases: Optimizer

Shampoo optimizer for deep learning

__init__(params, lr=0.03, eps=0.0001, update_freq=100, weight_decay=0)[source]
step(closure=None)[source]

Perform a single optimization step to update the parameters.

Parameters:

closure (Callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

AdaHessian

class torchium.optimizers.AdaHessian(params, lr=0.15, betas=(0.9, 0.999), eps=0.0001, weight_decay=0, hessian_power=1, update_each=1, n_samples=1, avg_conv_kernel=False)[source]

Bases: Optimizer

AdaHessian optimizer using second-order information

__init__(params, lr=0.15, betas=(0.9, 0.999), eps=0.0001, weight_decay=0, hessian_power=1, update_each=1, n_samples=1, avg_conv_kernel=False)[source]
get_trace(gradsH)[source]

Compute the trace of the Hessian.

step(closure=None)[source]

Perform a single optimization step to update the parameters.

Parameters:

closure (Callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.
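
AdaHessian estimates curvature from Hessian-vector products, which in typical implementations requires gradients created with create_graph=True; the exact requirement depends on the implementation, so treat the sketch below as an assumption-laden example:

import torch
import torch.nn as nn
import torchium

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer = torchium.optimizers.AdaHessian(model.parameters(), lr=0.15)

optimizer.zero_grad()
loss = criterion(model(x), y)
# create_graph=True keeps the autograd graph so Hessian-vector products
# can be formed inside the optimizer (assumed requirement).
loss.backward(create_graph=True)
optimizer.step()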

KFAC

class torchium.optimizers.KFAC(params, lr=0.001, momentum=0.9, weight_decay=0, damping=0.001, TCov=10, TInv=100, batch_averaged=True)[source]

Bases: Optimizer

K-FAC (Kronecker-Factored Approximate Curvature) optimizer

__init__(params, lr=0.001, momentum=0.9, weight_decay=0, damping=0.001, TCov=10, TInv=100, batch_averaged=True)[source]
step(closure=None)[source]

Perform a single optimization step

NaturalGradient

class torchium.optimizers.NaturalGradient(params, lr=0.01, alpha=0.95, eps=0.0001)[source]

Bases: Optimizer

Natural Gradient optimizer

__init__(params, lr=0.01, alpha=0.95, eps=0.0001)[source]
step(closure=None)[source]

Perform a single optimization step to update the parameters.

Parameters:

closure (Callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

Meta-Optimizers

Sharpness-Aware Minimization (SAM) family and gradient manipulation methods.

SAM

class torchium.optimizers.SAM(params, base_optimizer, rho=0.05, adaptive=False, **kwargs)[source]

Bases: Optimizer

Sharpness-Aware Minimization optimizer

__init__(params, base_optimizer, rho=0.05, adaptive=False, **kwargs)[source]
first_step(zero_grad=False)[source]

First step: compute adversarial parameters

second_step(zero_grad=False)[source]

Second step: update parameters using base optimizer

step(closure=None)[source]

Combined step function
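
The explicit two-step pattern below perturbs the weights with first_step, re-evaluates the loss at the perturbed point, and applies the descent update with second_step. Passing base_optimizer as an optimizer class whose keyword arguments (e.g. lr) are forwarded is an assumption, following common SAM implementations:

import torch
import torch.nn as nn
import torchium

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)

# Assumes base_optimizer is an optimizer class and extra kwargs (lr) are
# forwarded to it.
optimizer = torchium.optimizers.SAM(model.parameters(), base_optimizer=torch.optim.SGD, lr=0.1, rho=0.05)

loss = criterion(model(x), y)
loss.backward()
optimizer.first_step(zero_grad=True)   # ascend to the perturbed weights

criterion(model(x), y).backward()      # loss at the perturbed weights
optimizer.second_step(zero_grad=True)  # descend with the base optimizer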

GSAM

class torchium.optimizers.GSAM(params, base_optimizer, rho=0.05, alpha=0.4, adaptive=False, **kwargs)[source]

Bases: Optimizer

Gradient-based Sharpness-Aware Minimization

__init__(params, base_optimizer, rho=0.05, alpha=0.4, adaptive=False, **kwargs)[source]
first_step(zero_grad=False)[source]
second_step(zero_grad=False)[source]
step(closure=None)[source]

Perform a single optimization step to update the parameters.

Parameters:

closure (Callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

ASAM

class torchium.optimizers.ASAM(params, base_optimizer, rho=0.5, eta=0.01, adaptive=True, **kwargs)[source]

Bases: Optimizer

Adaptive Sharpness-Aware Minimization

__init__(params, base_optimizer, rho=0.5, eta=0.01, adaptive=True, **kwargs)[source]
first_step(zero_grad=False)[source]
second_step(zero_grad=False)[source]
step(closure=None)[source]

Perform a single optimization step to update the parameters.

Parameters:

closure (Callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

LookSAM

class torchium.optimizers.LookSAM(params, base_optimizer, k=5, alpha=0.5, rho=0.05, **kwargs)[source]

Bases: Optimizer

Look-ahead SAM optimizer

__init__(params, base_optimizer, k=5, alpha=0.5, rho=0.05, **kwargs)[source]
lookahead_step()[source]

Perform lookahead update

step(closure=None)[source]

Perform a single optimization step to update the parameters.

Parameters:

closure (Callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

WSAM

class torchium.optimizers.WSAM(params, base_optimizer, rho=0.05, tau=1.0, **kwargs)[source]

Bases: Optimizer

Weighted Sharpness-Aware Minimization

__init__(params, base_optimizer, rho=0.05, tau=1.0, **kwargs)[source]
first_step(zero_grad=False)[source]
second_step(zero_grad=False)[source]
step(closure=None)[source]

Perform a single optimization step to update the parameters.

Parameters:

closure (Callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

GradientCentralization

class torchium.optimizers.GradientCentralization(params, base_optimizer, use_gc=True, gc_conv_only=False, **kwargs)[source]

Bases: Optimizer

Gradient Centralization wrapper

__init__(params, base_optimizer, use_gc=True, gc_conv_only=False, **kwargs)[source]
centralize_gradient(p)[source]

Apply gradient centralization

step(closure=None)[source]

Perform a single optimization step to update the parameters.

Parameters:

closure (Callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.
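
GradientCentralization wraps a base optimizer and subtracts the per-filter (or per-row) mean from each gradient before the wrapped update is applied. The wrapping convention is not spelled out above, so the sketch assumes the wrapper accepts an optimizer class plus its keyword arguments:

import torch
import torch.nn as nn
import torchium

model = nn.Conv2d(3, 16, kernel_size=3)

# Assumed convention: base_optimizer is a class and extra kwargs (lr) are
# forwarded to it; gc_conv_only=True restricts centralization to conv weights.
optimizer = torchium.optimizers.GradientCentralization(
    model.parameters(),
    base_optimizer=torch.optim.SGD,
    use_gc=True,
    gc_conv_only=True,
    lr=0.1,
)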

PCGrad

class torchium.optimizers.PCGrad(params, base_optimizer, num_tasks, **kwargs)[source]

Bases: Optimizer

Projecting Conflicting Gradients

__init__(params, base_optimizer, num_tasks, **kwargs)[source]
project_conflicting_gradients(grads)[source]

Project conflicting gradients

step(closure=None, task_losses=None)[source]
Parameters:
  • closure – closure function

  • task_losses – list of task-specific loss functions
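
The expected form of task_losses (callables returning each task loss vs. precomputed loss tensors) is implementation-specific; the hypothetical sketch below follows the parameter description above and passes one callable per task:

import torch
import torch.nn as nn
import torchium

shared = nn.Linear(10, 16)
head_a, head_b = nn.Linear(16, 1), nn.Linear(16, 1)
params = list(shared.parameters()) + list(head_a.parameters()) + list(head_b.parameters())
x = torch.randn(32, 10)
y_a, y_b = torch.randn(32, 1), torch.randn(32, 1)

# Assumes base_optimizer is an optimizer class and task_losses takes one
# callable per task, matching the parameter description above.
optimizer = torchium.optimizers.PCGrad(params, base_optimizer=torch.optim.Adam, num_tasks=2, lr=1e-3)

task_losses = [
    lambda: nn.functional.mse_loss(head_a(shared(x)), y_a),
    lambda: nn.functional.mse_loss(head_b(shared(x)), y_b),
]
optimizer.step(task_losses=task_losses)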

GradNorm

class torchium.optimizers.GradNorm(params, base_optimizer, num_tasks, alpha=1.5, **kwargs)[source]

Bases: Optimizer

Gradient Normalization for multi-task learning

__init__(params, base_optimizer, num_tasks, alpha=1.5, **kwargs)[source]
step(closure=None, task_losses=None, task_grads=None)[source]
Parameters:
  • closure – closure function

  • task_losses – tensor of task losses

  • task_grads – list of task gradient norms

Experimental Optimizers

Evolutionary and nature-inspired optimization algorithms.

CMA-ES

class torchium.optimizers.CMAES(params, sigma=0.1, popsize=None, seed=None)[source]

Bases: Optimizer

Covariance Matrix Adaptation Evolution Strategy

__init__(params, sigma=0.1, popsize=None, seed=None)[source]
step(closure)[source]

Perform one generation of CMA-ES
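
CMA-ES is gradient-free: each step evaluates a population of candidate parameter vectors, so the closure only needs to return the scalar objective and no backward pass is required. A minimal sketch with placeholder data:

import torch
import torch.nn as nn
import torchium

model = nn.Linear(4, 1)
criterion = nn.MSELoss()
x, y = torch.randn(64, 4), torch.randn(64, 1)

optimizer = torchium.optimizers.CMAES(model.parameters(), sigma=0.1, popsize=16)

def closure():
    # No backward() needed: CMA-ES only uses the objective value.
    with torch.no_grad():
        return criterion(model(x), y)

for _ in range(100):
    optimizer.step(closure)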

DifferentialEvolution

class torchium.optimizers.DifferentialEvolution(params, popsize=50, mutation=0.8, recombination=0.7, seed=None, bounds=None)[source]

Bases: Optimizer

Differential Evolution optimizer

__init__(params, popsize=50, mutation=0.8, recombination=0.7, seed=None, bounds=None)[source]
step(closure)[source]

Perform a single optimization step to update the parameters.

Parameters:

closure (Callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

ParticleSwarmOptimization

class torchium.optimizers.ParticleSwarmOptimization(params, popsize=50, inertia=0.9, cognitive=2.0, social=2.0, seed=None, bounds=None)[source]

Bases: Optimizer

Particle Swarm Optimization

__init__(params, popsize=50, inertia=0.9, cognitive=2.0, social=2.0, seed=None, bounds=None)[source]
step(closure)[source]

Perform a single optimization step to update the parameters.

Parameters:

closure (Callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

QuantumAnnealing

class torchium.optimizers.QuantumAnnealing(params, temperature=1.0, cooling_rate=0.99, min_temperature=0.01, seed=None)[source]

Bases: Optimizer

Simplified Quantum Annealing optimizer

__init__(params, temperature=1.0, cooling_rate=0.99, min_temperature=0.01, seed=None)[source]
step(closure)[source]

Perform a single optimization step to update the parameters.

Parameters:

closure (Callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

GeneticAlgorithm

class torchium.optimizers.GeneticAlgorithm(params, popsize=50, mutation_rate=0.1, crossover_rate=0.8, elite_ratio=0.1, seed=None, bounds=None)[source]

Bases: Optimizer

Genetic Algorithm optimizer

__init__(params, popsize=50, mutation_rate=0.1, crossover_rate=0.8, elite_ratio=0.1, seed=None, bounds=None)[source]
step(closure)[source]

Perform a single optimization step to update the parameters.

Parameters:

closure (Callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

Adaptive Optimizers

Adam Variants

class torchium.optimizers.Adam(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, **kwargs)[source]

Bases: Adam

Adam optimizer with enhanced features.

__init__(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, **kwargs)[source]
class torchium.optimizers.AdamW(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.01, amsgrad: bool = False, **kwargs)[source]

Bases: AdamW

AdamW optimizer with decoupled weight decay.

__init__(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.01, amsgrad: bool = False, **kwargs)[source]
class torchium.optimizers.RAdam(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, **kwargs)[source]

Bases: Optimizer

RAdam: Rectified Adam optimizer.

Reference: https://arxiv.org/abs/1908.03265

__init__(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, **kwargs)[source]
step(closure: Callable | None = None)[source]

Performs a single optimization step.

class torchium.optimizers.AdaBelief(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, **kwargs)[source]

Bases: Optimizer

AdaBelief: Adapting Step-sizes by the Belief in Observed Gradients.

Reference: https://arxiv.org/abs/2010.07468

__init__(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, amsgrad: bool = False, **kwargs)[source]
step(closure: Callable | None = None)[source]

Performs a single optimization step.

class torchium.optimizers.AdaBound(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), final_lr: float = 0.1, gamma: float = 0.001, eps: float = 1e-08, weight_decay: float = 0, amsbound: bool = False, **kwargs)[source]

Bases: Optimizer

AdaBound: Adaptive Gradient Methods with Dynamic Bound of Learning Rate.

Reference: https://openreview.net/forum?id=Bkg3g2R9FX

__init__(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), final_lr: float = 0.1, gamma: float = 0.001, eps: float = 1e-08, weight_decay: float = 0, amsbound: bool = False, **kwargs)[source]
step(closure: Callable | None = None)[source]

Performs a single optimization step.

class torchium.optimizers.AdaHessian(params, lr=0.15, betas=(0.9, 0.999), eps=0.0001, weight_decay=0, hessian_power=1, update_each=1, n_samples=1, avg_conv_kernel=False)[source]

Bases: Optimizer

AdaHessian optimizer using second-order information

__init__(params, lr=0.15, betas=(0.9, 0.999), eps=0.0001, weight_decay=0, hessian_power=1, update_each=1, n_samples=1, avg_conv_kernel=False)[source]
get_trace(gradsH)[source]

Compute trace of Hessian

step(closure=None)[source]

Perform a single optimization step to update the parameters.

Parameters:

closure (Callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

class torchium.optimizers.AdamP(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, delta: float = 0.1, wd_ratio: float = 0.1, nesterov: bool = False, **kwargs)[source]

Bases: Optimizer

AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights.

Reference: https://arxiv.org/abs/2006.08217

__init__(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, delta: float = 0.1, wd_ratio: float = 0.1, nesterov: bool = False, **kwargs)[source]
step(closure: Callable | None = None)[source]

Performs a single optimization step.
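
The Adam-family variants above follow the standard torch.optim interface, so they can be swapped in wherever torch.optim.Adam is used; a minimal sketch using AdaBelief with placeholder data and an untuned lr:

import torch
import torch.nn as nn
import torchium

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)

# Drop-in replacement for torch.optim.Adam in the usual training loop.
optimizer = torchium.optimizers.AdaBelief(model.parameters(), lr=1e-3, weight_decay=1e-4)

for _ in range(10):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()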

Adagrad Variants

class torchium.optimizers.Adagrad(params: List[Tensor] | Dict[str, Any], lr: float = 0.01, lr_decay: float = 0, weight_decay: float = 0, initial_accumulator_value: float = 0, eps: float = 1e-10, **kwargs)[source]

Bases: Adagrad

Adagrad optimizer with enhanced features.

__init__(params: List[Tensor] | Dict[str, Any], lr: float = 0.01, lr_decay: float = 0, weight_decay: float = 0, initial_accumulator_value: float = 0, eps: float = 1e-10, **kwargs)[source]
class torchium.optimizers.Adadelta(params: List[Tensor] | Dict[str, Any], lr: float = 1.0, rho: float = 0.9, eps: float = 1e-06, weight_decay: float = 0, **kwargs)[source]

Bases: Adadelta

Adadelta optimizer with enhanced features.

__init__(params: List[Tensor] | Dict[str, Any], lr: float = 1.0, rho: float = 0.9, eps: float = 1e-06, weight_decay: float = 0, **kwargs)[source]
class torchium.optimizers.AdaFactor(params: List[Tensor] | Dict[str, Any], lr: float | None = None, eps2: float = 1e-30, cliping_threshold: float = 1.0, decay_rate: float = -0.8, beta1: float | None = None, weight_decay: float = 0.0, scale_parameter: bool = True, relative_step: bool = True, **kwargs)[source]

Bases: Optimizer

AdaFactor: Adaptive Learning Rates with Sublinear Memory Cost.

Reference: https://arxiv.org/abs/1804.04235

__init__(params: List[Tensor] | Dict[str, Any], lr: float | None = None, eps2: float = 1e-30, cliping_threshold: float = 1.0, decay_rate: float = -0.8, beta1: float | None = None, weight_decay: float = 0.0, scale_parameter: bool = True, relative_step: bool = True, **kwargs)[source]
step(closure: Callable | None = None)[source]

Performs a single optimization step.
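
As in the AdaFactor paper, the optimizer can run without an explicit learning rate when relative_step=True (the default in the signature above), deriving the step size from the update count and parameter scale; a minimal construction sketch:

import torch.nn as nn
import torchium

model = nn.Linear(512, 512)

# With relative_step=True, lr may be left as None; the step size is then
# derived internally (assumed to follow the AdaFactor relative-step rule).
optimizer = torchium.optimizers.AdaFactor(
    model.parameters(),
    lr=None,
    scale_parameter=True,
    relative_step=True,
)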

RMSprop Variants

class torchium.optimizers.RMSprop(params: List[Tensor] | Dict[str, Any], lr: float = 0.01, alpha: float = 0.99, eps: float = 1e-08, weight_decay: float = 0, momentum: float = 0, centered: bool = False, **kwargs)[source]

Bases: RMSprop

RMSprop optimizer with enhanced features.

__init__(params: List[Tensor] | Dict[str, Any], lr: float = 0.01, alpha: float = 0.99, eps: float = 1e-08, weight_decay: float = 0, momentum: float = 0, centered: bool = False, **kwargs)[source]
class torchium.optimizers.Yogi(params: List[Tensor] | Dict[str, Any], lr: float = 0.01, betas: tuple = (0.9, 0.999), eps: float = 0.001, initial_accumulator: float = 1e-06, weight_decay: float = 0, **kwargs)[source]

Bases: Optimizer

Yogi: Adaptive Methods for Nonconvex Optimization.

Reference: https://papers.nips.cc/paper/8186-adaptive-methods-for-nonconvex-optimization

__init__(params: List[Tensor] | Dict[str, Any], lr: float = 0.01, betas: tuple = (0.9, 0.999), eps: float = 0.001, initial_accumulator: float = 1e-06, weight_decay: float = 0, **kwargs)[source]
step(closure: Callable | None = None)[source]

Performs a single optimization step.

Momentum-Based Optimizers

SGD Variants

class torchium.optimizers.SGD(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, momentum: float = 0, dampening: float = 0, weight_decay: float = 0, nesterov: bool = False, **kwargs)[source]

Bases: SGD

SGD optimizer with enhanced features.

__init__(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, momentum: float = 0, dampening: float = 0, weight_decay: float = 0, nesterov: bool = False, **kwargs)[source]
class torchium.optimizers.HeavyBall(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, momentum: float = 0.9, weight_decay: float = 0, **kwargs)[source]

Bases: Optimizer

Heavy Ball momentum optimizer.

Reference: Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods.

__init__(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, momentum: float = 0.9, weight_decay: float = 0, **kwargs)[source]
step(closure: Callable | None = None)[source]

Performs a single optimization step.

Specialized Optimizers

Computer Vision

class torchium.optimizers.Ranger(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, alpha: float = 0.5, k: int = 6, n_sma_threshhold: int = 5, betas: tuple = (0.95, 0.999), eps: float = 1e-05, weight_decay: float = 0, **kwargs)[source]

Bases: Optimizer

Ranger: A synergistic optimizer combining RAdam and LookAhead.

Reference: https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer

__init__(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, alpha: float = 0.5, k: int = 6, n_sma_threshhold: int = 5, betas: tuple = (0.95, 0.999), eps: float = 1e-05, weight_decay: float = 0, **kwargs)[source]
step(closure: Callable | None = None)[source]

Performs a single optimization step.

NLP Optimizers

class torchium.optimizers.LAMB(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-06, weight_decay: float = 0.01, clamp_value: float = 10.0, **kwargs)[source]

Bases: Optimizer

LAMB: Large Batch Optimization for Deep Learning.

Reference: https://arxiv.org/abs/1904.00962

__init__(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-06, weight_decay: float = 0.01, clamp_value: float = 10.0, **kwargs)[source]
step(closure: Callable | None = None)[source]

Performs a single optimization step.
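
LAMB normalizes each layer's update by a trust ratio, which is what makes very large batch sizes practical; a minimal construction sketch (the module and lr are illustrative, not tuned):

import torch.nn as nn
import torchium

model = nn.TransformerEncoderLayer(d_model=128, nhead=4)

# Trust-ratio scaling per layer; pair with a large global batch size and
# an appropriate lr schedule in practice.
optimizer = torchium.optimizers.LAMB(model.parameters(), lr=1e-3, weight_decay=0.01)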

class torchium.optimizers.NovoGrad(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.95, 0.98), eps: float = 1e-08, weight_decay: float = 0, grad_averaging: bool = True, amsgrad: bool = False, **kwargs)[source]

Bases: Optimizer

NovoGrad: Stochastic Gradient Methods with Layer-wise Adaptive Moments.

Reference: https://arxiv.org/abs/1905.11286

__init__(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.95, 0.98), eps: float = 1e-08, weight_decay: float = 0, grad_averaging: bool = True, amsgrad: bool = False, **kwargs)[source]
step(closure: Callable | None = None)[source]

Performs a single optimization step.

Sparse Data

class torchium.optimizers.SparseAdam(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, **kwargs)[source]

Bases: SparseAdam

SparseAdam optimizer with enhanced features.

__init__(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, **kwargs)[source]
class torchium.optimizers.SM3(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, momentum: float = 0.0, eps: float = 1e-08, **kwargs)[source]

Bases: Optimizer

SM3: Memory-Efficient Adaptive Optimization.

Reference: https://arxiv.org/abs/1901.11150

__init__(params: List[Tensor] | Dict[str, Any], lr: float = 0.001, momentum: float = 0.0, eps: float = 1e-08, **kwargs)[source]
step(closure: Callable | None = None)[source]

Performs a single optimization step.

class torchium.optimizers.FTRL(params: List[Tensor] | Dict[str, Any], lr: float = 1.0, lr_power: float = -0.5, l1_regularization_strength: float = 0.0, l2_regularization_strength: float = 0.0, initial_accumulator_value: float = 0.1, **kwargs)[source]

Bases: Optimizer

FTRL: Follow The Regularized Leader optimizer.

Reference: https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf

__init__(params: List[Tensor] | Dict[str, Any], lr: float = 1.0, lr_power: float = -0.5, l1_regularization_strength: float = 0.0, l2_regularization_strength: float = 0.0, initial_accumulator_value: float = 0.1, **kwargs)[source]
step(closure: Callable | None = None)[source]

Performs a single optimization step.
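
SparseAdam only updates the embedding rows that actually received gradients, so it pairs naturally with nn.Embedding(sparse=True); a minimal sketch with placeholder data:

import torch
import torch.nn as nn
import torchium

embedding = nn.Embedding(10000, 64, sparse=True)  # produces sparse gradients
optimizer = torchium.optimizers.SparseAdam(embedding.parameters(), lr=1e-3)

tokens = torch.randint(0, 10000, (32, 16))
loss = embedding(tokens).pow(2).mean()  # placeholder objective
loss.backward()
optimizer.step()  # only the touched embedding rows are updated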

Distributed Training

class torchium.optimizers.LARS(params: List[Tensor] | Dict[str, Any], lr: float = 1.0, momentum: float = 0.9, weight_decay: float = 0.0001, trust_coefficient: float = 0.001, eps: float = 1e-08, **kwargs)[source]

Bases: Optimizer

LARS: Layer-wise Adaptive Rate Scaling.

Reference: https://arxiv.org/abs/1708.03888

__init__(params: List[Tensor] | Dict[str, Any], lr: float = 1.0, momentum: float = 0.9, weight_decay: float = 0.0001, trust_coefficient: float = 0.001, eps: float = 1e-08, **kwargs)[source]
step(closure: Callable | None = None)[source]

Performs a single optimization step.
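
LARS rescales each layer's update by the ratio of parameter norm to gradient norm, which is why it is usually paired with very large global batch sizes in distributed training; a single-process construction sketch with illustrative values:

import torch.nn as nn
import torchium

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# A common recipe scales the base lr with the global batch size; the values
# below are illustrative only, not tuned.
optimizer = torchium.optimizers.LARS(
    model.parameters(),
    lr=1.0,
    momentum=0.9,
    weight_decay=1e-4,
    trust_coefficient=0.001,
)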

General Purpose

class torchium.optimizers.Lion(params: List[Tensor] | Dict[str, Any], lr: float = 0.0001, betas: tuple = (0.9, 0.99), weight_decay: float = 0.0, **kwargs)[source]

Bases: Optimizer

Lion: Symbolic Discovery of Optimization Algorithms.

Reference: https://arxiv.org/abs/2302.06675

__init__(params: List[Tensor] | Dict[str, Any], lr: float = 0.0001, betas: tuple = (0.9, 0.99), weight_decay: float = 0.0, **kwargs)[source]
step(closure: Callable | None = None)[source]

Performs a single optimization step.
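
Because Lion applies sign-based updates, it is commonly run with a smaller learning rate and larger weight decay than AdamW (per the Lion paper's recommendations); a minimal construction sketch with illustrative values:

import torch.nn as nn
import torchium

model = nn.Linear(10, 1)

# Smaller lr and larger weight decay than an AdamW baseline; not tuned.
optimizer = torchium.optimizers.Lion(model.parameters(), lr=1e-4, weight_decay=0.1)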

class torchium.optimizers.MADGRAD(params: List[Tensor] | Dict[str, Any], lr: float = 0.01, momentum: float = 0.9, weight_decay: float = 0, eps: float = 1e-06, **kwargs)[source]

Bases: Optimizer

MADGRAD: A Momentumized, Adaptive, Dual Averaged Gradient Method.

Reference: https://arxiv.org/abs/2101.11075

__init__(params: List[Tensor] | Dict[str, Any], lr: float = 0.01, momentum: float = 0.9, weight_decay: float = 0, eps: float = 1e-06, **kwargs)[source]
step(closure: Callable | None = None)[source]

Performs a single optimization step.

class torchium.optimizers.Apollo(params, lr=0.01, beta=0.9, eps=0.0001, init_lr=0.01, warmup=0, **kwargs)[source]

Bases: Optimizer

Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method.

__init__(params, lr=0.01, beta=0.9, eps=0.0001, init_lr=0.01, warmup=0, **kwargs)[source]
step(closure=None)[source]

Perform a single optimization step to update the parameters.

Parameters:

closure (Callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

PyTorch Native Optimizers

For completeness, Torchium also includes all PyTorch native optimizers:

class torchium.optimizers.NAdam(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.002, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, momentum_decay: float = 0.004, decoupled_weight_decay: bool = False, *, foreach: bool | None = None, maximize: bool = False, capturable: bool = False, differentiable: bool = False)[source]

Bases: Optimizer

Implements NAdam algorithm.

\[\begin{split}\begin{aligned} &\rule{110mm}{0.4pt} \\ &\textbf{input} : \gamma_t \text{ (lr)}, \: \beta_1,\beta_2 \text{ (betas)}, \: \theta_0 \text{ (params)}, \: f(\theta) \text{ (objective)} \\ &\hspace{13mm} \: \lambda \text{ (weight decay)}, \:\psi \text{ (momentum decay)} \\ &\hspace{13mm} \: \textit{decoupled\_weight\_decay}, \:\textit{maximize} \\ &\textbf{initialize} : m_0 \leftarrow 0 \text{ ( first moment)}, v_0 \leftarrow 0 \text{ ( second moment)} \\[-1.ex] &\rule{110mm}{0.4pt} \\ &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\ &\hspace{5mm}\textbf{if} \: \textit{maximize}: \\ &\hspace{10mm}g_t \leftarrow -\nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm}\textbf{else} \\ &\hspace{10mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm} \theta_t \leftarrow \theta_{t-1} \\ &\hspace{5mm} \textbf{if} \: \lambda \neq 0 \\ &\hspace{10mm}\textbf{if} \: \textit{decoupled\_weight\_decay} \\ &\hspace{15mm} \theta_t \leftarrow \theta_{t-1} - \gamma \lambda \theta_{t-1} \\ &\hspace{10mm}\textbf{else} \\ &\hspace{15mm} g_t \leftarrow g_t + \lambda \theta_{t-1} \\ &\hspace{5mm} \mu_t \leftarrow \beta_1 \big(1 - \frac{1}{2} 0.96^{t \psi} \big) \\ &\hspace{5mm} \mu_{t+1} \leftarrow \beta_1 \big(1 - \frac{1}{2} 0.96^{(t+1)\psi}\big)\\ &\hspace{5mm}m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ &\hspace{5mm}v_t \leftarrow \beta_2 v_{t-1} + (1-\beta_2) g^2_t \\ &\hspace{5mm}\widehat{m_t} \leftarrow \mu_{t+1} m_t/(1-\prod_{i=1}^{t+1}\mu_i)\\[-1.ex] & \hspace{11mm} + (1-\mu_t) g_t /(1-\prod_{i=1}^{t} \mu_{i}) \\ &\hspace{5mm}\widehat{v_t} \leftarrow v_t/\big(1-\beta_2^t \big) \\ &\hspace{5mm}\theta_t \leftarrow \theta_t - \gamma \widehat{m_t}/ \big(\sqrt{\widehat{v_t}} + \epsilon \big) \\ &\rule{110mm}{0.4pt} \\[-1.ex] &\bf{return} \: \theta_t \\[-1.ex] &\rule{110mm}{0.4pt} \\[-1.ex] \end{aligned}\end{split}\]

For further details regarding the algorithm we refer to Incorporating Nesterov Momentum into Adam.

Parameters:
  • params (iterable) – iterable of parameters or named_parameters to optimize or iterable of dicts defining parameter groups. When using named_parameters, all parameters in all groups should be named

  • lr (float, Tensor, optional) – learning rate (default: 2e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • momentum_decay (float, optional) – momentum decay (default: 4e-3)

  • decoupled_weight_decay (bool, optional) – whether to decouple the weight decay as in AdamW to obtain NAdamW. If True, the algorithm does not accumulate weight decay in the momentum nor variance. (default: False)

  • foreach (bool, optional) – whether foreach implementation of optimizer is used. If unspecified by the user (so foreach is None), we will try to use foreach over the for-loop implementation on CUDA, since it is usually significantly more performant. Note that the foreach implementation uses ~ sizeof(params) more peak memory than the for-loop version due to the intermediates being a tensorlist vs just one tensor. If memory is prohibitive, batch fewer parameters through the optimizer at a time or switch this flag to False (default: None)

  • maximize (bool, optional) – maximize the objective with respect to the params, instead of minimizing (default: False)

  • capturable (bool, optional) – whether this instance is safe to capture in a graph, whether for CUDA graphs or for torch.compile support. Tensors are only capturable when on supported accelerators. Passing True can impair ungraphed performance, so if you don’t intend to graph capture this instance, leave it False (default: False)

  • differentiable (bool, optional) – whether autograd should occur through the optimizer step in training. Otherwise, the step() function runs in a torch.no_grad() context. Setting to True can impair performance, so leave it False if you don’t intend to run autograd through this instance (default: False)

__init__(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.002, betas: tuple[float, float] = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0, momentum_decay: float = 0.004, decoupled_weight_decay: bool = False, *, foreach: bool | None = None, maximize: bool = False, capturable: bool = False, differentiable: bool = False)[source]
step(closure=None)[source]

Perform a single optimization step.

Parameters:

closure (Callable, optional) – A closure that reevaluates the model and returns the loss.

class torchium.optimizers.Rprop(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.01, etas: tuple[float, float] = (0.5, 1.2), step_sizes: tuple[float, float] = (1e-06, 50), *, capturable: bool = False, foreach: bool | None = None, maximize: bool = False, differentiable: bool = False)[source]

Bases: Optimizer

Implements the resilient backpropagation algorithm.

\[\begin{split}\begin{aligned} &\rule{110mm}{0.4pt} \\ &\textbf{input} : \theta_0 \in \mathbf{R}^d \text{ (params)},f(\theta) \text{ (objective)}, \\ &\hspace{13mm} \eta_{+/-} \text{ (etaplus, etaminus)}, \Gamma_{max/min} \text{ (step sizes)} \\ &\textbf{initialize} : g^0_{prev} \leftarrow 0, \: \eta_0 \leftarrow \text{lr (learning rate)} \\ &\rule{110mm}{0.4pt} \\ &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\ &\hspace{5mm}g_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm} \textbf{for} \text{ } i = 0, 1, \ldots, d-1 \: \mathbf{do} \\ &\hspace{10mm} \textbf{if} \: g^i_{prev} g^i_t > 0 \\ &\hspace{15mm} \eta^i_t \leftarrow \mathrm{min}(\eta^i_{t-1} \eta_{+}, \Gamma_{max}) \\ &\hspace{10mm} \textbf{else if} \: g^i_{prev} g^i_t < 0 \\ &\hspace{15mm} \eta^i_t \leftarrow \mathrm{max}(\eta^i_{t-1} \eta_{-}, \Gamma_{min}) \\ &\hspace{15mm} g^i_t \leftarrow 0 \\ &\hspace{10mm} \textbf{else} \: \\ &\hspace{15mm} \eta^i_t \leftarrow \eta^i_{t-1} \\ &\hspace{5mm}\theta_t \leftarrow \theta_{t-1}- \eta_t \mathrm{sign}(g_t) \\ &\hspace{5mm}g_{prev} \leftarrow g_t \\ &\rule{110mm}{0.4pt} \\[-1.ex] &\bf{return} \: \theta_t \\[-1.ex] &\rule{110mm}{0.4pt} \\[-1.ex] \end{aligned}\end{split}\]

For further details regarding the algorithm we refer to the paper A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm.

Parameters:
  • params (iterable) – iterable of parameters or named_parameters to optimize or iterable of dicts defining parameter groups. When using named_parameters, all parameters in all groups should be named

  • lr (float, optional) – learning rate (default: 1e-2)

  • etas (Tuple[float, float], optional) – pair of (etaminus, etaplus), that are multiplicative increase and decrease factors (default: (0.5, 1.2))

  • step_sizes (Tuple[float, float], optional) – a pair of minimal and maximal allowed step sizes (default: (1e-6, 50))

  • capturable (bool, optional) – whether this instance is safe to capture in a graph, whether for CUDA graphs or for torch.compile support. Tensors are only capturable when on supported accelerators. Passing True can impair ungraphed performance, so if you don’t intend to graph capture this instance, leave it False (default: False)

  • foreach (bool, optional) – whether foreach implementation of optimizer is used. If unspecified by the user (so foreach is None), we will try to use foreach over the for-loop implementation on CUDA, since it is usually significantly more performant. Note that the foreach implementation uses ~ sizeof(params) more peak memory than the for-loop version due to the intermediates being a tensorlist vs just one tensor. If memory is prohibitive, batch fewer parameters through the optimizer at a time or switch this flag to False (default: None)

  • maximize (bool, optional) – maximize the objective with respect to the params, instead of minimizing (default: False)

  • differentiable (bool, optional) – whether autograd should occur through the optimizer step in training. Otherwise, the step() function runs in a torch.no_grad() context. Setting to True can impair performance, so leave it False if you don’t intend to run autograd through this instance (default: False)

__init__(params: Iterable[Tensor] | Iterable[dict[str, Any]] | Iterable[tuple[str, Tensor]], lr: float | Tensor = 0.01, etas: tuple[float, float] = (0.5, 1.2), step_sizes: tuple[float, float] = (1e-06, 50), *, capturable: bool = False, foreach: bool | None = None, maximize: bool = False, differentiable: bool = False)[source]
step(closure=None)[source]

Perform a single optimization step.

Parameters:

closure (Callable, optional) – A closure that reevaluates the model and returns the loss.

Usage Examples

Basic Usage

import torch
import torch.nn as nn
import torchium

model = nn.Linear(10, 1)

# Use SAM optimizer for better generalization.
# SAM wraps a base optimizer; passing the optimizer class is assumed here.
optimizer = torchium.optimizers.SAM(
    model.parameters(),
    base_optimizer=torch.optim.SGD,
    lr=1e-3,
    rho=0.05
)

Advanced Usage

# Different learning rates for different layers
param_groups = [
    {'params': model.features.parameters(), 'lr': 1e-4},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
]

optimizer = torchium.optimizers.Lion(param_groups)

Factory Functions

# Create optimizer by name
optimizer = torchium.create_optimizer(
    'sam',
    model.parameters(),
    lr=1e-3
)

# List all available optimizers
available = torchium.get_available_optimizers()
print(f"Available optimizers: {len(available)}")

Performance Comparison

The selection guide below summarizes the results of our comprehensive benchmarks.

Optimizer Selection Guide

For General Purpose Training:
  • SAM: Best generalization, flatter minima

  • AdaBelief: Stable, good for most tasks

  • Lion: Memory efficient, good performance

For Computer Vision:
  • Ranger: Excellent for vision tasks

  • Lookahead: Good for large models

  • SAM: Better generalization

For Natural Language Processing:
  • LAMB: Excellent for large batch training

  • NovoGrad: Good for transformer models

  • AdamW: Reliable baseline

For Memory-Constrained Environments:
  • Lion: Lowest memory usage

  • SGD: Classic, minimal memory

  • HeavyBall: Good momentum alternative

For Second-Order Optimization:
  • LBFGS: Fast convergence for well-conditioned problems

  • Shampoo: Excellent for large models

  • AdaHessian: Adaptive second-order method

For Experimental/Research:
  • CMA-ES: Global optimization

  • DifferentialEvolution: Robust optimization

  • ParticleSwarmOptimization: Nature-inspired