slsqp_jax.hessian¶
L-BFGS history container and Hessian / inverse-Hessian-vector products, VARCHEN-style damping, soft / identity resets and the Lagrangian gradient helpers.
slsqp_jax.hessian
L-BFGS Hessian Approximation for SLSQP.
This module implements the Limited-memory BFGS (L-BFGS) algorithm for maintaining a matrix-free approximation to the Hessian of the Lagrangian.
Instead of storing a dense n x n matrix (O(n^2) memory), L-BFGS stores the last k (s, y) pairs and computes Hessian-vector products in O(kn) time using the compact representation (Byrd, Nocedal, Schnabel 1994):
B = B_0 - [B_0 S, Y] @ M^{-1} @ [S^T B_0; Y^T]
where B_0 = diag(diagonal) is the initial Hessian (per-variable scaling)
and M is a small 2k x 2k matrix built from inner products of the stored
vectors. During normal operation diagonal = gamma * ones(n) and this
reduces to the scalar-scaled form. After an SNOPT-style diagonal reset,
diagonal captures per-variable curvature from the discarded history.
VARCHEN-style Powell damping (Lotfi et al., 2020) is applied to each (s, y) pair before storage, damping toward B_0 = diag(diagonal) instead of the full L-BFGS approximation B. This is cheaper (O(n) vs O(k^2 n)), always well-conditioned, and avoids the circular dependency where a badly-conditioned B poisons its own damping.
- class slsqp_jax.hessian.LBFGSHistory[source]¶
Bases:
ModuleL-BFGS history buffer for matrix-free Hessian approximation.
Stores the last k (s, y) pairs in a circular buffer and provides efficient Hessian-vector products via the compact representation.
- Attributes:
s_history: Stored step vectors s_i = x_{i+1} - x_i. y_history: Stored (damped) gradient differences y_i. gamma: Scalar summary of the initial Hessian scaling. diagonal: Per-variable initial Hessian scaling (B_0 = diag(d)).
During normal operation this equals
gamma * ones(n). After an SNOPT-style reset it stores per-variable curvature.count: Number of valid pairs stored (0 to memory size). next_idx: Next write position in the circular buffer. eig_lower: Estimate of lambda_min(H) from VARCHEN Theorem 2. eig_upper: Estimate of lambda_max(H) from VARCHEN Theorem 2.
- s_history: Float[Array, 'memory n']¶
- y_history: Float[Array, 'memory n']¶
- gamma: Float[Array, '']¶
- diagonal: Float[Array, 'n']¶
- count: Int[Array, '']¶
- next_idx: Int[Array, '']¶
- eig_lower: Float[Array, '']¶
- eig_upper: Float[Array, '']¶
- __init__(s_history, y_history, gamma, diagonal, count, next_idx, eig_lower, eig_upper)¶
- Parameters:
s_history (Float[Array, 'memory n'])
y_history (Float[Array, 'memory n'])
gamma (Float[Array, ''])
diagonal (Float[Array, 'n'])
count (Int[Array, ''])
next_idx (Int[Array, ''])
eig_lower (Float[Array, ''])
eig_upper (Float[Array, ''])
- Return type:
None
- slsqp_jax.hessian.lbfgs_init(n, memory)[source]¶
Initialize an empty L-BFGS history buffer.
- Return type:
- Parameters:
- Args:
n: Dimension of the parameter space. memory: Maximum number of (s, y) pairs to store (typically 5-20).
- Returns:
An initialized LBFGSHistory with no stored pairs and gamma=1.
- slsqp_jax.hessian.lbfgs_hvp(history, v)[source]¶
Compute B @ v using the L-BFGS compact representation.
Uses the compact form with diagonal initial Hessian B_0 = diag(d) (Byrd, Nocedal & Schnabel, 1994, Theorem 2.2):
B = B_0 - [B_0 S, Y] @ M^{-1} @ [S^T B_0; Y^T]
- where:
M = [[S^T B_0 S, L], [L^T, -D_sy]] (2k x 2k) L_{ij} = s_i^T y_j for i > j (strictly lower triangular) D_sy = diag(s_i^T y_i) (diagonal)
When no pairs are stored (count=0), this reduces to B = diag(d).
Complexity: O(k^2 n) where k is the number of stored pairs.
- Args:
history: L-BFGS history buffer. v: Vector to multiply by the Hessian approximation.
- Returns:
B @ v, the Hessian-vector product.
- Return type:
Float[Array, 'n']- Parameters:
history (LBFGSHistory)
v (Float[Array, 'n'])
- slsqp_jax.hessian.lbfgs_inverse_hvp(history, v)[source]¶
Compute H @ v = B^{-1} @ v via the L-BFGS two-loop recursion.
Implements Nocedal & Wright Algorithm 7.4 with diagonal initial scaling H_0 = diag(1/diagonal) instead of scalar (1/gamma) I.
Complexity: O(kn) where k is the number of stored pairs.
- Return type:
Float[Array, 'n']- Parameters:
history (LBFGSHistory)
v (Float[Array, 'n'])
- Args:
history: L-BFGS history buffer. v: Vector to multiply by the inverse Hessian approximation.
- Returns:
H @ v = B^{-1} @ v, the inverse Hessian-vector product.
- slsqp_jax.hessian.lbfgs_should_skip(s, y, skip_threshold=1e-08)[source]¶
Return the scalar
should_skipdecision thatlbfgs_appenduses.Exposed so callers can observe the skip decision without having to diff fields on the returned history (which becomes meaningless once the circular buffer saturates, since
countstays atmemoryfrom then on regardless of whether a real append happened).- Return type:
Bool[Array, '']- Parameters:
s (Float[Array, 'n'])
y (Float[Array, 'n'])
skip_threshold (float)
- slsqp_jax.hessian.lbfgs_curvature_diagnostics(s, y, skip_threshold=1e-08)[source]¶
Return
(s.y, relative_curvature, skipped)for the curvature pair.Exposes the same intermediate quantities
lbfgs_should_skipuses so the verbose callback can surface them without re-implementing the skip predicate.relative_curvature = |s.y| / (||s|| * ||y||)is the cosine of the angle betweensandyand the dominant contributor to the skip decision (a near-zero value indicates the curvature pair is at the floating-point noise floor and gets skipped by therelative_curvature < skip_thresholdbranch oflbfgs_should_skip).- Return type:
tuple[Float[Array, ''],Float[Array, ''],Bool[Array, '']]- Parameters:
s (Float[Array, 'n'])
y (Float[Array, 'n'])
skip_threshold (float)
- Args:
s: Step vector
x_{k+1} - x_k. y: Gradient difference∇L_{k+1} - ∇L_k(Lagrangian gradientdifference for SLSQP).
skip_threshold: Threshold passed to
lbfgs_should_skip.- Returns:
(sty, relative_curvature, skipped), wheresty = s.yis the raw curvature inner product,relative_curvatureis the normalised version used in the skip predicate, andskippedis the same booleanlbfgs_should_skipreturns.
- slsqp_jax.hessian.lbfgs_append(history, s, y, damping_threshold=0.2, skip_threshold=1e-08, diag_floor=0.0001, diag_ceil=1000000.0)[source]¶
Append a new (s, y) pair to the L-BFGS history with Powell damping.
Uses VARCHEN-style damping toward B0 (Lotfi et al., 2020, eq 7-8): damping is computed against the diagonal initial Hessian
B_0 = diag(diagonal)instead of the full L-BFGS approximation B. This is O(n) instead of O(k^2 n), always well-conditioned (the diagonal is clipped to[diag_floor, diag_ceil]), and avoids the circular dependency where a badly-conditioned B poisons its own damping.- Return type:
- Parameters:
history (LBFGSHistory)
s (Float[Array, 'n'])
y (Float[Array, 'n'])
damping_threshold (float)
skip_threshold (float)
diag_floor (float)
diag_ceil (float)
- The damped gradient difference is:
y_damped = theta * y + (1 - theta) * B0 s
- where theta in [0, 1] is chosen to satisfy:
s^T y_damped >= threshold * s^T B0 s
If
||s||is too small or the curvature ratio is too extreme, the update is skipped entirely to avoid numerical issues. Thecurvature_ratiobracket is[1e-8, 1e8](more permissive than the previous[1e-6, 1e6]) so we no longer drop noisy-but- informative pairs on high-dimensional problems; these pairs instead flow through Powell damping. Therelative_curvature < 1e-8skip is now gated onsᵀy > 0(positive curvature) so negative- curvature pairs are handled by damping rather than dropped.After appending, the scalar
gammais updated toy_damped^T y_damped / (y_damped^T s)and clipped to[diag_floor, diag_ceil]. The per-variable diagonal is updated using a component-wise secant:d[i] = |y_damped[i] * s[i]| / (s[i]^2 + eps)and clipped to the same bracket. This givesd[i] ≈ |H_{ii}|for diagonal Hessians regardless of step direction, unlike the classical Shanno-Phua formulad[i] = y_i^2 / (y^T s)which producesd ∝ h_i^2for multi-component steps.The old
1e-6floor onclip_loallowed the inverse-Hessian diagonalH_0 = 1/dto balloon to1e6, producing very large CG directions that the line search could not backtrack. Raising the floor to1e-4keeps the inverse diagonal bounded by1e4and dramatically reduces the “non-descent direction + chronic line- search backtracking” failure mode.- Args:
history: Current L-BFGS history. s: Step vector s = x_{k+1} - x_k. y: Gradient difference y = nabla L_{k+1} - nabla L_k. damping_threshold: Powell damping threshold (default 0.2). skip_threshold: Minimum step norm for update (default 1e-8). diag_floor: Minimum per-variable diagonal entry (default 1e-4). diag_ceil: Maximum per-variable diagonal entry (default 1e6).
- Returns:
Updated L-BFGS history with the new pair appended.
- slsqp_jax.hessian.lbfgs_estimate_condition(history, damping_threshold=0.2)[source]¶
Estimate eigenvalue bounds of the inverse Hessian approximation H.
Uses the diagonal
B_0 = diag(d)as a proxy for the full L-BFGS condition number. The inverse Hessian eigenvalues are bounded by[1/max(d), 1/min(d)], givingkappa(H) = max(d) / min(d).This is a practical simplification of the VARCHEN Theorem 2 bounds (Lotfi et al., 2020). The full recursive bounds are theoretically tight but overly pessimistic in practice because they propagate worst-case Lipschitz estimates, leading to false soft-reset triggers on moderately conditioned problems. Since we damp toward B0 (not the full B), the diagonal condition number is the most relevant quantity.
- Return type:
tuple[Float[Array, ''],Float[Array, '']]- Parameters:
history (LBFGSHistory)
damping_threshold (float)
- Returns:
(lambda_min_est, lambda_max_est) bounding the eigenvalues of H.
- slsqp_jax.hessian.lbfgs_soft_reset(history)[source]¶
VARCHEN-style soft reset: keep only the most recent (s, y) pair.
When the estimated condition number of the inverse Hessian exceeds a threshold, this drops all but the newest curvature pair. This is less aggressive than
lbfgs_reset()(which extracts the diagonal and drops everything) orlbfgs_identity_reset()(which restoresB = I), preserving the most relevant curvature information.Based on VARCHEN Algorithm 1, Step 7 (Lotfi et al., 2020).
- Return type:
- Parameters:
history (LBFGSHistory)
- slsqp_jax.hessian.lbfgs_compute_diagonal(history)[source]¶
Extract diag(B_k) from the L-BFGS compact representation.
From the compact form
B = B_0 - W M^{-1} W^T, the diagonal isdiag(B_k) = diagonal - diag(W M^{-1} W^T)
where
W = [B_0 S, Y]is(n, 2k)andMis the2k x 2kinner matrix. The correction term is computed in O(k^2 n) by formingQ = W M^{-1}and summing(Q * W)row-wise.This is used by
lbfgs_reset()to implement the SNOPT diagonal reset strategy (Gill, Murray & Saunders, 2005, Section 3.3).- Return type:
Float[Array, 'n']- Parameters:
history (LBFGSHistory)
- slsqp_jax.hessian.lbfgs_reset(history, diag_floor=0.0001, diag_ceil=1000000.0)[source]¶
SNOPT-style diagonal reset of the L-BFGS history.
Extracts
diag(B_k)from the current approximation, discards all stored(s, y)pairs, and restarts withB_0 = diag(diag(B_k)). This preserves per-variable curvature information across the reset, preventing the “everything is flat” effect that occurs when the scalargammabecomes very small.The extracted diagonal is clipped to
[diag_floor, diag_ceil]to keep the reset point aligned with the clipping used insidelbfgs_append(); previous versions used a hardcoded[1e-2, 1e6]that diverged from the append-path floor.Based on the SNOPT limited-memory reset strategy (Gill, Murray & Saunders, SIAM Review, 47(1), 2005, Section 3.3).
- Return type:
- Parameters:
history (LBFGSHistory)
diag_floor (float)
diag_ceil (float)
- slsqp_jax.hessian.lbfgs_identity_reset(history)[source]¶
Hard reset of L-BFGS history to the identity Hessian.
Unlike
lbfgs_reset()which preserves per-variable curvature, this discards everything and restarts withB_0 = I. Used as an escalation when repeated SNOPT-style diagonal resets fail to break an ill-conditioning cycle (e.g. consecutive QP failures where the extracted diagonal perpetuates the same problematic scaling).- Return type:
- Parameters:
history (LBFGSHistory)
- slsqp_jax.hessian.compute_partial_lagrangian_gradient(grad_f, eq_jac, multipliers_eq, gen_jac, multipliers_gen)[source]¶
Compute the partial Lagrangian gradient (without the bound block).
Returns
nabla f(x) - J_eq(x)^T lambda - J_gen(x)^T mu_gen, whereJ_genis the Jacobian of the general (nonlinear) inequality constraints — i.e. the inequality block excluding the constant identity-style rows that come from box bounds.This is the core helper used by both
compute_lagrangian_gradient(which adds the bound contribution on top) and the post-line-search NLP-level bound-multiplier recovery insolver.py: by reading off the partial Lagrangian gradient at indices that are at a bound atx_{k+1}, the recovery picks the bound multiplier that exactly zeros the corresponding component of the full Lagrangian gradient.- Return type:
Float[Array, 'n']- Parameters:
grad_f (Float[Array, 'n'])
eq_jac (Float[Array, 'm_eq n'])
multipliers_eq (Float[Array, 'm_eq'])
gen_jac (Float[Array, 'm_gen n'])
multipliers_gen (Float[Array, 'm_gen'])
- Args:
grad_f: Gradient of the objective function
nabla f(x). eq_jac: Jacobian of equality constraints (m_eq x n). multipliers_eq: Lagrange multipliers for equality constraints. gen_jac: Jacobian of general inequality constraints(m_gen x n). Must exclude bound rows.
- multipliers_gen: Lagrange multipliers for general inequality
constraints.
- Returns:
Partial Lagrangian gradient with no bound contribution.
- slsqp_jax.hessian.compute_lagrangian_gradient(grad_f, eq_jac, ineq_jac, multipliers_eq, multipliers_ineq)[source]¶
Compute the gradient of the Lagrangian function.
- Return type:
Float[Array, 'n']- Parameters:
grad_f (Float[Array, 'n'])
eq_jac (Float[Array, 'm_eq n'])
ineq_jac (Float[Array, 'm_ineq n'])
multipliers_eq (Float[Array, 'm_eq'])
multipliers_ineq (Float[Array, 'm_ineq'])
- The Lagrangian is:
L(x, lambda, mu) = f(x) - lambda^T c_eq(x) - mu^T c_ineq(x)
- Its gradient with respect to x is:
nabla_x L = nabla f(x) - J_eq^T lambda - J_ineq^T mu
ineq_jacandmultipliers_ineqmust include the bound block when bounds are present (the ineq layout used bysolver.pyis[general; lower_bound; upper_bound]). Internally this is built on top ofcompute_partial_lagrangian_gradient()so the bound-multiplier recovery in the outer solver shares the same code path for the non-bound contribution.- Args:
grad_f: Gradient of objective function nabla f(x). eq_jac: Jacobian of equality constraints (m_eq x n). ineq_jac: Jacobian of inequality constraints (m_ineq x n),
including bound rows.
multipliers_eq: Lagrange multipliers for equality constraints. multipliers_ineq: Lagrange multipliers for inequality
constraints, including bound multipliers.
- Returns:
Gradient of Lagrangian nabla_x L.
- slsqp_jax.hessian.estimate_hessian_diagonal(hvp_fn, n, key, n_probes=20)[source]¶
Estimate the diagonal of a matrix given only its HVP.
Uses the Bekas-Kokiopoulou-Saad (2007) stochastic estimator: for Rademacher random vectors z (entries i.i.d. from {-1, +1}), E[z * (H z)] = diag(H). Averaging over
n_probessamples gives an unbiased estimate with variance O(||off-diag||^2 / k).The estimate is accurate when the matrix is diagonally dominant, which is common for Hessians of smooth functions in the natural coordinate basis.
- Return type:
Float[Array, 'n']- Parameters:
- Args:
hvp_fn: Function v -> H @ v (Hessian-vector product). n: Dimension of the matrix. key: JAX PRNG key for generating random probes. n_probes: Number of Rademacher probes (default 20). More
probes reduce variance at the cost of more HVP calls. Each probe costs one forward-over-reverse AD pass.
- Returns:
Estimated diagonal of H, shape (n,).