Performance Benchmarks

The figures on this page were produced by running benchmarks/bench_solvers.py from the repository root. To regenerate the figures from a fresh benchmark run:

python benchmarks/bench_solvers.py
python benchmarks/plot_results.py

Hardware context for the results shown here:

  • CPU: AMD Ryzen 7 7840U (16 logical cores)

  • RAM: 46.4 GiB

  • OS: Linux 6.17, Ubuntu 24.04

  • Python 3.12.7 · NumPy 1.26.4 · SciPy 1.16.3

Absolute timings will differ on other hardware; relative performance between methods and cache modes is broadly representative.

Solver scaling

Solve time vs. grid size for FD, FFT, and SAS (1-D and 2-D)

The FD direct solver uses a sparse LU factorisation and scales roughly as \(O(N^{1.5})\) in two dimensions, where \(N\) is the total number of grid cells. At 400×400 cells (\(N^2 = 160\,000\)) a single FD solve takes about 5 s; at 200×200 it takes about 0.6 s.

The FFT method solves the problem in the spectral domain and scales as \(O(N \log N)\). In two dimensions it is roughly two orders of magnitude faster than FD at the same grid size, but requires a uniform (scalar) elastic thickness and imposes periodic boundary conditions (or zero-padded behaviour with no_outside_loads).

The SAS method superposes analytical deflection kernels using fftconvolve, scaling as \(O(N^2 \log N)\) in two dimensions. It is load-pattern-independent and faster than FD for small to moderate grids, but becomes comparable to or slower than FD at very large grids. It also requires a uniform elastic thickness.

The Te profile has a modest effect on FD timing: the shaded band (min to max across all profiles tested — constant, sinusoidal, abrupt step, tanh sigmoidal, correlated noise, wide dynamic range) is narrow, typically within ±15 % of the mean. Spatially noisy profiles show slightly higher times at large grid sizes, likely due to increased fill-in during sparse LU factorisation.

LU factorisation cache

LU cache speedup via the run() path

Speedup of the ``True`` and ``”no_check”`` cache modes over uncached (``False``) via the full run() path. The band shows the range across all Te profiles; the line is the mean. The dotted line at 1× marks no speedup.

When the load \(q_s\) changes between calls but the grid, elastic thickness, and boundary conditions remain fixed, the coefficient matrix does not change. Caching the LU factorisation eliminates the cost of re-factorising on every call, reducing the per-solve work to a single triangular solve.

In two dimensions the ``”no_check”`` mode reaches 7–12× speedup over uncached at grid sizes of 50×50 to 400×400 cells. The speedup grows with grid size because the factorisation cost (eliminated by caching) scales as \(O(N^{1.5})\) while the cached solve (triangular back-substitution) scales as \(O(N)\).

The ``True`` mode (hash-validated cache reuse) provides a similar trend but at a lower speedup because it computes and compares a matrix hash on every call. For large 2-D grids the hash cost is a small fraction of the total, so True and "no_check" converge. For 1-D problems, where solve times are short, the hash overhead is comparatively larger.

The timings include all overhead from a run() call: bc_check(), coordinate setup, warning checks, and the _solve_fd() cache-bypass logic. This is more representative of real coupling-loop performance than timing fd_solve() in isolation.

See API Reference for usage details, including the smart invalidation mechanism that automatically clears the cache when any matrix-determining input (te, dx, boundary conditions, etc.) is reassigned.

Cost of changing \(T_e\)

Per-solve cost: load-only vs Te change

Per-solve cost when only the load changes (load-only, ``”no_check”`` mode) versus when the scalar elastic thickness :math:`T_e` changes on every call (Te change, any cache mode).

Reassigning \(T_e\) triggers smart cache invalidation: the coefficient matrix is cleared and the LU factorisation is discarded. The next run() call rebuilds the matrix from scratch and re-factorises. All three cache modes (False, True, "no_check") pay essentially the same cost when \(T_e\) changes, because the rebuild dominates the per-call budget.

At 200×200 cells the load-only cost is roughly 53 ms per solve while a \(T_e\) change costs roughly 590 ms per solve — about an 11× penalty. At 400×400 cells the load-only cost is roughly 400 ms per solve while a \(T_e\) change costs roughly 5.3 s — about a 13× penalty. The gap widens with grid size because factorisation scales as \(O(N^{1.5})\) and the incremental triangular solve scales as \(O(N)\).

The practical implication: in a coupling loop where \(T_e\) is fixed and only \(q_s\) varies (e.g., a transient ice-sheet or sediment-loading model), cache_factorization = "no_check" or True provides substantial speedup. In a parameter-sweep or inversion where \(T_e\) changes on every iteration, all modes are equivalent and the rebuild cost is unavoidable.