Performance Benchmarks
======================

The figures on this page were produced by running ``benchmarks/bench_solvers.py``
and then ``benchmarks/plot_results.py`` from the repository root.  To regenerate
them from a fresh benchmark run::

   python benchmarks/bench_solvers.py
   python benchmarks/plot_results.py

Hardware context for the results shown here:

* **CPU**: AMD Ryzen 7 7840U (16 logical cores)
* **RAM**: 46.4 GiB
* **OS**: Linux 6.17, Ubuntu 24.04
* **Software**: Python 3.12.7, NumPy 1.26.4, SciPy 1.16.3

Absolute timings will differ on other hardware; relative performance between
methods and cache modes is broadly representative.

Solver scaling
--------------

.. figure:: _static/bench_scaling.png
   :width: 100%
   :alt: Solve time vs. grid size for FD, FFT, and SAS (1-D and 2-D)

The **FD direct solver** uses a sparse LU factorisation and scales roughly as
:math:`O(N^{1.5})` in two dimensions, where :math:`N` is the total number of
grid cells.  At 400×400 cells (:math:`N = 160\,000`) a single FD solve takes
about 5 s; at 200×200 cells (:math:`N = 40\,000`) it takes about 0.6 s.
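
As a rough consistency check, the exponent implied by the two timings quoted
above can be computed directly.  This is illustrative arithmetic on the rounded
figures, not part of the benchmark scripts:

.. code-block:: python

   import math

   # Rounded 2-D FD timings quoted above (seconds) and total cell counts.
   n_small, t_small = 200 * 200, 0.6
   n_large, t_large = 400 * 400, 5.0

   # If t ~ N**p, then p = log(t_large / t_small) / log(n_large / n_small).
   exponent = math.log(t_large / t_small) / math.log(n_large / n_small)
   print(f"implied exponent: {exponent:.2f}")  # close to the quoted 1.5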

The **FFT** method solves the problem in the spectral domain and scales as
:math:`O(N \log N)`.  In two dimensions it is roughly two orders of magnitude
faster than FD at the same grid size, but requires a uniform (scalar) elastic
thickness and imposes periodic boundary conditions (or zero-padded behaviour
with ``no_outside_loads``).
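
To illustrate what a spectral solve involves, here is a minimal 1-D NumPy
sketch for a uniform flexural rigidity :math:`D`.  It assumes the thin-plate
balance :math:`D\,w'''' + \Delta\rho\,g\,w = q` with periodic boundaries; the
function name, arguments, and sign convention are hypothetical and are not
taken from gFlex's implementation:

.. code-block:: python

   import numpy as np


   def flexure_fft_1d(q, dx, D, drho_g):
       """Periodic 1-D deflection for load q (Pa) and uniform rigidity D (N m).

       drho_g is the restoring-force gradient (rho_m - rho_fill) * g in N/m^3.
       """
       k = 2.0 * np.pi * np.fft.rfftfreq(q.size, d=dx)  # angular wavenumbers
       w_hat = np.fft.rfft(q) / (D * k**4 + drho_g)     # divide by operator symbol
       return np.fft.irfft(w_hat, n=q.size)


   # A spatially uniform load recovers the isostatic limit w = q / drho_g.
   q = np.full(256, 1.0e6)
   w = flexure_fft_1d(q, dx=1000.0, D=1.0e23, drho_g=3300.0 * 9.8)

The division by a single wavenumber-dependent factor is what ties this approach
to a uniform rigidity, and the discrete Fourier transform is what makes the
domain periodic.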

The **SAS** method superposes analytical deflection kernels using
``fftconvolve``, scaling as :math:`O(N^2 \log N)` in two dimensions.  It is
load-pattern-independent and faster than FD for small to moderate grids,
but becomes comparable to or slower than FD at very large grids.  It also
requires a uniform elastic thickness.

The :math:`T_e` profile has a modest effect on FD timing: the shaded band spans
the minimum to maximum across all profiles tested (constant, sinusoidal, abrupt
step, tanh sigmoidal, correlated noise, and wide dynamic range) and is narrow,
typically within ±15 % of the mean.  Spatially noisy profiles show slightly
higher times at large grid sizes, likely due to increased fill-in during sparse
LU factorisation.

LU factorisation cache
-----------------------

.. figure:: _static/bench_lu_cache.png
   :width: 100%
   :alt: LU cache speedup via the run() path

   Speedup of the ``True`` and ``"no_check"`` cache modes over uncached
   (``False``) via the full :meth:`~gflex.F1D.run` path.
   The band shows the range across all Te profiles; the line is the mean.
   The dotted line at 1× marks no speedup.

When the load :math:`q_s` changes between calls but the grid, elastic
thickness, and boundary conditions remain fixed, the coefficient matrix does
not change.  Caching the LU factorisation eliminates the cost of re-factorising
on every call, reducing the per-solve work to one forward and one back
substitution with the stored triangular factors.
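
The idea can be sketched with SciPy's ``splu``, which returns a reusable
factorisation object.  The pentadiagonal matrix below is only a stand-in for a
1-D flexure operator, and the variable names are illustrative; how gFlex stores
and reuses its factorisation internally is not shown here:

.. code-block:: python

   import numpy as np
   from scipy.sparse import diags
   from scipy.sparse.linalg import splu, spsolve

   n = 200
   # Stand-in operator: fourth-difference stencil plus a restoring term.
   A = diags([1.0, -4.0, 7.0, -4.0, 1.0], [-2, -1, 0, 1, 2],
             shape=(n, n), format="csc")

   rng = np.random.default_rng(0)
   loads = rng.normal(size=(5, n))  # five different right-hand sides

   lu = splu(A)                                   # factorise once
   cached = [lu.solve(q) for q in loads]          # substitutions only
   uncached = [spsolve(A, q) for q in loads]      # re-factorises on every call

Both lists hold the same solutions; only the amount of repeated work differs.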

In two dimensions the ``"no_check"`` mode reaches a **7–12× speedup** over
uncached at grid sizes of 50×50 to 400×400 cells.  The speedup grows with
grid size because the factorisation cost (eliminated by caching) scales roughly
as :math:`O(N^{1.5})` while the cached solve (forward and back substitution)
scales roughly as :math:`O(N)`.

The ``True`` mode (hash-validated cache reuse) provides a similar trend
but at a lower speedup because it computes and compares a matrix hash on every
call.  For large 2-D grids the hash cost is a small fraction of the total, so
``True`` and ``"no_check"`` converge.  For 1-D problems, where solve times are
short, the hash overhead is comparatively larger.
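
The trade-off between the three modes can be illustrated with a toy cache.
Everything below (the ``LUCache`` class, the ``matrix_digest`` helper, and the
choice of SHA-1 over the CSC arrays) is a hypothetical sketch of the general
pattern, not gFlex's actual implementation:

.. code-block:: python

   import hashlib

   import numpy as np
   from scipy.sparse import diags
   from scipy.sparse.linalg import splu


   def matrix_digest(A):
       """Hash the CSC arrays that define the matrix (hypothetical helper)."""
       h = hashlib.sha1()
       for arr in (A.data, A.indices, A.indptr):
           h.update(np.ascontiguousarray(arr).tobytes())
       return h.hexdigest()


   class LUCache:
       """Toy cache with modes False, True, and "no_check"."""

       def __init__(self, mode):
           self.mode = mode
           self.lu = None
           self.digest = None
           self.n_factorizations = 0

       def solve(self, A, q):
           # Only the hash-validated mode pays for a digest on every call.
           digest = matrix_digest(A) if self.mode is True else None
           if self.mode is False or self.lu is None or digest != self.digest:
               self.lu = splu(A)
               self.digest = digest
               self.n_factorizations += 1
           return self.lu.solve(q)


   A = diags([1.0, -4.0, 7.0, -4.0, 1.0], [-2, -1, 0, 1, 2],
             shape=(50, 50), format="csc")
   q = np.ones(50)

   counts = {}
   for mode in (False, True, "no_check"):
       cache = LUCache(mode)
       for _ in range(3):
           cache.solve(A, q)
       counts[mode] = cache.n_factorizations

In this sketch the uncached mode factorises on all three calls, while the other
two factorise once; the hash-validated mode additionally hashes the matrix on
each call, which is the overhead described above.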

The timings include all overhead from a ``run()`` call: :meth:`bc_check`,
coordinate setup, warning checks, and the :meth:`_solve_fd` cache-bypass logic.
This is more representative of real coupling-loop performance than timing
:meth:`fd_solve` in isolation.

See :doc:`api` for usage details, including the smart invalidation mechanism
that automatically clears the cache when any matrix-determining input (``te``,
``dx``, boundary conditions, etc.) is reassigned.
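
One common way to implement this kind of invalidation in Python is a property
setter that drops the cached factorisation whenever the attribute is
reassigned.  The class and attribute names below are hypothetical and only
illustrate the pattern; :doc:`api` describes the actual behaviour:

.. code-block:: python

   class CachedSolver:
       """Toy solver state; names are illustrative, not gFlex's API."""

       def __init__(self, te):
           self._te = te
           self._lu = None  # cached factorisation, built lazily

       @property
       def te(self):
           return self._te

       @te.setter
       def te(self, value):
           self._te = value
           self._lu = None  # the matrix depends on te, so the cache is stale

       def factorise(self):
           if self._lu is None:
               # Placeholder standing in for a real sparse LU factorisation.
               self._lu = ("LU for te", self._te)
           return self._lu


   solver = CachedSolver(te=20e3)
   first = solver.factorise()
   solver.te = 35e3            # reassignment clears the cache
   second = solver.factorise()  # rebuilt for the new value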

Cost of changing :math:`T_e`
-----------------------------

.. figure:: _static/bench_te_sweep.png
   :width: 100%
   :alt: Per-solve cost: load-only vs Te change

   Per-solve cost when only the load changes (load-only, ``"no_check"``
   mode) versus when the scalar elastic thickness :math:`T_e` changes on
   every call (Te change, any cache mode).

Reassigning :math:`T_e` triggers smart cache invalidation: the coefficient
matrix is cleared and the LU factorisation is discarded.  The next
:meth:`~gflex.F1D.run` call rebuilds the matrix from scratch and re-factorises.
All three cache modes (``False``, ``True``, ``"no_check"``) pay essentially the
same cost when :math:`T_e` changes, because the rebuild dominates the per-call
budget.

At 200×200 cells the load-only cost is roughly **53 ms** per solve while a
:math:`T_e` change costs roughly **590 ms** per solve, about an **11× penalty**.
At 400×400 cells the load-only cost is roughly **400 ms** per solve while a
:math:`T_e` change costs roughly **5.3 s** per solve, about a **13× penalty**.
The gap widens with grid size because factorisation scales as :math:`O(N^{1.5})`
and the incremental triangular solve scales as :math:`O(N)`.

The practical implication: in a coupling loop where :math:`T_e` is fixed and
only :math:`q_s` varies (e.g., a transient ice-sheet or sediment-loading model),
``cache_factorization = "no_check"`` or ``True`` provides substantial speedup.
In a parameter-sweep or inversion where :math:`T_e` changes on every iteration,
all modes are equivalent and the rebuild cost is unavoidable.
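
A back-of-the-envelope estimate shows how the per-solve figures above add up
over a coupling loop.  The loop length is hypothetical, and the sketch assumes
that an uncached step costs about the same as a step after a :math:`T_e`
change:

.. code-block:: python

   # Rounded per-solve costs quoted above for a 200x200 grid (seconds).
   t_rebuild = 0.590    # matrix build + factorisation + solve
   t_load_only = 0.053  # cached factorisation, only the load changes

   n_steps = 1000  # hypothetical coupling-loop length

   # Uncached: every step pays the full rebuild cost (assumption, see above).
   total_uncached = n_steps * t_rebuild
   # Cached: pay the rebuild once, then only the cheap substitutions.
   total_cached = t_rebuild + (n_steps - 1) * t_load_only

   print(f"uncached: {total_uncached:.0f} s, cached: {total_cached:.1f} s, "
         f"ratio: {total_uncached / total_cached:.1f}x")

For long loops the ratio approaches the per-solve penalty quoted above, since
the single rebuild is amortised over all steps.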