GeodeBench v0 (MVP)
- S slice (quadratic): t_2 only; target alpha.
- G slice (multivariate): small t_2, t_3 nonzero; target alpha.
Tasks:
- Coefficient prediction: recover truncated coefficients from samples.
- Geode recovery: predict alpha from slice inputs.
- Invariance checks: evaluate symmetry/generalization splits.
Generate data (small demo):
python bench/generate_slices.py --degrees 3,5,8 --trials 10 --out docs/assets/geode_slices.csv
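A quick way to sanity-check the demo output, assuming only that the generator writes a flat CSV at the path above (column names are whatever the generator emits):

```python
import pandas as pd

# Inspect the demo slice data produced by bench/generate_slices.py
df = pd.read_csv("docs/assets/geode_slices.csv")
print(df.shape)                 # rows x columns
print(df.columns.tolist())      # actual column names from the generator
print(df.head())
```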
Large-scale generator with splits/shards:
python bench/generate_geodebench.py \
--degrees 3,5,8,12,16,20 \
--trials 1000 \
--slices S,FUSS3,FUSS4,BITRI,MIXED \
--out_prefix docs/assets/geodebench \
--shard_size 50000 \
--seed 0
# writes docs/assets/geodebench_part-0000.csv, ...
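For the sharded output, one way to load all parts into a single frame (a minimal sketch; adjust the glob pattern if you change --out_prefix):

```python
import glob
import pandas as pd

# Collect all shards written by bench/generate_geodebench.py and concatenate them
parts = sorted(glob.glob("docs/assets/geodebench_part-*.csv"))
df = pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)
print(len(parts), "shards,", len(df), "rows")
```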
Splits:
- easy: S, FUSSd slices
- medium: BITRI
- hard: MIXED (and unseen higher degrees)
Invariance (scaffolding):
- Euler residual |(V - E + F) - 1|, averaged over small layering levels; lower is better (a minimal sketch follows).
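A minimal sketch of the residual above, assuming per-level vertex/edge/face counts are available as arrays (the argument names here are illustrative, not the benchmark's schema):

```python
import numpy as np

def euler_residual(V, E, F):
    """Mean of |(V - E + F) - 1| across layering levels (lower is better)."""
    V, E, F = np.asarray(V), np.asarray(E), np.asarray(F)
    return float(np.mean(np.abs((V - E + F) - 1)))

# Example with three small layering levels (each satisfies V - E + F = 1, so residual is 0)
print(euler_residual(V=[4, 9, 16], E=[6, 16, 28], F=[3, 8, 13]))
```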
Starter notebook: notebooks/GeodeBench_Starter.ipynb
Baselines:
# Linear baseline
python scripts/baseline_transformer.py --in docs/assets/geode_slices.csv --out_csv docs/assets/gb_baseline_demo.csv
# Tiny Transformer (PyTorch)
python scripts/baseline_tiny_transformer.py --in docs/assets/geode_slices.csv --out_csv docs/assets/gb_tinytx_demo.csv --epochs 10 --hidden 64 --heads 4 --layers 2
# Sharded input example
python scripts/baseline_tiny_transformer.py --in docs/assets/geodebench_part-0000.csv,docs/assets/geodebench_part-0001.csv --out_csv docs/assets/gb_tinytx_sharded.csv --epochs 3
Leaderboard submission (CSV):
- Columns: Method,Split,Metric,Score,NumExamples (an example writer is sketched below)
- Validate locally:
python scripts/leaderboard_validate.py --submission docs/assets/gb_tinytx_sharded.csv
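A minimal sketch of writing a submission with the columns above; the method name, split label, and score are placeholders, not real results, and the exact metric string accepted by the validator should be checked against scripts/leaderboard_validate.py:

```python
import csv

# Placeholder row: replace Method/Split/Score/NumExamples with your own results.
rows = [
    {"Method": "my_model", "Split": "easy", "Metric": "MAE(alpha)",
     "Score": 0.123, "NumExamples": 1000},
]

with open("docs/assets/my_submission.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Method", "Split", "Metric", "Score", "NumExamples"])
    writer.writeheader()
    writer.writerows(rows)
```

Then run the validator above on docs/assets/my_submission.csv.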
Head-to-head and break-even (predict+polish):
# Generate quick Newton vs Hybrid CSV
python scripts/bench_newton_vs_hybrid.py --degrees 3,5,8 --trials 3 --out docs/assets/newton_vs_hybrid_quick.csv
# Compute break-even vs Newton assuming AI inference = 0.05 ms and a 40% reduction in polish time (factor 0.6)
python scripts/bench_h2h_predict_polish.py \
--bench_csv docs/assets/newton_vs_hybrid_quick.csv \
--base_method hybrid \
--compare_method newton \
--inference_ms 0.05 \
--polish_factor 0.6 \
--out_prefix docs/assets/h2h_quick
# Outputs: docs/assets/h2h_quick_detail.csv, h2h_quick_summary.csv
Interpretation:
- On small degrees and CPU, Newton is extremely fast; AI+polish won’t beat it unless inference is near-zero and polish reduces time drastically.
- On harder distributions (clusters, |t|≈1) or with GPU-batched polish, the break-even moves in favor of AI+polish.
Why predict+polish makes sense
We compare Newton-only time t_newton to predict+polish time t_ai = t_infer + s_polish × f_reduction × t_polish_baseline, where:
- t_infer: model inference time per instance
- s_polish: hardware scaling (e.g., GPU speedup for polish)
- f_reduction: reduction in polish work from better seeds (measured)
- t_polish_baseline: polish time from a standard seed
Predict+polish wins once t_ai < t_newton. This holds in practical regimes (a small numeric sketch follows the list):
- Batched workloads (vision calibration/inversion; spectral/AR roots)
- Streaming/nearby instances (control pole placement)
- Tough root geometry (clusters, near-multiples, |t|≈1)
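A minimal sketch of the comparison, using the inference and polish settings from the quick-bench command above; the Newton and polish baseline times here are placeholders, not measurements:

```python
def predict_polish_time_ms(t_infer, s_polish, f_reduction, t_polish_baseline):
    """t_ai = t_infer + s_polish * f_reduction * t_polish_baseline (all times in ms)."""
    return t_infer + s_polish * f_reduction * t_polish_baseline

# Placeholder per-instance times (ms); substitute measured values from the benchmark CSVs.
t_newton = 0.40
t_ai = predict_polish_time_ms(t_infer=0.05, s_polish=1.0, f_reduction=0.6, t_polish_baseline=0.30)
print(f"t_ai = {t_ai:.3f} ms ->", "predict+polish wins" if t_ai < t_newton else "Newton wins")
```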
Next in docs: we will report measured f_reduction from learned seeds and GPU s_polish, plus exact thresholds per degree and toughness.
Break-even results (early)
Artifacts:
- Cluster epsilon summary: docs/assets/h2h_tough_hd_cluster_epsilon_summary.csv
- Scale ratio summary: docs/assets/h2h_tough_hd_scale_ratio_summary.csv
- Plots: docs/assets/h2h_cluster_epsilon.png, docs/assets/h2h_scale_ratio.png
Highlights:
- deg=32 (cluster): first epsilon where AI+polish wins is ~1e-5 (with GPU-like polish scaling).
- deg=32 (ill-scaled): first scale ratio where AI+polish wins is ~1.0 (with GPU-like polish scaling).
- deg=24: no wins yet on the sampled grid; tighter epsilons and more seeds may change this.
High-degree batched polish (CPU vs MPS)
Plot: docs/assets/high_degree_cpu_vs_mps.png
Measured (MPS on Apple Silicon; a batched-polish sketch follows the list):
- deg=64, B=1024: near parity.
- deg=128, B=1024: ~2.9× faster vs CPU.
- deg=256, B=512: ~2.1× faster vs CPU.
- deg=512, B=512: ~3.6× faster vs CPU.
- deg=1024, B=256: ~3.4× faster vs CPU.
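The timings above come from the repo's own benchmark scripts; purely as an illustration of what "batched polish" means here, a minimal PyTorch sketch of vectorized Newton refinement is shown below (real and imaginary parts are kept separate because complex dtypes are only partially supported on MPS; this is not the repo's implementation):

```python
import torch

def newton_polish(cr, ci, zr, zi, iters=20):
    """Batched Newton polish of root seeds for monomial-basis polynomials.

    cr, ci: (B, d+1) real/imag coefficient tensors, highest degree first.
    zr, zi: (B, d)   real/imag parts of the seed roots to refine.
    """
    for _ in range(iters):
        pr = torch.zeros_like(zr); pi = torch.zeros_like(zi)   # p(z)
        dr = torch.zeros_like(zr); di = torch.zeros_like(zi)   # p'(z)
        for k in range(cr.shape[1]):                           # Horner: dp = dp*z + p; p = p*z + c_k
            dr, di = dr * zr - di * zi + pr, dr * zi + di * zr + pi
            pr, pi = pr * zr - pi * zi + cr[:, k:k + 1], pr * zi + pi * zr + ci[:, k:k + 1]
        denom = dr * dr + di * di + 1e-30                      # Newton step: z <- z - p / p'
        zr, zi = zr - (pr * dr + pi * di) / denom, zi - (pi * dr - pr * di) / denom
    return zr, zi

device = "mps" if torch.backends.mps.is_available() else "cpu"
B, d = 512, 256                                                # batch size and degree, as in the table above
cr = torch.randn(B, d + 1, device=device); ci = torch.randn(B, d + 1, device=device)
zr = torch.randn(B, d, device=device);     zi = torch.randn(B, d, device=device)
zr, zi = newton_polish(cr, ci, zr, zi)
```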
Newton vs AI+Polish at high degrees
Plot: docs/assets/newton_vs_ai_polish_hd.png
- Newton modeled from degrees 32/64 (quadratic trend) vs AI+Polish per-instance time derived from batched CPU and MPS runs (inference=0.05ms, polish_factor=0.6); the extrapolation is sketched after these bullets.
- Shows the crossover where AI+Polish overtakes Newton at higher degrees, especially with MPS.
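A minimal sketch of that modeling, with placeholder timings (the real numbers live in the benchmark CSVs): fit t_newton(d) ≈ c·d² from two measured degrees, derive the AI+Polish per-instance time from a batched-polish run, and scan for the crossover degree.

```python
# Placeholder measurements (ms); replace with values from the benchmark outputs.
newton_ms = {32: 0.010, 64: 0.041}                                 # per-instance Newton time at two degrees
batch_polish_ms = {64: 20.0, 128: 80.0, 256: 400.0, 512: 1600.0}   # batched polish wall time per degree
batch_size = {64: 1024, 128: 1024, 256: 512, 512: 512}
inference_ms, polish_factor = 0.05, 0.6

# Quadratic trend for Newton: t_newton(d) ~= c * d^2, with c averaged over the measured degrees.
c = sum(t / d**2 for d, t in newton_ms.items()) / len(newton_ms)

for d in sorted(batch_polish_ms):
    t_newton = c * d**2
    t_ai = inference_ms + polish_factor * batch_polish_ms[d] / batch_size[d]
    winner = "AI+Polish" if t_ai < t_newton else "Newton"
    print(f"deg={d}: newton={t_newton:.3f} ms, ai+polish={t_ai:.3f} ms -> {winner}")
```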
Guidance: For high-degree, batched workloads in control/filters/PDE/crypto, MPS gains are material; predict+polish is compelling when paired with batched GPU polish.
Leaderboard (v0 preview)
Method | Split | Metric | Score |
---|---|---|---|
Linear baseline | Random (S+G) | MAE(alpha) |
Naive OEIS-style | Symmetry holdout | MAE(alpha) |
Tiny Transformer (stub) | Symmetry holdout | MAE(alpha) |