RFC 0002: State-of-the-art elevation, Stages 5 (scientific) and 6 (engineering)
| Status | DRAFT |
| Author | Idir Ben Slama |
| Created | 2026-05-17 |
| Last updated | 2026-05-17 |
| Discussion | landed alongside 0.8.3 (no PR review — author-acknowledged scope-setter) |
| Implementation PR(s) | (filled in as each phase ships) |
1. Summary
This RFC proposes the next two elevation stages beyond the 0.8.x Stage-3 + Stage-4 work already shipped. Stage 5 raises Ophamin from "internally rigorous + publicly browsable" to citable as a scientific methods framework (cross-framework validation, FWER correction, open benchmark deposit, research-grade reproducibility, peer review). Stage 6 raises it from "Apache-2.0 source on GitHub" to a place on the shelf next to scikit-learn / mlflow / pymc: PyPI + conda-forge installable, SLSA-signed releases, formal API stability policy, cross-language read APIs, community infrastructure.
The full per-phase detail lives in
ELEVATION_ROADMAP_2026_05_16.md §9–§12;
this RFC ratifies that detail under the L5 process so the plan is
reviewable + revisable + citable under the same lifecycle as any
future single-phase RFC.
2. Problem statement
The 0.8.x line closed every Stage-1–4 phase: internal hardening (S1–S6), external legitimacy (L1–L5), the Apache-2.0 relicense, the honest cross-platform coverage gate, the docs site under idirbenslama.github.io/Ophamin, and a portable Linux lockfile. The framework is internally rigorous and publicly browsable.
But the gap to state-of-the-art is concrete:
- No multiple-testing correction. Today `ophamin run-all` produces N proofs at independent α=0.05. With N=19 scenarios the probability of ≥ 1 spurious VALIDATED, if every null hypothesis holds, is ~62 %. A methods reviewer would flag this on first read of any draft.
- No external validation of measurement machinery. Family T (Bayesian + CRDT) cross-checks tools against themselves (pycrdt ↔ y-py, pyitlib ↔ ennemi) but not against independent implementations in other ecosystems (Stan, Yjs JS, Garak).
- Not installable. `pip install ophamin` does not work; only `pip install -e .` from a clone does. Conda-forge has no feedstock. New users cannot get the framework without a Git checkout.
- No stability contract. SCHEMAS.md formalises the codec policy but no Python-API equivalent exists. Public symbols have no tier annotation; renames could break downstream silently.
- No SLSA-level release provenance. Tarballs and wheels are unsigned; downstream cannot cryptographically attribute a release artefact to a specific commit + builder.
- Python-only. No read API in Rust / JS; every consumer must speak Python.
- No published benchmark dataset. The 19 scenarios self-validate on synthetic substrates but the corpus is not citable as a benchmark from a downstream paper.
- No peer review. No external reviewer has rebuilt the framework from source and verified the signed-record output byte-equal. No methods paper exists in any venue's review pipeline.
These are not aspirational gaps — each one is a concrete deficiency a SOTA-comparable framework (scikit-learn / mlflow / pymc) would close before shipping a 1.0.
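The ~62 % figure in the first gap follows from the family-wise error rate of N independent tests at level α when every null hypothesis is true: FWER = 1 − (1 − α)^N. A minimal sketch of that arithmetic (the function name is illustrative and not part of Ophamin):

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive across n independent
    tests, assuming every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


# 19 scenarios, each tested at alpha = 0.05 with no correction.
fwer = familywise_error_rate(0.05, 19)
print(f"{fwer:.3f}")  # 0.623
```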
3. Proposed change
Adopt the 10 phases specified in
ELEVATION_ROADMAP_2026_05_16.md
as the ratified plan for Stages 5 and 6. Each phase is independent
and ships as its own minor or patch release; the order below is the
recommended dependency order, not a hard sequence.
3.1 Stage 5: scientific tier (E1–E5)
| Phase | Deliverable | Acceptance |
|---|---|---|
| E1 Cross-framework validation | GWF↔Garak + Bayesian↔Stan + CRDT↔Yjs cross-checks; ≥ 3 signed VALIDATED proofs | proofs land under proofs/measurement_machinery/; agreement threshold documented per proof |
| E2 Formal statistical rigor | Multiple-testing correction on CampaignRecord (Holm–Bonferroni for FWER control, Benjamini–Hochberg for FDR control); Bayesian posterior updating across scenario runs; per-scenario power analysis | new pillars M.fwer + E.power; CampaignRecord schema → 2.0; corrected + raw verdicts both surfaced |
| E3 Open data + benchmarks | Zenodo-deposited benchmark corpus (100+ signed proofs + synthetic substrates); per-scenario reproducer notebooks; one methods paper draft | DOI separate from framework's; notebooks for ≥ 6 scenarios; paper draft in a software-paper venue (JOSS / SoftwareX / JMLR-OSS) |
| E4 Research-grade reproducibility | Per-OS lockfiles for every supported triple; deterministic-seed propagation audit; SLSA-3 attestations; cosign-signed container images; diffoscope-clean builds | external reviewer rebuilds a tagged release + verifies byte-equal SBOM + signed-record output |
| E5 Peer review + publication | Methods paper submitted; reviewer-time feedback addressed | accepted at JOSS / SoftwareX / JMLR-OSS |
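The two corrections named in E2 behave differently: Holm–Bonferroni is a step-down procedure that controls the family-wise error rate, while Benjamini–Hochberg is a step-up procedure that controls the false discovery rate. The sketch below shows both on a dictionary of per-claim p-values. It assumes each pre-registered claim yields a p-value; the function names, claim IDs, and p-values are illustrative and not taken from the Ophamin codebase.

```python
def holm_bonferroni(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Step-down Holm procedure; True means the null is rejected."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    rejected = {claim: False for claim in p_values}
    for rank, (claim, p) in enumerate(ordered):
        # The k-th smallest p-value (rank = k - 1) is compared with alpha / (m - k + 1).
        if p > alpha / (m - rank):
            break  # first failure: this and all larger p-values are retained
        rejected[claim] = True
    return rejected


def benjamini_hochberg(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Step-up BH procedure; True means the null is rejected."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    cutoff = 0
    for k, (_, p) in enumerate(ordered, start=1):
        # Track the largest k with p_(k) <= k * alpha / m.
        if p <= k * alpha / m:
            cutoff = k
    rejected = {claim: False for claim in p_values}
    for claim, _ in ordered[:cutoff]:
        rejected[claim] = True
    return rejected


# Hypothetical p-values for four pre-registered claims.
example = {"claim-a": 0.001, "claim-b": 0.01, "claim-c": 0.03, "claim-d": 0.04}
holm = holm_bonferroni(example)
bh = benjamini_hochberg(example)
```

On this input Holm rejects only `claim-a` and `claim-b`, while BH rejects all four, which is one reason to surface the chosen method next to the corrected verdicts.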
3.2 Stage 6: engineering tier (E6–E10)
| Phase | Deliverable | Acceptance |
|---|---|---|
| E6 PyPI + conda-forge + multi-platform wheels | Trusted-Publishing-based PyPI workflow; conda-forge feedstock; cibuildwheel scaffold | `pip install ophamin` works; conda recipe auto-updates per release |
| E7 Signed releases + SLSA provenance | Sigstore cosign signatures on all release artefacts; SLSA-3 build provenance; transparency-log entries | cosign verify + gh attestation verify succeed end-to-end |
| E8 API stability policy | Tier annotations (Stable / Provisional / Internal / Deprecated) on every public symbol; regression test pinning signatures; `ophamin api-stability check` CLI | renames fail the stability suite at PR time; deprecation policy mirrors SCHEMAS.md |
| E9 Cross-language interop | `crates/ophamin-proof` (Rust); `packages/ophamin-proof-js` (TS); both verify signed-record bytes | byte-equal signature verification across Python + Rust + JS on a 100-proof fixture |
| E10 Community infrastructure | GitHub Discussions; CODE_OF_CONDUCT.md; Sponsor button; GOVERNANCE.md; ROADMAP.md | visible from project root; CoC + governance reviewable |
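One way the E8 tier annotations and signature-pinning regression test could fit together is sketched below. The decorator `api_tier`, the `__api_tier__` attribute, the example function `run_scenario`, and the `PINNED` mapping are all hypothetical names introduced for illustration; the RFC does not fix an implementation.

```python
import inspect
from typing import Callable

TIERS = {"stable", "provisional", "internal", "deprecated"}


def api_tier(tier: str) -> Callable:
    """Attach a stability tier to a public symbol (hypothetical decorator)."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")

    def mark(obj):
        obj.__api_tier__ = tier
        return obj

    return mark


@api_tier("stable")
def run_scenario(name: str, seed: int = 0) -> dict:
    """Stand-in for a public entry point."""
    return {"scenario": name, "seed": seed}


def snapshot(symbols: list) -> dict[str, str]:
    """Map qualified name to signature string for every symbol marked stable."""
    return {
        fn.__qualname__: str(inspect.signature(fn))
        for fn in symbols
        if getattr(fn, "__api_tier__", None) == "stable"
    }


# Signatures recorded at release time; a rename or removal shows up as a mismatch.
PINNED = {"run_scenario": "(name: str, seed: int = 0) -> dict"}


def changed_symbols(symbols: list, pinned: dict[str, str]) -> list[str]:
    """Names whose current signature differs from the pinned one."""
    current = snapshot(symbols)
    return sorted(name for name, sig in pinned.items() if current.get(name) != sig)
```

A regression test would then assert that `changed_symbols(...)` is empty, so renaming the `name` parameter fails at PR time rather than in a downstream project.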
3.3 Public surface impact
- CLI: adds `ophamin compare update-posterior`, `ophamin scenario power-check`, `ophamin api-stability check` (E2, E2, E8 respectively). All exit-coded per the existing convention (0 happy / 2 documented failure / 64 misuse).
- Codec / wire format: `CampaignRecord` bumps `schema_version` to `2.0` to include corrected verdicts (E2). Reader continues to support `1.0` per the SCHEMAS.md backward-compat read-rule policy. No other schema changes.
- Protocol interface: `ScenarioScore` gains optional `minimum_detectable_effect: float | None` (E2). Existing scenarios continue to work; the field is computed lazily from the signed proof's pre-registered claim when present.
- Optional dependencies: new extras `[bayesian_stan]` (E1), `[cosign]` (E7), `[stability_test]` (E8). `[all]` continues to exclude provisional extras to keep the install footprint honest.
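The exit-code convention in the CLI bullet (0 happy / 2 documented failure / 64 misuse) has one practical wrinkle if the CLI is built on `argparse`: by default `argparse` exits with status 2 on bad arguments, which would collide with "documented failure". The sketch below shows one way to remap misuse to 64. Whether Ophamin uses `argparse` at all is an assumption, and the parser class, program name, and `--simulate-failure` flag are hypothetical.

```python
import argparse
import sys
from enum import IntEnum


class ExitCode(IntEnum):
    OK = 0                  # happy path
    DOCUMENTED_FAILURE = 2  # the command ran and reported a documented failure
    MISUSE = 64             # bad invocation (EX_USAGE in BSD sysexits.h)


class SketchParser(argparse.ArgumentParser):
    def error(self, message: str):
        # argparse would exit with status 2 here; remap misuse to 64.
        self.print_usage(sys.stderr)
        self.exit(int(ExitCode.MISUSE), f"{self.prog}: error: {message}\n")


def main(argv: list[str]) -> int:
    parser = SketchParser(prog="ophamin-sketch")
    parser.add_argument("command", choices=["check"])
    parser.add_argument("--simulate-failure", action="store_true")
    args = parser.parse_args(argv)
    if args.simulate_failure:
        return ExitCode.DOCUMENTED_FAILURE
    return ExitCode.OK
```

A real entry point would end with `sys.exit(main(sys.argv[1:]))` so the returned code becomes the process exit status.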
3.4 Backward compatibility
- Reading older records: `CampaignRecord/1.0` records remain readable indefinitely per SCHEMAS.md §"Backward compatibility on read". E2 adds a 2.0 producer; the reader stays both-versions aware.
- Writing: existing producers continue to work unmodified. Pre-registration discipline is unchanged; FWER correction is an additional surface on the aggregate, not a replacement.
- Deprecation: no fields removed. The pre-E2 single-α verdict semantics remain available via the `raw_verdict` field on the bumped CampaignRecord.
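A both-versions-aware reader can be sketched as a dispatch on `schema_version` that never mutates the stored record. The field names `schema_version`, `corrected_verdicts`, and `multiplicity_correction_method` come from this RFC; the `CampaignView` type, the function name, and the JSON encoding are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Optional

SUPPORTED_VERSIONS = {"1.0", "2.0"}


@dataclass(frozen=True)
class CampaignView:
    schema_version: str
    multiplicity_correction_method: str
    corrected_verdicts: Optional[dict]  # None when the record carries none
    fields: dict                        # every field of the record, untouched


def read_campaign_record(payload: str) -> CampaignView:
    data = json.loads(payload)
    version = data.get("schema_version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported CampaignRecord schema_version: {version!r}")
    if version == "1.0":
        # 1.0 predates multiplicity correction, so there is nothing to surface.
        return CampaignView(version, "none", None, data)
    return CampaignView(
        version,
        data.get("multiplicity_correction_method", "none"),
        data.get("corrected_verdicts"),
        data,
    )
```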
4. Alternatives considered
4.1 Alternative A: ship 1.0 with Stages 1–4 only
Concrete and shippable today. But premature: a 1.0 implies an explicit stability contract (E8) and an installable distribution (E6) — neither of which exists yet. Shipping 1.0 without them locks in the current surface while still calling it stable, which is dishonest.
4.2 Alternative B: ship Stage 5 (science) only, defer Stage 6
The methods paper (E5) would land before the framework is
`pip install`-able (E6). New readers of the paper would find
the framework only as a Git clone — which they'd reasonably
read as "this is research code, not production". The science
and engineering tiers are tightly coupled at the reception
layer; doing one without the other under-sells both.
4.3 Alternative C: ship Stage 6 (engineering) only, defer Stage 5
PyPI publication + SLSA signing without FWER correction or cross-framework validation produces a polished distribution of a methodology that hasn't been formally peer-reviewed. The reverse of 4.2: looks production, but a methods reviewer would still flag the multiple-testing gap on first read.
4.4 Do nothing
The framework stays internally rigorous + publicly browsable. New users continue to install via clone. The known statistical gap (~62 % spurious-verdict probability at run-all scale) remains. Citations remain owner-side outreach rather than reviewer-validated. Honest baseline if elevation is no longer the goal — but it's expressly the opposite of the user directive on 2026-05-17.
5. Drawbacks
- Scope is large. Stage 5 estimated 13–24 sessions; Stage 6 estimated 8–14 sessions. Cumulative 21–38 sessions of focused work. At Ophamin's current cadence (one Stage = days–weeks per phase) this is a 2–4 month elevation.
- Some phases depend on external review (E5 peer review, E4 reviewer-side rebuild). The framework cannot single-handedly close them; the L5-ratified plan must explicitly mark these as owner-driven external-dependency phases.
- Schema bump (CampaignRecord → 2.0) is the first non-trivial versioned bump. Validates the SCHEMAS.md policy under load but also exercises the migration-script discipline for the first time.
- Cross-language codecs (E9) double the maintenance surface. Adds a Rust crate + a TS package to keep in lockstep with the Python schema. Mitigation: read-only codecs only; cross-language write is deferred to a future RFC.
- PyPI Trusted Publishing requires GitHub-side OIDC config the agent cannot set unilaterally. Owner-side step on the critical path for E6.
- The roadmap could drift. This RFC ratifies the plan as of 2026-05-17. Re-ratification (a follow-on RFC) is required if any phase changes scope substantially.
6. Acceptance criteria
The RFC itself is ACCEPTED when:
- [x] This document exists at `docs/rfc/0002-sota-elevation-stages-5-and-6.md`
- [x] The 10 phases (E1–E10) are individually specified with acceptance criteria + estimated effort + dependency notes
- [x] Stages 5 + 6 are visible in `ELEVATION_ROADMAP_2026_05_16.md` §9–§12
- [x] The CHANGELOG entry for the release that lands this RFC links to it
- [x] mkdocs `--strict` build succeeds with the new RFC in nav
The full elevation is COMPLETE when every per-phase acceptance criterion in §3.1 + §3.2 has its own VALIDATED checkbox. Each phase ships under its own minor or patch release; this RFC does NOT have to land as a single PR with all 10 phases.
7. Migration plan
7.1 CampaignRecord 1.0 → 2.0 (E2)
The schema bump is a strictly additive change: 2.0 adds
`corrected_verdicts: dict[str, str]` (mapping pre-registered claim
ID → Holm/BH-corrected verdict) and
`multiplicity_correction_method: str` (one of `"holm"`, `"bh"`,
`"none"`). 1.0 readers ignore the new fields; 2.0 readers compute
corrected verdicts lazily from raw verdicts if the field is absent
(preserving backwards-read).

Migration tool: `tools/migrations/campaign_1_to_2.py` accepts a
directory of 1.0 records and emits 2.0 records into a sibling
directory, byte-faithful for every other field. Pre-registration
metadata is preserved verbatim.
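The shape of such a migration can be sketched as follows. This is not the contents of `tools/migrations/campaign_1_to_2.py`: it assumes records are stored as one JSON file each, it marks migrated records with method `"none"`, and it leaves `corrected_verdicts` absent so a 2.0 reader can compute them lazily. Whether the real tool computes Holm/BH verdicts at migration time, and how it treats signatures over the serialized bytes, is not specified here.

```python
import json
from pathlib import Path


def migrate_record(record: dict) -> dict:
    """Return a 2.0 copy of a 1.0 record; existing field values are carried over."""
    if record.get("schema_version") != "1.0":
        raise ValueError("expected a CampaignRecord/1.0 input")
    migrated = dict(record)  # shallow copy: the input is not mutated
    migrated["schema_version"] = "2.0"
    # Assumption: no correction has been applied to a freshly migrated record.
    migrated["multiplicity_correction_method"] = "none"
    return migrated


def migrate_directory(src: Path, dst: Path) -> int:
    """Write a 2.0 copy of every *.json record in src into dst; return the count."""
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(src.glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        out = json.dumps(migrate_record(record), indent=2)  # key order is preserved
        (dst / path.name).write_text(out, encoding="utf-8")
        count += 1
    return count
```

This sketch preserves field values and key order only; byte-level fidelity of the serialized form would depend on the codec's canonical encoding, which is not modelled here.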
7.2 ScenarioScore minimum_detectable_effect (E2)
Strictly additive. No migration required; existing producers emit
records without the field; existing readers fall back to None.
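The additive field and its `None` fallback can be sketched with a dataclass default. The example also shows one textbook way to compute a minimum detectable effect, for a two-sided, two-sample z-test with equal group sizes and known standard deviation. The RFC does not say which test Ophamin's power analysis uses, so the formula choice, the class `ScenarioScoreSketch`, and its `scenario_id` and `score` fields are assumptions.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist
from typing import Optional


@dataclass(frozen=True)
class ScenarioScoreSketch:
    scenario_id: str
    score: float
    # Absent in pre-E2 records, so readers fall back to None.
    minimum_detectable_effect: Optional[float] = None

    @classmethod
    def from_dict(cls, data: dict) -> "ScenarioScoreSketch":
        return cls(
            scenario_id=data["scenario_id"],
            score=data["score"],
            minimum_detectable_effect=data.get("minimum_detectable_effect"),
        )


def minimum_detectable_effect(
    n_per_group: int, sigma: float, alpha: float = 0.05, power: float = 0.80
) -> float:
    """Smallest mean difference detectable with the given power:
    (z_{1 - alpha/2} + z_{power}) * sigma * sqrt(2 / n)."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1.0 - alpha / 2.0)
    z_power = standard_normal.inv_cdf(power)
    return (z_alpha + z_power) * sigma * sqrt(2.0 / n_per_group)


legacy = ScenarioScoreSketch.from_dict({"scenario_id": "s-01", "score": 0.9})
mde = minimum_detectable_effect(n_per_group=100, sigma=1.0)
```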
7.3 No other migrations required
E1, E3, E4, E5–E10 add new artefacts but do not modify any existing signed-record schema.
8. Open questions
- JOSS vs SoftwareX vs JMLR-Open-Source-Software for E5. All three are viable. JOSS has the fastest turnaround but narrowest scope; JMLR-OSS the highest prestige but slowest review. Resolution: owner-territory, picked at submission time.
- Conda-forge feedstock owner. First-time submissions need a maintainer to commit to the feedstock; owner-territory.
- Rust vs Go for the cross-language codec in E9. Rust chosen in the roadmap (better cosign / signature-verification crate ecosystem). Open to revision if a downstream use case demands Go.
- SLSA Level 3 vs Level 4 for E7. Level 3 is the realistic GitHub-Actions ceiling today; Level 4 requires two-party review of every build, which conflicts with the single-author reality. Resolution: ship Level 3; document the Level-4 gap honestly.
- Whether to require a deterministic-seed audit for every scenario or just the ones with statistical claims. E4 says "every scenario"; pragmatically, the structural-tier scenarios (Family S substrate-completeness) may not need it. Resolution: start with the statistical-tier scenarios (Families A, B, C, L, M, U, V); revisit at E4 closeout.
9. References
- `ELEVATION_ROADMAP_2026_05_16.md`: the per-phase plan this RFC ratifies
- `SCHEMAS.md`: the codec versioning policy E2's schema bump exercises
- `CHANGELOG.md`: release-cadence record; each phase appears here as it ships
- RFC 0001: the retrospective pointer; this RFC is the first forward-looking ACCEPTED use of the L5 process once it merges
- Upstream prior art:
  - JOSS submission guidelines: https://joss.readthedocs.io/
  - SLSA framework: https://slsa.dev/
  - sigstore cosign: https://docs.sigstore.dev/cosign/
  - PyPI Trusted Publishing: https://docs.pypi.org/trusted-publishers/
  - Holm–Bonferroni correction (Holm 1979): DOI 10.2307/4615733
  - Benjamini–Hochberg correction (1995): DOI 10.1111/j.2517-6161.1995.tb02031.x