Ophamin elevation roadmap: from working to solid + legit

Status: strategic plan, 2026-05-16. Written after every open item from the prior architecture audits had been closed (Moves A through N, ending at v0.4.0). The framework now does what the README + the protocols.py + the architectural docs say it does. This roadmap is the next-stage question:

How do we elevate Ophamin to something more solid and legit?

"Solid" is internal: would survive a hostile code review by a distributed-systems team. "Legit" is external: would survive a hostile review by an academic / engineering audit body.

Owner-locked constraints (2026-05-16):

Open-source. The framework ships under the Apache License 2.0 (see LICENSE + NOTICE). Every elevation phase assumes public-OSS posture: public docs site, public CI, DOI-citable, community-shaped RFC process, validation studies open to third-party replication.

Do not rename Ophamin. The name (from the angelic order Ophanim — "wheels within wheels, covered with eyes") is the framework's stable identifier. Architectural changes happen under the existing name; downstream forks that diverge architecturally choose their own name rather than retaining "Ophamin" (codified in NOTICE).

These two constraints lock in the full 4-stage elevation plan (~20–30 sessions). The §7 license-decision question below is now resolved: open-source.

0. Where we are

The framework's epistemic shape is finished:

6 wheels (seeing / measuring / comparing / instrumenting / auditing / reporting) all produce signed artifacts.
4 plug-in Protocols (Pillar / ScenarioProtocol / DatasetConnector / SubstrateProbe) all have a registration + discovery surface.
42 CLI subcommands covering scenario / proof / audit-record / pillar / corpus / substrate / drift-detect / watch-proofs / inspect / inspect-all / report / report-batch / summarize / diagnose / analyze / run-all + 26 others.
5 result-record types (EmpiricalProofRecord, AuditRecord, DriftScan, RegressionAlertRecord, CampaignRecord) all signed + content-addressed + HMAC-verifiable + JSON-round-trippable.
1,148 tests passing across 32+ test files; 0 failures.

What's NOT there yet:

The framework runs against ~10% of Kimera's observable surface (per docs/KIMERA_OBSERVATIONAL_SURFACE_2026_05_15.md — measurement coverage, not framework architecture).
No external validation against another framework or ground-truth benchmark.
No published doc site, no DOI, no public CI, no security policy.
No formal types contract (mypy passes but isn't enforced strict).
No performance benchmarks pinned.
No formal RFC process for design changes.

1. What "solid" means + 6 phases that get us there

Phase S1 — type-checked end-to-end (mypy strict).

Configure mypy --strict for src/ophamin/.
Resolve type-errors layer-by-layer: protocols → registry → measuring/proof → comparing/synthesis → comparing/regression_alert → comparing/drift_detection → auditing → inspecting → reporting → cli.
Add py.typed marker so downstream consumers see the types.
CI gate: mypy strict must pass on every PR.
Estimated effort: 3-5 sessions. Effort is in fixing existing latent type imprecisions, not in changing the architecture.

Phase S2 — coverage measurement + targets.

Configure coverage.py with branch coverage.
Establish current baseline (likely ~80-90% line, ~70-80% branch).
Set targets: ≥ 90% line / ≥ 85% branch on every wheel.
CI gate: coverage may not regress on a PR.
Surface uncovered lines / branches in PR comments.
Estimated effort: 1-2 sessions.

Phase S3 — performance benchmarks.

tests/bench/ directory with pytest-benchmark micro-bench per pillar (SPC chart fitting on N=10⁴ samples, SPRT update cost, MixedLM fit cost), per codec (proof dump+load+verify_signature round-trip cost), per CLI cold-start (time ophamin --version).
Pin baseline numbers in a BENCHMARKS.md table.
CI gate: regression > 20% on any pinned bench fails.
Estimated effort: 2-3 sessions.
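The 20% gate described in Phase S3 can be sketched as a plain comparison between pinned and freshly measured timings. This is a minimal illustration, not the framework's CI code: `find_regressions`, the bench names, and every number below are hypothetical, and a real gate would read its inputs from pytest-benchmark output and BENCHMARKS.md.

```python
from __future__ import annotations


def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    threshold: float = 0.20,
) -> dict[str, float]:
    """Return {bench_name: relative_slowdown} for benches that regressed.

    A bench regresses when its current time exceeds the pinned baseline
    by more than `threshold` (0.20 == 20%). Benches missing from
    `current` are skipped here; a stricter gate might fail on them.
    """
    regressions: dict[str, float] = {}
    for name, base_seconds in baseline.items():
        if name not in current or base_seconds <= 0:
            continue
        slowdown = (current[name] - base_seconds) / base_seconds
        if slowdown > threshold:
            regressions[name] = slowdown
    return regressions


# Hypothetical pinned numbers in seconds; these are not measured values.
baseline = {"spc_fit_n1e4": 0.050, "sprt_update": 0.0020, "proof_roundtrip": 0.010}
current = {"spc_fit_n1e4": 0.065, "sprt_update": 0.0021, "proof_roundtrip": 0.010}

regressions = find_regressions(baseline, current)
for name, slowdown in regressions.items():
    print(f"{name}: {slowdown:.0%} slower than the pinned baseline")
```

A CI job built on this shape would exit non-zero whenever `regressions` is non-empty. Using a relative rather than absolute threshold keeps one rule usable across micro-benches whose timings differ by orders of magnitude.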

Phase S4 — reproducible builds + lockfile + container image.

requirements-lock.txt from pip-compile --strip-extras.
Optional uv.lock (uv replacement for pip).
Dockerfile that pins Python 3.12 + lockfile + entry-point ophamin verify.
Verified-rebuild guarantee: same lockfile + same Dockerfile = same ophamin --version + same test outcome.
Estimated effort: 1-2 sessions.

Phase S5 — supply-chain hygiene.

ophamin audit pyproject.toml (which uses pip-audit + cyclonedx-python-lib) emits a signed SBOM that ships with the release.
osv-scanner integration; weekly cron CI run; alert on new CVE.
Auto-fail CI if pip-audit reports a HIGH-or-CRITICAL CVE.
Published SECURITY.md already exists; add responsible-disclosure email + PGP key.
Estimated effort: 1 session.

Phase S6 — formal correctness specs.

For each signed-record codec, prove the round-trip invariant (load(dump(r)) == r in canonical form) via property-based test (hypothesis).
For the Pillar Protocol, prove isinstance(p, Pillar) for every registered pillar (Move G already does this; formalize as a property-test).
For the CRDT Laws scenario, prove the four laws hold across Hypothesis-generated op sequences (the scenario does this; formalize as a property test).
Estimated effort: 2-3 sessions.
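The round-trip invariant from Phase S6 can be illustrated with a stdlib-only sketch. Two assumptions are worth stating up front: the canonical form here is simply sorted-key compact JSON (not Ophamin's normative rules), and a seeded `random` generator stands in for Hypothesis so the example runs without third-party packages. The field names and the `REFUTED` verdict are hypothetical.

```python
from __future__ import annotations

import hashlib
import hmac
import json
import random


def dump_canonical(record: dict) -> bytes:
    """Serialize with sorted keys and fixed separators so equal records give equal bytes."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return text.encode("ascii")


def load_canonical(payload: bytes) -> dict:
    return json.loads(payload.decode("ascii"))


def sign(record: dict, key: bytes) -> str:
    """HMAC-SHA256 over the canonical bytes."""
    return hmac.new(key, dump_canonical(record), hashlib.sha256).hexdigest()


def verify(record: dict, signature: str, key: bytes) -> bool:
    return hmac.compare_digest(sign(record, key), signature)


def random_record(rng: random.Random) -> dict:
    """Generate a small nested record with hypothetical field names."""
    return {
        "scenario": rng.choice(["immune_siege", "crdt_laws"]),
        "seed": rng.randrange(2**32),
        "verdict": rng.choice(["VALIDATED", "REFUTED"]),
        "metrics": {f"m{i}": rng.randrange(-1000, 1000) for i in range(rng.randrange(4))},
        "notes": [rng.choice(["a", "é", "\n", '"']) for _ in range(rng.randrange(3))],
    }


KEY = b"demo-key-not-a-real-secret"
rng = random.Random(0)
checked = 0
for _ in range(200):
    record = random_record(rng)
    payload = dump_canonical(record)
    restored = load_canonical(payload)
    assert restored == record                    # load(dump(r)) == r
    assert dump_canonical(restored) == payload   # canonical bytes are stable
    assert verify(restored, sign(record, KEY), KEY)
    checked += 1
```

With Hypothesis the loop would become a `@given` test over a record strategy, which adds shrinking of failing cases. The invariant being checked stays the same.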

S1–S6 cumulative effort: 10–16 sessions.

2. What "legit" means + 6 phases that get us there

Phase L1 — public documentation site.

mkdocs-material configured with:
the entire docs/ tree pre-rendered,
per-module API reference (mkdocstrings),
tutorial: "your first scenario in 5 minutes",
tutorial: "wrap a third-party pillar in 50 LOC",
tutorial: "from ophamin run-all to a published proof",
architecture: re-render of the audit + extended-audit docs as canonical pages,
CHANGELOG mirror.
GitHub Pages deployment on every push to main.
Custom domain pointer (e.g. ophamin.idirbenslama.dev).
Estimated effort: 2-3 sessions.

Phase L2 — CITATION.cff + Zenodo DOI.

CITATION.cff already exists; verify metadata (authors, identifiers, version, keywords).
Connect Zenodo to the GitHub repo; mint a DOI on the next tagged release.
README badge: [![DOI](...)].
Estimated effort: 0.5 session.

Phase L3 — public CI.

GitHub Actions: matrix run across Python 3.12 / 3.13.
Jobs: lint (ruff) → typecheck (mypy strict) → test (pytest -q + coverage) → audit (pip-audit) → bench (pytest-benchmark with baseline comparison) → docs build (mkdocs build).
Badges in README.
Branch protection: every PR requires CI green.
Estimated effort: 1 session (workflows exist, need refinement).

Phase L4 — versioned schemas with migration guarantees.

Maintain a SCHEMAS.md cataloguing every signed-record schema (EmpiricalProofRecord 1.0, AuditRecord 1.0 + 1.1, DriftScan 1 + 2, CampaignRecord 1.0, RegressionAlertRecord 1.0). For each, declare:
current version
backward-compat read-policy (which older versions the codec accepts)
migration script when a major bump happens
guaranteed-stable fields vs deprecated-fields-with-removal-date
ophamin schema validate <record.json> CLI that runs the appropriate codec's structural check.
Semver promise: minor versions never break existing record JSONs; major versions ship migration scripts.
Estimated effort: 1-2 sessions.
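One way to picture the read-policy and semver promise from Phase L4 is a small lookup table plus two checks. This is a sketch under assumptions: the `READ_POLICY` table, the function names, and the three classification labels are hypothetical, and the accepted-version sets are illustrative rather than the codecs' actual policy.

```python
from __future__ import annotations

# Hypothetical policy table: schema name -> (current version, versions still readable).
# Version strings mirror those listed above; the accepted sets are assumptions.
READ_POLICY: dict[str, tuple[str, frozenset[str]]] = {
    "EmpiricalProofRecord": ("1.0", frozenset({"1.0"})),
    "AuditRecord": ("1.1", frozenset({"1.0", "1.1"})),
    "CampaignRecord": ("1.0", frozenset({"1.0"})),
}


def parse_version(version: str) -> tuple[int, int]:
    """Parse 'MAJOR.MINOR' (or a bare 'MAJOR') into a comparable tuple."""
    major, _, minor = version.partition(".")
    return int(major), int(minor or 0)


def check_readable(schema: str, version: str) -> str:
    """Classify a record as 'current', 'readable', or 'needs-migration'."""
    if schema not in READ_POLICY:
        raise KeyError(f"unknown schema: {schema}")
    current, accepted = READ_POLICY[schema]
    if version == current:
        return "current"
    if version in accepted:
        return "readable"
    return "needs-migration"


def is_breaking_bump(old: str, new: str) -> bool:
    """Under the semver promise, only a major-version change may break existing records."""
    return parse_version(new)[0] > parse_version(old)[0]


print(check_readable("AuditRecord", "1.0"))
print(is_breaking_bump("1.0", "1.1"), is_breaking_bump("1.1", "2.0"))
```

Keeping the policy in data rather than in codec branches means a validation CLI and a catalogue document could both be generated from the same table.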

Phase L5 — RFC process + contributor onboarding.

docs/rfc/0001-template.md + docs/rfc/README.md documenting the process.
Convert the existing architecture audits into RFC-numbered documents in retrospect (RFC 0001: Move A scenario metadata, RFC 0002: Move B proof codec, …).
CONTRIBUTING.md already exists; expand with the RFC-first rule for design changes vs the PR-first rule for bug fixes.
Add a "good first issue" label workflow.
Estimated effort: 1 session.

Phase L6 — scientific validation studies.

For each shipped scenario, publish (in docs/validation/):
the cross-framework comparison: "Immune Siege using Ophamin's pipeline vs running the same offensive-security corpus through Garak / promptfoo / [other]"
the substrate-independence study: run the scenario against MockSubstrate with controlled noise; verify the verdict tracks the noise level as predicted.
reproducibility report: same seed + same Kimera commit + same Ophamin commit → bit-identical proof_id.
Cross-validate the CRDT-laws scenario against Yjs's own JS test suite (translate one of theirs into a Hypothesis strategy).
Estimated effort: 4-6 sessions.

L1–L6 cumulative effort: 9.5–13.5 sessions.

3. Sequenced execution plan

If we commit to elevation, the cleanest order is:

Stage	Phases	Sessions	Outcome
Stage 1: internal hardening	S1 (mypy strict), S2 (coverage), S3 (benchmarks)	6–10	"every regression is caught by CI"
Stage 2: reproducible + secure	S4 (lockfile + container), S5 (supply chain), S6 (property tests)	4–6	"any reviewer can rebuild bit-identically; any CVE is alerted within 24h"
Stage 3: public legitimacy	L1 (docs site), L2 (DOI), L3 (public CI), L4 (schema policy)	4.5–6.5	"citable + browsable + every PR shows green; consumers can rely on schemas"
Stage 4: community + science	L5 (RFC process), L6 (validation studies)	5–7	"third-party contributors can navigate the design space; the framework's claims are independently checkable"
Stage 5: state-of-the-art scientific tier	E1 (cross-framework validation), E2 (statistical rigor), E3 (open data + benchmarks), E4 (research-grade reproducibility), E5 (peer review + publication)	13–24	"the framework's claims meet bar for a methods paper; cited externally"
Stage 6: state-of-the-art engineering tier	E6 (multi-platform wheels + PyPI/conda), E7 (signed releases + SLSA provenance), E8 (deprecation + stability policy), E9 (cross-language interop), E10 (community infrastructure)	8–14	"Ophamin is on the shelf next to scikit-learn / mlflow / pymc as a citable + installable + maintainable scientific framework"

Total: ~41–68 sessions if every phase lands (~20–30 for Stages 1–4, 13–24 for Stage 5, 8–14 for Stage 6). Each stage is independently shippable; the owner picks the cut-off. Stages 5–6 target state-of-the-art legitimacy; see §9 + §10 below for the detailed phase breakdowns.

4. The single highest-leverage move

If only ONE phase landed, the highest leverage by far is L1 (public documentation site) — because it's the gating phase for every other "legit" claim (no DOI without docs to cite; no community without docs to onboard; no scientific reviewer without docs to parse). It also surfaces every architectural gap the architecture audits already named (gap E inner-triad asymmetry surfaces visually in a tree of mkdocs pages; gap F regression-alert daemon now has a home as a tutorial).

If two phases: L1 + S1 (mypy strict). Together they give the framework a public face + a defensible internal contract.

5. What we should NOT do

Don't rename / re-brand the OFAMIN initialism. The pillar count has outgrown the six letters (gap D from the prior audit named this). The right move is to accept OFAMIN as a historical anchor rather than to retrofit a new acronym. Renames break every link in every doc + every shipped proof's pillar attribution.
Don't try to ship a pip install ophamin to PyPI without the L4 schema policy. Once a schema is in the wild, breaking it is a betrayal of consumers; a clear schema-policy document must precede any public distribution.
Don't open-source the framework before L5 (RFC process). Without an RFC process the project will accumulate ad-hoc design changes from contributors that drift the architecture. The RFC process is the gate that keeps the architectural-intent docs load-bearing.
Don't add new measurement scenarios beyond the current 19 before the elevation phases run. Scenario count growth without the structural-hygiene phases will widen rather than close the validation gap.

6. Honest unknowns

Whether the v0.4.0 design has hidden defects only mypy strict would surface. The 1,148 tests cover behavior; they don't catch type-level latent bugs. Stage 1 S1 will tell us.
Whether the OFAMIN-initialism + pillar-count mismatch is a cosmetic issue or a design issue. Gap D from the prior audit said it was structural-intent drift; my read above says it's cosmetic. Stage 4 L5 (RFC process) would force the question.
Whether L6 (validation studies) is feasible without a research partner. Cross-framework comparison + reproducibility studies are scientifically valuable but operationally heavy. May need external collaboration.
The proprietary-vs-open-source license decision. When this list was first drafted the LICENSE was Proprietary, while many of the "legit" phases (DOI, community, contributor onboarding) assume open-source. This was an owner decision, not a framework decision, and it has since been resolved in favor of the Apache License 2.0 (see §7).
Distribution channels. PyPI? Conda-forge? Private wheel server? All possible; each has different legitimacy implications.

7. License decision: RESOLVED (2026-05-16)

The previously-open question — Will Ophamin be open-source or stay proprietary? — was resolved by owner directive 2026-05-16: open-source under Apache License 2.0. Concrete consequences for the roadmap:

L1 docs site → public mkdocs (likely GitHub Pages, possibly a custom domain).
L2 DOI → Zenodo-mintable on the first tagged release; needs the repo to be GitHub-public.
L3 public CI → GitHub Actions matrix runs with public badges; branch protection on main.
L5 RFC process → community-shaped: outside contributors can propose RFCs through the standard PR flow; the existing architectural audit docs become RFC-0001 through RFC-0014 in retrospect.
L6 validation studies → eligible for academic-collaboration proposals; substrate-comparison studies against other frameworks (Garak, promptfoo, Frouros, …) become possible.

Effort: ~20–30 sessions for the full 4-stage path.

8. Naming decision: RESOLVED (2026-05-16)

The previously-deferred question — what to do about the OFAMIN initialism vs the now-broader pillar count — was resolved by owner directive 2026-05-16: do not rename Ophamin. The name is the stable identifier per NOTICE; the pillar count having outgrown the six letters is a historical-anchor situation, not a naming defect.

Concrete consequences:

Existing OFAMIN pillar identifiers (O.spc / O.srm / O.drift / A.sprt / M.mixed_effects / M.mea / I.cma / N.cross_validation) remain the canonical pillar names.
New pillars beyond the six get descriptive names rather than forcing them into the initialism (e.g. the proposed L.atency / B.andwidth / Availability / Σ.correlation pillars from the observational-surface doc would land as latency / bandwidth / availability / correlation rather than as new OFAMIN letters).
Diagnostic pillars (diagnostics.anticipatory / inertia / kernel_coupling) already use the descriptive shape; future diagnostic additions follow suit.
The name itself ("Ophamin") is reserved per NOTICE; architectural-divergence forks choose their own name.

Authored by Claude (Opus 4.7 1M context), 2026-05-16, after landing every open architectural gap (Moves A–N), cutting v0.4.0, and receiving owner-locked constraints (open-source + no rename). Ready to execute Stage 1 (S1 mypy strict + S2 coverage + S3 benchmarks) on owner go-ahead.

8.5. Stage 5 + 6: execution status (refreshed 2026-05-18)

This table tracks per-phase shipped state. Phases land as one or more patch/minor releases; the canonical record is the CHANGELOG.

Phase	Theme	Status	Releases
E2	FWER correction at campaign level + CampaignRecord/2.0 schema bump	✅ shipped	`0.9.0`
E6	PyPI Trusted Publishing release workflow + advisory-until-PyPI-enabled	✅ shipped (one owner step pending)	`0.9.1`, `0.9.2`
E7	SLSA 3 build provenance + sigstore signing + PEP 740 attestations	✅ shipped	`0.9.3`
E8	Python-API stability contract (`@Stable` / `@Provisional` / `@Internal` / `@Deprecated` decorators + regression suite + `ophamin api-stability` CLI)	✅ shipped	`0.10.0`, `0.10.1`
E10	Community infrastructure (GOVERNANCE / ROADMAP / SUPPORT / FUNDING / CoC link)	✅ shipped	`0.10.2`
E4	Research-grade reproducibility — deterministic-seed audit scenario + framework-wide audit gate + `SOURCE_DATE_EPOCH` build reproducibility	✅ shipped (per-OS lockfiles + cross-machine diffoscope remain owner-side)	`0.11.0`–`0.11.2`, `0.11.4`
E3.1	Concept walkthroughs (E2 + E4 + E8 + E1 demos under `examples/walkthrough_*.py`)	✅ shipped	`0.11.3`, `0.12.1`
E1.1	First cross-framework validation — PyMC↔NumPyro Bayesian posterior agreement + signed proof published under `proofs/measurement_machinery/bayesian_cross_framework/`	✅ shipped	`0.12.0`
E1.2	Second cross-framework validation — Wilson CI scipy↔statsmodels (machine-epsilon agreement, signed proof under `proofs/measurement_machinery/wilson_ci_cross_framework/`)	✅ shipped	`0.13.0`
E1.3	Third cross-framework validation — Spearman ρ scipy↔pingouin (exact 0.000 agreement, signed proof under `proofs/measurement_machinery/spearman_cross_framework/`)	✅ shipped	`0.13.0`
E1.4	Fourth cross-framework validation — Pearson r scipy↔numpy↔pingouin (three-way, machine-epsilon agreement, signed proof under `proofs/measurement_machinery/pearson_cross_framework/`)	✅ shipped	`0.14.0`
E1.5	Fifth cross-framework validation — Welch's t-test scipy↔statsmodels↔pingouin (three-way, both t and two-sided p, ~8× machine epsilon agreement; signed proof under `proofs/measurement_machinery/welch_t_cross_framework/`)	✅ shipped	`0.14.0`
E1.6	Sixth cross-framework validation — one-way ANOVA scipy↔statsmodels↔pingouin (three-way, both F and two-sided p, ~32× machine epsilon agreement; signed proof under `proofs/measurement_machinery/anova_cross_framework/`)	✅ shipped	`0.15.0`
E1.7	Seventh cross-framework validation — Mann-Whitney U scipy↔pingouin (exact agreement on both U and p under matched continuity; first non-parametric check in the portfolio; signed proof under `proofs/measurement_machinery/mann_whitney_cross_framework/`)	✅ shipped	`0.15.0`
E1 acceptance	≥ 3 cross-framework VALIDATED proofs under `proofs/measurement_machinery/`	✅ met (7 / 3)	`0.13.0`–`0.15.0`
E9 spec	Canonical-form byte representation promoted to normative `SCHEMAS.md` §"Canonical-form determinism (normative)" R1–R11 + three cross-language test fixtures under `tests/canonical_form/` with HMAC pins	✅ shipped	`0.14.0`
E5 draft	JOSS-style methods paper draft authored under `paper/paper.md` + `paper/paper.bib` (~1500 words, 7 cross-framework agreement proofs tabulated)	✅ shipped (refreshed `0.15.0`)	`0.14.0`, `0.15.0`
E9.1 read-side	Rust `crates/ophamin-proof` (read-only verifier; serde_json + arbitrary_precision + custom escape_string per R6) + TS `packages/ophamin-proof-js` (custom JSON parser preserving int/float + Python-repr float formatter + ensure_ascii escape) + CI workflow `.github/workflows/cross-language.yml` running both fixture suites AND signature verification on every shipped Python-emitted signed proof	✅ shipped	`0.16.0`, `0.16.1`, `0.16.2`
E9.2 write-side	Canonical-form WRITERS in Rust + JS — `CanonicalValue::Object/Int/Float/...` + `canonicalize_bytes` + `sign_canonical` (Rust); `PyInt` wrapper + `canonicalize` + `signCanonical` (JS). 7 + 7 cross-language conformance fixtures pin byte-equality with Python emitter. Closes the "future" row from the 0.16.0 line.	✅ shipped	`0.21.0`, `0.21.1`, `0.21.2`
E9.3 MCP server	`ophamin mcp serve` — FastMCP server exposing 6 tools (`list_scenarios`, `get_scenario_claim`, `verify_proof`, `canonicalize_value`, `read_proof_index`, `run_scenario`). stdio + SSE + streamable-http transports. AI-agent interop path: any MCP client (Claude Code, Cursor, Cline) can drive Ophamin. Shared impls in `ophamin.interfaces._impls` reused by every other transport.	✅ shipped	`0.17.0`, `0.17.1`
E9.4 HTTP REST API	`ophamin http serve` — FastAPI app with 8 endpoints (`/health`, `/version`, `/scenarios`, `/claim`, `/verify`, `/canonicalize`, `/proofs/index`, `/run`). Auto-generated OpenAPI 3 spec at `/openapi.json`; Swagger UI at `/docs`; ReDoc at `/redoc`. Same shared impls.	✅ shipped	`0.18.0`
E9.5 CloudEvents wrapper	`ophamin.cloudevents.wrap(proof, source=...)` / `unwrap(envelope)` — wraps signed proofs in CloudEvents 1.0 structured-mode envelope for event-stream routing infrastructure (Kafka, EventBridge, Knative). Signature surface unchanged — the proof inside the envelope still verifies bit-equal.	✅ shipped	`0.19.0`
E9.6 OTel instrumentation	`ophamin.observability.setup_otel(otlp_endpoint=...)` — `OphaminInstrumentor` singleton + standard spans (`ophamin.scenario.run.*`, `ophamin.proof.verify`, `ophamin.canonical.encode`) + standard metrics (`ophamin_scenarios_run_total`, `ophamin_scenario_duration_seconds`, `ophamin_proofs_verified_total`, `ophamin_canonical_bytes_encoded`). Drop-in with Jaeger / Prometheus / Datadog / any OTLP-receiving backend.	✅ shipped	`0.20.0`
E9.7 fixture corpus extension	Cross-language canonical-form conformance grown from 3 → 5 fixtures. New `boundary_cases` (empty containers, control chars, JSON escape specials — R6 corners). New `deeply_nested` (4-level nested object tree, arrays-of-objects-of-arrays, recursive sort under R3). 21 Python + 12 JS + 10 Rust fixture-conformance tests.	✅ shipped	`0.24.0`
E9.8 end-to-end layer composition	`tests/test_interop_endtoend.py` — 11 end-to-end multi-layer tests pinning the "all five layers compose" promise (MCP ↔ HTTP ↔ CloudEvents ↔ OTel ↔ wire-format) with a single round-trip. Refuses behavioural drift between layers structurally.	✅ shipped	`0.24.0`
E5 update — paper interop section	`paper/paper.md` extended with `§Cross-host interoperability` describing all five interop layers + the cross-language wire-format round-trip. `paper/README.md` falsifiable-claims table grown from 8 → 12 rows. `paper/paper.bib` adds MCP / FastAPI / CloudEvents / OTel spec references.	✅ shipped (refreshed)	`0.23.0`
E5 owner-prep — INTEROP_OVERVIEW	New consolidated `docs/INTEROP_OVERVIEW.md` — single-page on-ramp covering every way to drive, consume, or observe Ophamin from outside Python. Decision tree by consumer shape + stability contract + cross-layer composition.	✅ shipped	`0.23.0`
E4 owner-prep — REPRODUCING.md	New `docs/REPRODUCING.md` — external-rebuild guide. 10-minute reproducer (clone + cross-language fixture verification + shipped proof end-to-end) + 1–2-hour full reproducer (matrix across Python + JS + Rust + `SOURCE_DATE_EPOCH` build reproducibility). Names what's still owner-side (diffoscope cross-machine + Zenodo + JOSS submission).	✅ shipped	`0.22.0`
E2 owner-prep — CITATION + Zenodo refresh	`CITATION.cff` + `.zenodo.json` refreshed to reflect the 0.16.x–0.21.x interop arc. New `related_identifiers` link Zenodo deposit to `SCHEMAS.md` (`isDocumentedBy`) + `paper/paper.md` (`isDescribedBy`).	✅ shipped	`0.22.0`
docs CI hygiene	`docs/INTEROP_OVERVIEW.md` + `docs/REPRODUCING.md` external `../path` links rewritten to absolute `https://github.com/IdirBenSlama/Ophamin/blob/main/...` URLs so mkdocs `--strict` accepts them.	✅ shipped	`0.24.1`
E3 owner-side	Zenodo benchmark deposit + DOI + reproducer notebooks for ≥ 6 scenarios	open (owner)	—
E4 owner-side	External reviewer rebuild verification (byte-equal SBOM + signed-record output)	open (owner)	—
E5 submission	Methods paper submission (JOSS / SoftwareX / JMLR-OSS) + reviewer-time feedback	open (owner: ORCID + venue + Zenodo DOI per `paper/README.md`)	—

Both 1.0.0 prerequisites met: wire-format stability contract (E2 — 0.9.0) and Python-API stability contract (E8 — 0.10.0). What separates the current 0.24.x line from 1.0.0 is external validation under real upgrade pressure — RFC 0002 §3.2 names "third party rebuilds a tagged release and verifies byte-equal SBOM + signed-record output" (E4 owner-side) and "methods paper passes review" (E5) as the two doors.

Between 0.16.0 and 0.24.1 the framework grew five interop layers (the cross-language wire format, with read-side Rust + JS verifiers and write-side Rust + JS emitters; the MCP server; the HTTP REST API; the CloudEvents 1.0 envelope; OTel instrumentation). The Python-hosted layers sit on top of the same shared ophamin.interfaces._impls substrate, so each of them routes through one Python function and behavioural drift between them is ruled out by construction. The single-page on-ramp for any external consumer is docs/INTEROP_OVERVIEW.md.
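To make the CloudEvents row in the table above concrete, here is a minimal sketch of a CloudEvents 1.0 structured-mode envelope around an already-signed proof. It is not the `ophamin.cloudevents` implementation: `wrap_proof`, `unwrap_proof`, the event type string, and the sample proof fields are all hypothetical. Only the four required context attributes (`specversion`, `id`, `source`, `type`) come from the CloudEvents specification.

```python
from __future__ import annotations

import json
import uuid
from datetime import datetime, timezone

REQUIRED_ATTRIBUTES = ("specversion", "id", "source", "type")


def wrap_proof(proof: dict, source: str, event_type: str = "dev.ophamin.proof.signed") -> dict:
    """Place the proof, untouched, in the `data` member of a structured-mode envelope."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": event_type,  # hypothetical type name
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": proof,
    }


def unwrap_proof(envelope: dict) -> dict:
    """Validate the required attributes and hand back the embedded proof."""
    missing = [name for name in REQUIRED_ATTRIBUTES if name not in envelope]
    if missing:
        raise ValueError(f"not a CloudEvents 1.0 envelope, missing: {missing}")
    if envelope["specversion"] != "1.0":
        raise ValueError(f"unsupported specversion: {envelope['specversion']!r}")
    return envelope["data"]


# Hypothetical signed proof; the signature covers the proof only, never the envelope.
proof = {"proof_id": "abc", "verdict": "VALIDATED", "signature": "deadbeef"}
envelope = wrap_proof(proof, source="urn:example:ophamin-runner")
restored = unwrap_proof(json.loads(json.dumps(envelope)))
```

Because the envelope only adds routing metadata around `data`, the proof that comes back out is the same object that went in, which is why the signature surface can stay unchanged.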

9. Stage 5: state-of-the-art scientific tier

Stage 5 raises Ophamin from "internally rigorous + publicly browsable" (Stages 1–4) to "citable as a scientific methods framework". Each phase has concrete acceptance criteria and a publication-shaped deliverable.

Phase E1 — cross-framework validation studies.

The framework's measurement-machinery tier is already self-validating (Bayesian posterior width contracts as 1/√N; CRDT laws cross-checked between pycrdt + y-py). Stage-5 extends this to cross-framework validation:

GWF-class scenarios cross-checked against Garak + promptfoo — run the same offensive-security corpus through both and surface the delta as a signed proof. The discipline: pre-register the expected agreement bound BEFORE running.
Pillar-class scenarios cross-checked against R / Stan / PyMC — the Bayesian-phi-posterior scenario already uses PyMC; add a Stan alternative under [bayesian_stan] and assert posterior contraction agrees within tolerance.
CRDT-laws cross-checked against Yjs's own JS test suite — translate one of Yjs's foundational tests into a Hypothesis strategy; verify Ophamin reaches the same fixed-point.

Acceptance: ≥ 3 cross-framework validation proofs published under proofs/measurement_machinery/; each is a VALIDATED record with a documented agreement threshold. Estimated effort: 3–5 sessions.

Phase E2 — formal statistical rigor.

Pre-registration discipline + Wilson 95 % CIs are already shipped. Stage-5 adds the statistical machinery a methods paper would expect:

Family-wise error rate (FWER) management across a campaign. When ophamin run-all produces N proofs, the probability of at least one spurious VALIDATED at α=0.05 climbs with N (for independent tests it is 1 − (1 − α)^N). Implement Holm–Bonferroni correction, which controls FWER, or Benjamini–Hochberg, which controls the false discovery rate instead, in the CampaignRecord aggregate; surface both raw and corrected verdicts.
Bayesian updating across a scenario sequence. When the same scenario runs N times against different commits, the posterior should update — not reset. Add ophamin compare update-posterior <scenario> that walks a directory of proofs and emits a posterior trajectory record.
Power analysis for every scenario at scenario-authoring time. Add ScenarioScore.minimum_detectable_effect or similar; warn at ophamin scenario show time when the configured N is below the power-80 threshold for the pre-registered claim.

Acceptance: a new pillar M.fwer (multiplicity correction) + a new E.power pillar (effect-size + power calculation). Both pass property tests. CampaignRecord schema bumps to 2.0 to include corrected verdicts.

Estimated effort: 2–4 sessions.
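The multiplicity problem behind Phase E2 can be sketched in a few lines: the uncorrected family-wise error rate for independent tests, and the step-down Holm–Bonferroni procedure that brings it back under α. The function names and the p-values below are hypothetical; this is not the campaign-level implementation, only an illustration of surfacing raw and corrected verdicts side by side.

```python
from __future__ import annotations


def fwer_uncorrected(alpha: float, n_tests: int) -> float:
    """P(at least one false positive) across n independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def holm_bonferroni(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return reject decisions in the original order (step-down Holm procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank == k - 1) is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one comparison fails, every larger p-value is kept as well
    return reject


raw_p = [0.001, 0.04, 0.03, 0.012]  # hypothetical per-proof p-values
raw_verdicts = [p <= 0.05 for p in raw_p]
corrected_verdicts = holm_bonferroni(raw_p)

print(f"uncorrected FWER at N=20: {fwer_uncorrected(0.05, 20):.3f}")
print("raw:      ", raw_verdicts)
print("corrected:", corrected_verdicts)
```

In this made-up campaign all four proofs pass at the raw α=0.05 level, but only the two smallest p-values survive the correction. That gap is what reporting both columns is meant to expose.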

Phase E3 — open data + benchmarks.

A SOTA scientific framework publishes its benchmark corpus:

Substrate-vs-claim benchmark suite. Curate 100+ signed proofs across the 32 scenarios + N synthetic-substrate variants; publish as a tagged Zenodo deposit with its own DOI separate from the framework's. The benchmark is then citable as a dataset — downstream papers can reference it directly.
Reproducer notebook per scenario — a Jupyter or marimo notebook that, given the published benchmark, reproduces every scenario's claim end-to-end. Owner-runnable; the notebook is the scientific artefact.
Hardness landscape paper — a short methods paper showing how scenario verdicts shift as substrate parameters vary (e.g. MockSubstrate noise level → ImmuneSiege false-positive rate trajectory). Submitted to a software-paper venue (JOSS, SoftwareX, JMLR-Open-Source).

Acceptance: one Zenodo benchmark deposit + one methods paper draft + reproducer notebooks for ≥ 6 scenarios.

Estimated effort: 3–5 sessions + owner-side paper authoring.

Phase E4 — research-grade reproducibility.

Beyond Stages 1–3's reproducible-build foundation:

Per-OS lockfiles for every supported triple: macOS-arm64-py312, linux-amd64-py312, linux-amd64-py313, linux-arm64-py312 (when wheels catch up). Stage-3 shipped two; Stage-5 adds the rest via uv pip compile --universal or per-platform Docker emit.
Deterministic-seed propagation audit. Every scenario should produce a bit-identical proof_id for the same (seed, corpus, substrate_commit, ophamin_commit) tuple across N machines. Add a MeasurementMachinery scenario that asserts this empirically.
In-toto / SLSA Level 3 attestations for every release artefact. GitHub Actions natively supports SLSA 3 via the slsa-framework reusable workflow; wire it.
Container image signing via cosign (sigstore). The 0.8.x Dockerfile produces an image; cosign-sign it on every release.
Diffoscope-clean builds — diffoscope should report zero meaningful diffs between two independent builds of the same commit.

Acceptance: at least one external reviewer rebuilds a tagged release from source + lockfile and verifies the SBOM byte-equal + signed-record byte-equal output. Published as a reproducibility report.

Estimated effort: 2–4 sessions + owner-side cross-validation.
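The deterministic-seed property from Phase E4 can be illustrated with a toy scenario whose only source of randomness is an explicitly seeded generator, and a content-addressed identifier over the inputs and result. Everything here is an assumption made for illustration: `run_scenario`, `proof_id`, the field layout, and the use of SHA-256 over sorted-key JSON are not taken from the framework.

```python
from __future__ import annotations

import hashlib
import json
import random


def run_scenario(seed: int, n: int = 100) -> dict:
    """Stand-in scenario: all randomness flows from one explicitly seeded generator."""
    rng = random.Random(seed)  # a local generator, never the global `random` state
    hits = sum(rng.random() < 0.5 for _ in range(n))
    return {"n": n, "hits": hits}


def proof_id(seed: int, corpus: str, substrate_commit: str, ophamin_commit: str) -> str:
    """Content-address the input tuple plus the result via canonical JSON + SHA-256."""
    body = {
        "seed": seed,
        "corpus": corpus,
        "substrate_commit": substrate_commit,
        "ophamin_commit": ophamin_commit,
        "result": run_scenario(seed),
    }
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Hypothetical corpus label and commit hashes.
first = proof_id(42, "corpus-v1", "abc123", "def456")
second = proof_id(42, "corpus-v1", "abc123", "def456")
other_seed = proof_id(43, "corpus-v1", "abc123", "def456")

print(first == second, first == other_seed)
```

A cross-machine audit would run the same tuple on several hosts and compare the resulting identifiers. Anything that leaks into the body from outside the tuple (wall-clock time, unordered iteration, global random state) shows up as a mismatch.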

Phase E5 — peer review + publication.

The capstone: a methods paper. Likely venues + paths:

JOSS (Journal of Open Source Software) — fast turnaround, reviewer focus on "is the software well-engineered + documented" rather than novel research. Realistic 1–3 month review cycle.
SoftwareX — Elsevier, scopes broader than JOSS, asks for some research narrative + reproducibility evidence.
JMLR-Open-Source-Software — narrowest scope (ML/stats software); prestige bump.

Acceptance: paper draft + JOSS-style review issue opened at the chosen venue. The framework's claims become independently checkable by reviewer-time peer review.

Estimated effort: 3–6 sessions of authoring + revision rounds.

Stage 5 cumulative effort: 13–24 sessions.

10. Stage 6: state-of-the-art engineering tier

Stage 6 raises Ophamin from "Apache-2.0 source on GitHub" to "on the shelf next to scikit-learn / mlflow / pymc — installable, signed, multi-platform, with formal stability guarantees".

Phase E6 — PyPI + conda-forge + multi-platform wheels.

Today the framework is GitHub-only; pip install ophamin doesn't work. Stage-6 phase one:

PyPI publication via Trusted Publishing. Add the OIDC config to .github/workflows/release.yml. Every v* tag triggers a sdist + pure-Python wheel build + upload. No long-lived PyPI tokens — the OIDC trust is the release credential.
conda-forge feedstock — separate repo conda-forge/ophamin-feedstock, the recipe meta.yaml derives from pyproject's extras. Conda users install via conda install -c conda-forge ophamin.
Multi-platform wheel matrix: for the future-when-Ophamin-grows-C-extensions case (it's pure-Python today, so a single wheel suffices); the scaffolding is cibuildwheel. Adds a workflow stub even though it's a no-op for the current code so future C extensions ship pre-built.

Acceptance: pip install ophamin works against PyPI; conda-forge recipe lands; both auto-update on every release.

Estimated effort: 1–2 sessions.

Phase E7 — signed releases + SLSA provenance.

Beyond Stage-5's research-reproducibility framing, the engineering SOTA requires:

Sigstore release signing — every PyPI artefact + container image cosign-signed; signatures published to a transparency log.
SLSA Level 3 build provenance — auto-generated by the release workflow; attached to every release as an in-toto attestation.
Source provenance: the GitHub commit signature chain back to the owner's GPG / SSH key. Verifiable via git verify-commit (gh attestation verify covers build attestations rather than commit signatures).

Acceptance: cosign verify succeeds against every release artefact; gh attestation verify confirms the build provenance.

Estimated effort: 1 session + GitHub owner-side cosign setup.

Phase E8 — deprecation + stability policy.

Today the framework lives under 0.x; backward compatibility is encouraged but not contractually guaranteed beyond the schema policy in SCHEMAS.md. State-of-the-art engineering demands an explicit stability contract:

API stability tiers — every public symbol carries an explicit tier in its docstring: Stable, Provisional, Internal, Deprecated (until <date>). The audit pillar mypy enforces this via a @stable / @provisional / @internal decorator stack (also exposed via mkdocstrings).
Deprecation policy: at least one full minor release of warning + the documented migration path before removal in a major bump. Mirrors the SCHEMAS.md policy at the Python level.
Stability test suite — for every Stable symbol, a regression test pins its signature + behavior. Adding parameters is OK; removing or renaming fails the test.
ophamin api-stability check CLI — surfaces deprecation warnings reachable from a user's code (point at a directory or a proof record's reproduction.command).
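A minimal sketch of what the tier decorators could look like, assuming the tier is stored as a plain attribute on the decorated symbol. The attribute name `__stability_tier__`, the `deprecated(until=..., use_instead=...)` signature, and the example functions are hypothetical illustrations, not the framework's actual API.

```python
import functools
import warnings
from typing import Any, Callable


def _tier(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Build a decorator that records a stability tier on a symbol."""
    def decorate(obj: Callable[..., Any]) -> Callable[..., Any]:
        # Hypothetical attribute name; tooling would read it back later.
        setattr(obj, "__stability_tier__", name)
        return obj
    return decorate


stable = _tier("Stable")
provisional = _tier("Provisional")
internal = _tier("Internal")


def deprecated(until: str, use_instead: str):
    """Mark a callable as deprecated and warn on every call."""
    def decorate(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            warnings.warn(
                f"{func.__name__} is deprecated (until {until}); "
                f"use {use_instead} instead",
                DeprecationWarning,
                stacklevel=2,  # point the warning at the caller's line
            )
            return func(*args, **kwargs)
        setattr(wrapper, "__stability_tier__", f"Deprecated (until {until})")
        return wrapper
    return decorate


# Hypothetical public symbols, for illustration only.
@stable
def verify_record(payload: dict) -> bool:
    return "signature" in payload


@deprecated(until="2027-01-01", use_instead="verify_record")
def check_record(payload: dict) -> bool:
    return verify_record(payload)
```

Because the tier is ordinary runtime metadata, a check such as "every public symbol annotated" reduces to walking the package's public names and asserting that the attribute is present.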

Acceptance: every public symbol annotated; the stability test suite catches accidental renames at PR time.
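The signature half of the stability test suite can be sketched with the standard-library `inspect` module. This is a conservative illustration under stated assumptions: the helper name `is_backward_compatible` and the example functions are hypothetical, and the check covers signatures only, not behavior.

```python
import inspect
from inspect import Parameter, Signature

_POSITIONAL = (Parameter.POSITIONAL_ONLY, Parameter.POSITIONAL_OR_KEYWORD)
_VARIADIC = (Parameter.VAR_POSITIONAL, Parameter.VAR_KEYWORD)


def is_backward_compatible(pinned: Signature, current: Signature) -> bool:
    """Return True if calls valid under `pinned` stay valid under `current`."""
    old, new = pinned.parameters, current.parameters

    for name, param in old.items():
        # Removing or renaming a pinned parameter breaks callers.
        if name not in new or new[name].kind != param.kind:
            return False
        # Dropping a default turns an optional argument into a required one.
        if param.default is not Parameter.empty and new[name].default is Parameter.empty:
            return False

    # Positional parameters must keep their order, with additions only appended.
    old_pos = [n for n, p in old.items() if p.kind in _POSITIONAL]
    new_pos = [n for n, p in new.items() if p.kind in _POSITIONAL]
    if new_pos[: len(old_pos)] != old_pos:
        return False

    # Any added parameter must be optional.
    for name, param in new.items():
        if name in old or param.kind in _VARIADIC:
            continue
        if param.default is Parameter.empty:
            return False
    return True


# Hypothetical versions of one Stable symbol.
def verify_v1(record, *, key=None): ...
def verify_v2_added(record, *, key=None, strict=False): ...
def verify_v2_renamed(payload, *, key=None): ...


pinned = inspect.signature(verify_v1)
print(is_backward_compatible(pinned, inspect.signature(verify_v2_added)))
print(is_backward_compatible(pinned, inspect.signature(verify_v2_renamed)))
```

In a real suite the pinned signature would presumably come from a committed fixture rather than from a live function, so that a rename in the source cannot silently update the pin.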

Estimated effort: 2–3 sessions.

Phase E9 — cross-language interop.

Today Ophamin is Python-only. SOTA scientific frameworks offer at least a read API in adjacent languages. Concrete shape:

Rust signed-record codec (read-only) — a cargo crate ophamin-proof that parses EmpiricalProofRecord JSON, verifies the signature, and exposes the data structurally. Validates the schema is platform-agnostic; opens the door to Rust scenarios.
JS / TypeScript signed-record codec — same pattern, for the browser-side replay-a-proof story.
Schema stability via cross-language tests — the Rust + JS codecs run against a fixture of 100 signed proofs from Python; byte-equal signature verification across all three.
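Byte-equal verification across three languages depends on every codec reproducing exactly the same bytes before hashing. The sketch below shows the Python side of such a contract, assuming a sorted-key compact JSON canonical form, a SHA-256 content address, and an HMAC-SHA256 signature. The field names (`payload`, `content_hash`, `signature`) and the canonicalization rule are assumptions for illustration; the framework's real record layout may differ.

```python
import hashlib
import hmac
import json


def canonical_bytes(payload: dict) -> bytes:
    """Serialize deterministically: sorted keys, no whitespace, UTF-8."""
    text = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode("utf-8")


def sign(payload: dict, key: bytes) -> dict:
    """Wrap a payload with its content hash and HMAC signature."""
    body = canonical_bytes(payload)
    return {
        "payload": payload,
        "content_hash": hashlib.sha256(body).hexdigest(),
        "signature": hmac.new(key, body, hashlib.sha256).hexdigest(),
    }


def verify(record: dict, key: bytes) -> bool:
    """Recompute hash and signature from the payload and compare."""
    body = canonical_bytes(record["payload"])
    expected_hash = hashlib.sha256(body).hexdigest()
    expected_sig = hmac.new(key, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking match position through timing.
    return hmac.compare_digest(expected_hash, record["content_hash"]) and \
        hmac.compare_digest(expected_sig, record["signature"])


key = b"demo-key"  # illustrative only; never hard-code a real key
record = sign({"scenario": "demo", "verdict": "VALIDATED", "n": 19}, key)

# Round-trip through JSON text, as a foreign-language codec would receive it.
received = json.loads(json.dumps(record))
print(verify(received, key))
```

One known hazard for this kind of fixture: number formatting is not uniform across languages (Python's `json.dumps(1.0)` emits `1.0`, while JavaScript's `JSON.stringify(1.0)` emits `1`), so the Rust and JS codecs would either need to verify against the original bytes or follow a shared canonicalization rule such as RFC 8785 (JSON Canonicalization Scheme).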

Acceptance: a crates/ophamin-proof/ + packages/ophamin-proof-js/ subdir; both ship their own CI; cross-language reproducibility test passes.

Estimated effort: 3–5 sessions.

Phase E10 — community infrastructure.

The last piece of the "alongside scikit-learn / mlflow / pymc" gap:

GitHub Discussions enabled — paired with the existing Issues templates.
Code of Conduct activated and visible.
Sponsor button — GitHub Sponsors / Open Collective. Not for fundraising; for signalling that the project accepts contributions at the community-economics layer.
Project governance doc — GOVERNANCE.md explaining the owner-as-BDFL state today + the path to a small core team if / when contributors arrive.
Annual roadmap doc — ROADMAP.md derived from this elevation roadmap, but refreshed each year. Shows external readers where the project is heading.
Quarterly state-of-Ophamin post — owner-territory; one short post per quarter showing what changed, what shipped, what's next. Doubles as a check-in artifact.

Acceptance: every infrastructure piece visible from https://github.com/IdirBenSlama/Ophamin.

Estimated effort: 1–2 sessions + ongoing owner cadence.

Stage 6 cumulative effort: 8–14 sessions.

11. Definition of "state-of-the-art" — what we measure against¶

The framing of Stages 5 + 6 is "be on the shelf next to scikit-learn / mlflow / pymc". Concrete criteria:

Criterion	scikit-learn	mlflow	pymc	Ophamin target after Stages 5+6
PyPI installable	✅	✅	✅	✅ Phase E6
Conda-forge available	✅	✅	✅	✅ Phase E6
Multi-platform CI matrix	Linux + macOS + Windows	Linux + macOS	Linux + macOS + Windows	Linux + macOS (Phase A2); Windows via E6 follow-up
mypy --strict clean	partial	partial	partial	✅ (Phase S1)
Property-test coverage	partial	partial	✅	✅ (Phase S6)
Methods paper published	n/a (textbook)	yes (Databricks blog series + workshop)	yes (JOSS)	Phase E5
DOI per release	no	no	✅ Zenodo	Phase L2 (owner-side activation)
Schema versioning policy	informal	yes (mlflow_model)	informal	✅ (Phase L4)
SLSA-signed releases	partial	yes	partial	Phase E7
Cross-language read API	C / R bindings	yes (JS / R / Java)	partial (R)	Phase E9
Public RFC process	yes	yes	yes	✅ (Phase L5)
Governance doc	✅	✅	✅	Phase E10
External contributors	thousands	hundreds	dozens	owner-territory
Citations	tens of thousands	thousands	thousands	owner-territory

What we can deliver via framework-internal work: every row above "External contributors". What we can't deliver via code alone: the last two rows, which are owner-driven (paper writing, community building, conference talks). State-of-the-art means delivering everything that's framework-internal + having the infrastructure ready for the external-driven parts.

12. The next single highest-leverage move (after Stages 1–4)¶

If only ONE Stage-5 phase landed first, Phase E2 (formal statistical rigor — FWER correction across a campaign) is the highest leverage. Three reasons:

It's framework-internal — no owner-side dependency.
It surfaces a real defect in the current methodology: today, an ophamin run-all that produces 19 scenario verdicts has a ~62 % chance of at least one spurious VALIDATED at α=0.05 from multiple testing alone (1 − 0.95^19 ≈ 0.62, assuming independent tests with every null hypothesis true). The fix is documented + implemented in a single round; the win is publishable as part of the methods paper.
It's the bridge between "rigorous internally" and "citable externally" — without it, a methods reviewer asks the multiple-testing question immediately.
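The ~62 % figure and one possible correction can be checked in a few lines. The sketch below computes the family-wise error rate for m independent tests and applies Holm's step-down procedure, a standard FWER-controlling method. Which correction Phase E2 actually adopts is not stated here, so treat Holm as an illustrative choice; the function names and p-values are made up for the example.

```python
def familywise_error_rate(m: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across m independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** m


def holm_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Holm step-down: compare the k-th smallest p-value to alpha / (m - k + 1)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):  # rank is 0-based, so divisor is m - rank
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# 19 verdicts at alpha = 0.05, as in the run-all example above.
print(f"{familywise_error_rate(19):.3f}")

# Made-up p-values: three would pass an uncorrected 0.05 threshold.
p = [0.001, 0.04, 0.03, 0.2]
print(holm_reject(p))
```

With the made-up p-values, only the first survives the correction: 0.001 clears 0.05/4, but 0.03 does not clear 0.05/3, so the procedure stops there.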

Second-highest: Phase E1 (cross-framework validation). It surfaces exactly the dependencies that make Ophamin not-yet-SOTA today: the framework's claims about Kimera are testable only against itself.