schema#
Production parquet schema for the Vulcan tier.
Vulcan writes uniform, streaming-friendly parquet shards to a
production HuggingFace repo. The schema is intentionally a strict
superset of stringforge.vacuavault.schema.REQUIRED_COLUMNS so a
Vulcan shard is loadable by the existing vault tooling after column
projection – this is the compatibility hook that enables a future
Vulcan -> VacuaWriter promotion step.
The schema is frozen at SCHEMA_VERSION. Forward-compatible
additions (new optional columns) are allowed; column removals or type
changes require a version bump.
- stringforge.vulcan.schema.CERT_COLUMNS: Tuple[str, ...] = ('verifier_id', 'cert_status', 'cert_checks', 'kahler_metric_min_eig', 'hessian_min_eig', 'mass_min_eig', 'record_id')#
Certification columns (the “certified-vacuum record” contract).
These pin which verifier certified a row and what it found, so a later verifier-bug fix can select exactly the affected records and transition their status rather than silently leaving corrupt vacua published. Motivated by the 2026-06-02 incident (
worklog/ERRORS.md): the Kähler-metric positive-definiteness check was once omitted and ~465 saddle points were miscounted as vacua. Withoutverifier_idthat class of bug is unrecoverable at community scale.verifier_id– content hash naming the verifier (seestringforge.vulcan.verifier); the join key for “which rows did this verifier certify?”.cert_status– one ofCERT_STATUSES. A verifier-bug fix transitions a record (never deletes it), mirroring thevacua_vaultretractions.parquetpattern.cert_checks– JSON dict of per-check pass/fail booleans for the physicality cascade (runaway, dilaton, cone, metric-PD, plus the stability min-eigs). Makes the 2026-06-02 signature (“passed cone but failed metric-PD”) queryable.kahler_metric_min_eig– the number whose absence was the bug; storing it makes a recurrence detectable in aggregate.hessian_min_eig/mass_min_eig– distinguish a genuine minimum from a saddle.record_id– unique row keysha(geometry_id ‖ rounded-flux ‖ verifier_id)for retraction targeting. Physical dedup of a vacuum uses the verifier-independent(geometry_id, rounded-flux)separately.
- stringforge.vulcan.schema.CERT_STATUSES: Tuple[str, ...] = ('certified', 'provisional', 'invalidated')#
Recognised values of the
cert_statuscolumn.certified= passed the full cascade under itsverifier_id;provisional= community-submitted, awaiting independent server-side re-cert;invalidated= a later verifier-bug fix demoted it.
- stringforge.vulcan.schema.GEOMETRY_KEYS: Tuple[str, ...] = ('h11', 'h12', 'ks_id', 'triang_id', 'conifold_id', 'cicy_id')#
Geometry keys, in canonical order, used to derive
geometry_id. Stable ordering is critical: the hash must be reproducible across cluster nodes and across stringforge versions.h11,h12are interpreted in mirror (jaxvacua / lcs_tree) convention, matchingLCSDatabase.
- stringforge.vulcan.schema.GEOMETRY_KEY_SENTINEL: int = -1#
Sentinel for geometry-key fields when not applicable to the model (e.g. a CICY shard has
ks_id = triang_id = -1). Matches the convention used bystringforge.vacua_writer._extract_model_identity().
- stringforge.vulcan.schema.OPTIONAL_COLUMNS: Tuple[str, ...] = ('h11', 'h12', 'ks_id', 'triang_id', 'conifold_id', 'cicy_id', 'W_re', 'W_im', 'F_terms_re', 'F_terms_im', 'g_s', 'mass2', 'm_gravitino', 'residual', 'n_iterations', 'is_susy', 'is_isd', 'solver_name', 'solver_config_hash', 'git_sha', 'seed', 'wall_clock_s', 'tags', 'extra_data', 'verifier_id', 'cert_status', 'cert_checks', 'kahler_metric_min_eig', 'hessian_min_eig', 'mass_min_eig', 'record_id')#
Optional columns recognised by the schema. Shards may carry any subset; downstream consumers must tolerate missing optionals.
- stringforge.vulcan.schema.REQUIRED_COLUMNS: Tuple[str, ...] = ('flux', 'moduli_re', 'moduli_im', 'tau_re', 'tau_im', 'run_id', 'geometry_id', 'tadpole_charge', 'created_at')#
Required columns on every Vulcan shard. Strict superset of the vault floor; the extras are the production telemetry every cluster write must carry.
- stringforge.vulcan.schema.SCHEMA_VERSION: int = 1#
Vulcan production schema version. Bump on any non-additive change.
- stringforge.vulcan.schema.canonical_flux(flux)#
Coerce a flux vector to a canonical tuple of Python ints.
Used as the verifier-independent component of both the physical dedup key
(geometry_id, canonical_flux)and the per-rowrecord_id(). Rounding to int is intentional: the integer flux quanta are the physical identity of a vacuum, independent of any floating-point representation a producer happened to use.
- stringforge.vulcan.schema.geometry_id(geometry)#
Compute a stable, content-addressed identifier for a Calabi-Yau model.
The identifier is the first 16 hex characters of
sha1(canonical-json(geometry-keys)). Sufficient as a join key against a Vulcan catalogue; not intended as a cryptographic digest. Stable across processes, machines, and Python runs. Missing keys are filled withGEOMETRY_KEY_SENTINELso a sparse mapping yields the same id as one that lists every key with explicit sentinels.Note
geometryMUST be a server-canonicalised index dict (mirror convention; produced by the canonical identity path), never derived per-row from untrusted submitter input – otherwise the same physical geometry can hash two ways and dedup silently fails. An absent key maps to the sentinel-1(“not applicable”); an explicit0(e.g.conifold_id=0, the conifold-limit model) is a genuinely distinct value, NOT normalised to the sentinel.- Parameters:
geometry (
Mapping[str,Any]) – Mapping carrying any subset ofGEOMETRY_KEYS.- Returns:
str – 16-hex-character identifier.
- Return type:
- stringforge.vulcan.schema.is_safe_run_id(run_id)#
Predicate:
run_idis safe to use as a parquet filename component on POSIX, as an HF path segment, and as a vault label.Reuses
stringforge.vacuavault.schema.LABEL_SLUG_REso a Vulcan run_id is admissible as a vault label after a future promotion step. Run-ids containing'__'are rejected because that sequence is reserved by the staging-area encoding (writer.pyreplaces'/'with'__'to produce local filenames).
- stringforge.vulcan.schema.metric_pd_passed(cert_checks_json, min_eig)#
Positive assertion that a row’s own evidence confirms a positive-definite Kähler metric:
cert_checks['kahler_metric_pd'] is Trueand a finitekahler_metric_min_eig > 0.This is the positive form, and the distinction is load-bearing: the 2026-06-02 incident was an omitted check (no
kahler_metric_pdkey,min_eigleftNaN), not a check that explicitly failed. A negative test (“is it explicitly False or<= 0?”) would let an omitted check pass; this predicate treats a missing key, an unparseablecert_checks, or a non-finite / non-positive eigenvalue as not passed. Both the community ingest gate and the snapshot citation gate use it, so the two agree by construction.
- stringforge.vulcan.schema.pyarrow_schema(extra_columns=None)#
Return the canonical pyarrow schema for a Vulcan shard.
Required columns appear in
REQUIRED_COLUMNSorder; the recognised optionals follow.extra_columnslets a caller extend the schema with project-specific fields without modifying the canonical layout.- Parameters:
extra_columns (
Optional[Mapping[str,DataType]]) – Optional mapping{name: pyarrow_type}of project-specific extension columns. Names must not collide with canonical fields.- Returns:
pa.Schema – The full schema, ready to pass to
:func:`pyarrow.parquet.write_table`.
- Raises:
ValueError – If an entry in
extra_columnsshadows a canonical field name.- Return type:
Schema
- stringforge.vulcan.schema.record_id(geometry, flux, verifier_id)#
Compute the unique row key of a certified-vacuum record.
The key is
sha1ofgeometry_id ‖ canonical-flux ‖ verifier_id, truncated to 24 hex characters. It is the retraction-targeting handle: re-certifying the same physical vacuum under a newverifier_idproduces a differentrecord_id(a new row, preserving history) – so this key must NOT be used for physical dedup. Physical dedup uses the verifier-independent(geometry_id, canonical_flux(flux))pair instead.- Parameters:
- Returns:
str – 24-hex-character unique row identifier.
- Return type:
- stringforge.vulcan.schema.shard_key(geometry, bucket_bits=4)#
Return the
(h12, h11, bucket)key used to organise shards on disk.bucketis drawn from the topbucket_bitsofsha1(geometry_id). Bucketing prevents a single(h11, h12)point from accumulating an unmanageably large directory once many geometries are present.
- stringforge.vulcan.schema.shard_relpath(geometry, run_id, bucket_bits=4)#
Canonical HF-repo-relative path for a Vulcan shard.
Layout:
h12_{H}/h11_{H}/bucket_{B}/run_{run_id}.parquet.- Parameters:
geometry (
Mapping[str,Any]) – Geometry mapping (seegeometry_id()).run_id (
str) – Slug-safe identifier (seeis_safe_run_id()).bucket_bits (
int) – Width of the bucket prefix.
- Returns:
str – Forward-slash-separated path inside the repo.
- Return type:
- stringforge.vulcan.schema.validate_cert_status(status)#
Validate a
cert_statusvalue againstCERT_STATUSES.- Parameters:
status (
str) – The candidate status string.- Raises:
ValueError – If
statusis not a recognised certification status.- Return type:
- stringforge.vulcan.schema.validate_schema(df, *, warn_unknown=True)#
Validate that
dfsatisfies the Vulcan schema.Aggregates every problem into a single
ValueError– callers see all schema violations, not just the first one. Only cheap structural checks are performed (column presence, run-id slug shape, homogeneity ofgeometry_id); no row-level numerical validation.- Parameters:
df (
DataFrame) – Coerced DataFrame ready to be written as a shard.- Raises:
ValueError – If any structural invariant is violated. The message names every detected problem.
- Return type:
- stringforge.vulcan.schema.vault_floor_compatible(df)#
Test whether
dfcarries every column required by the curated vault layer (stringforge.vacuavault.schema.REQUIRED_COLUMNS).Used as a precondition for a future Vulcan-to-VacuaWriter promotion step.
- Parameters:
df (
DataFrame) – A coerced or freshly-loaded Vulcan shard.- Returns:
bool –
Trueifdfis loadable by the vault toolingafter column projection.
- Return type: