schema#

Production parquet schema for the Vulcan tier.

Vulcan writes uniform, streaming-friendly parquet shards to a production HuggingFace repo. The schema is intentionally a strict superset of stringforge.vacuavault.schema.REQUIRED_COLUMNS so a Vulcan shard is loadable by the existing vault tooling after column projection – this is the compatibility hook that enables a future Vulcan -> VacuaWriter promotion step.

The schema is frozen at SCHEMA_VERSION. Forward-compatible additions (new optional columns) are allowed; column removals or type changes require a version bump.

stringforge.vulcan.schema.CERT_COLUMNS: Tuple[str, ...] = ('verifier_id', 'cert_status', 'cert_checks', 'kahler_metric_min_eig', 'hessian_min_eig', 'mass_min_eig', 'record_id')#

Certification columns (the “certified-vacuum record” contract).

These pin which verifier certified a row and what it found, so a later verifier-bug fix can select exactly the affected records and transition their status rather than silently leaving corrupt vacua published. Motivated by the 2026-06-02 incident (worklog/ERRORS.md): the Kähler-metric positive-definiteness check was once omitted and ~465 saddle points were miscounted as vacua. Without verifier_id that class of bug is unrecoverable at community scale.

  • verifier_id – content hash naming the verifier (see stringforge.vulcan.verifier); the join key for “which rows did this verifier certify?”.

  • cert_status – one of CERT_STATUSES. A verifier-bug fix transitions a record (never deletes it), mirroring the vacua_vault retractions.parquet pattern.

  • cert_checks – JSON dict of per-check pass/fail booleans for the physicality cascade (runaway, dilaton, cone, metric-PD, plus the stability min-eigs). Makes the 2026-06-02 signature (“passed cone but failed metric-PD”) queryable.

  • kahler_metric_min_eig – the number whose absence was the bug; storing it makes a recurrence detectable in aggregate.

  • hessian_min_eig / mass_min_eig – distinguish a genuine minimum from a saddle.

  • record_id – unique row key sha(geometry_id rounded-flux verifier_id) for retraction targeting. Physical dedup of a vacuum uses the verifier-independent (geometry_id, rounded-flux) separately.

stringforge.vulcan.schema.CERT_STATUSES: Tuple[str, ...] = ('certified', 'provisional', 'invalidated')#

Recognised values of the cert_status column. certified = passed the full cascade under its verifier_id; provisional = community-submitted, awaiting independent server-side re-cert; invalidated = a later verifier-bug fix demoted it.

stringforge.vulcan.schema.GEOMETRY_KEYS: Tuple[str, ...] = ('h11', 'h12', 'ks_id', 'triang_id', 'conifold_id', 'cicy_id')#

Geometry keys, in canonical order, used to derive geometry_id. Stable ordering is critical: the hash must be reproducible across cluster nodes and across stringforge versions. h11, h12 are interpreted in mirror (jaxvacua / lcs_tree) convention, matching LCSDatabase.

stringforge.vulcan.schema.GEOMETRY_KEY_SENTINEL: int = -1#

Sentinel for geometry-key fields when not applicable to the model (e.g. a CICY shard has ks_id = triang_id = -1). Matches the convention used by stringforge.vacua_writer._extract_model_identity().

stringforge.vulcan.schema.OPTIONAL_COLUMNS: Tuple[str, ...] = ('h11', 'h12', 'ks_id', 'triang_id', 'conifold_id', 'cicy_id', 'W_re', 'W_im', 'F_terms_re', 'F_terms_im', 'g_s', 'mass2', 'm_gravitino', 'residual', 'n_iterations', 'is_susy', 'is_isd', 'solver_name', 'solver_config_hash', 'git_sha', 'seed', 'wall_clock_s', 'tags', 'extra_data', 'verifier_id', 'cert_status', 'cert_checks', 'kahler_metric_min_eig', 'hessian_min_eig', 'mass_min_eig', 'record_id')#

Optional columns recognised by the schema. Shards may carry any subset; downstream consumers must tolerate missing optionals.

stringforge.vulcan.schema.REQUIRED_COLUMNS: Tuple[str, ...] = ('flux', 'moduli_re', 'moduli_im', 'tau_re', 'tau_im', 'run_id', 'geometry_id', 'tadpole_charge', 'created_at')#

Required columns on every Vulcan shard. Strict superset of the vault floor; the extras are the production telemetry every cluster write must carry.

stringforge.vulcan.schema.SCHEMA_VERSION: int = 1#

Vulcan production schema version. Bump on any non-additive change.

stringforge.vulcan.schema.canonical_flux(flux)#

Coerce a flux vector to a canonical tuple of Python ints.

Used as the verifier-independent component of both the physical dedup key (geometry_id, canonical_flux) and the per-row record_id(). Rounding to int is intentional: the integer flux quanta are the physical identity of a vacuum, independent of any floating-point representation a producer happened to use.

Parameters:

flux (Iterable[Any]) – An iterable of integer-valued flux components.

Returns:

Tuple[int, …] – The flux as a tuple of rounded Python ints.

Return type:

Tuple[int, ...]

stringforge.vulcan.schema.geometry_id(geometry)#

Compute a stable, content-addressed identifier for a Calabi-Yau model.

The identifier is the first 16 hex characters of sha1(canonical-json(geometry-keys)). Sufficient as a join key against a Vulcan catalogue; not intended as a cryptographic digest. Stable across processes, machines, and Python runs. Missing keys are filled with GEOMETRY_KEY_SENTINEL so a sparse mapping yields the same id as one that lists every key with explicit sentinels.

Note

geometry MUST be a server-canonicalised index dict (mirror convention; produced by the canonical identity path), never derived per-row from untrusted submitter input – otherwise the same physical geometry can hash two ways and dedup silently fails. An absent key maps to the sentinel -1 (“not applicable”); an explicit 0 (e.g. conifold_id=0, the conifold-limit model) is a genuinely distinct value, NOT normalised to the sentinel.

Parameters:

geometry (Mapping[str, Any]) – Mapping carrying any subset of GEOMETRY_KEYS.

Returns:

str – 16-hex-character identifier.

Return type:

str

stringforge.vulcan.schema.is_safe_run_id(run_id)#

Predicate: run_id is safe to use as a parquet filename component on POSIX, as an HF path segment, and as a vault label.

Reuses stringforge.vacuavault.schema.LABEL_SLUG_RE so a Vulcan run_id is admissible as a vault label after a future promotion step. Run-ids containing '__' are rejected because that sequence is reserved by the staging-area encoding (writer.py replaces '/' with '__' to produce local filenames).

Parameters:

run_id (str) – Candidate identifier.

Returns:
  • boolTrue if and only if run_id is a string matching

  • ``LABEL_SLUG_RE`` and does not contain ``’__’``.

Return type:

bool

stringforge.vulcan.schema.metric_pd_passed(cert_checks_json, min_eig)#

Positive assertion that a row’s own evidence confirms a positive-definite Kähler metric: cert_checks['kahler_metric_pd'] is True and a finite kahler_metric_min_eig > 0.

This is the positive form, and the distinction is load-bearing: the 2026-06-02 incident was an omitted check (no kahler_metric_pd key, min_eig left NaN), not a check that explicitly failed. A negative test (“is it explicitly False or <= 0?”) would let an omitted check pass; this predicate treats a missing key, an unparseable cert_checks, or a non-finite / non-positive eigenvalue as not passed. Both the community ingest gate and the snapshot citation gate use it, so the two agree by construction.

Parameters:
  • cert_checks_json (Any) – The row’s cert_checks value (a JSON string; may be None).

  • min_eig (Any) – The row’s kahler_metric_min_eig value.

Returns:

boolTrue iff the row positively asserts a PD Kähler metric.

Return type:

bool

stringforge.vulcan.schema.pyarrow_schema(extra_columns=None)#

Return the canonical pyarrow schema for a Vulcan shard.

Required columns appear in REQUIRED_COLUMNS order; the recognised optionals follow. extra_columns lets a caller extend the schema with project-specific fields without modifying the canonical layout.

Parameters:

extra_columns (Optional[Mapping[str, DataType]]) – Optional mapping {name: pyarrow_type} of project-specific extension columns. Names must not collide with canonical fields.

Returns:
  • pa.Schema – The full schema, ready to pass to

  • :func:`pyarrow.parquet.write_table`.

Raises:

ValueError – If an entry in extra_columns shadows a canonical field name.

Return type:

Schema

stringforge.vulcan.schema.record_id(geometry, flux, verifier_id)#

Compute the unique row key of a certified-vacuum record.

The key is sha1 of geometry_id canonical-flux verifier_id, truncated to 24 hex characters. It is the retraction-targeting handle: re-certifying the same physical vacuum under a new verifier_id produces a different record_id (a new row, preserving history) – so this key must NOT be used for physical dedup. Physical dedup uses the verifier-independent (geometry_id, canonical_flux(flux)) pair instead.

Parameters:
  • geometry (Mapping[str, Any]) – Geometry mapping (see geometry_id()).

  • flux (Iterable[Any]) – The integer flux vector of the vacuum.

  • verifier_id (str) – The certifying verifier’s content hash (see stringforge.vulcan.verifier).

Returns:

str – 24-hex-character unique row identifier.

Return type:

str

stringforge.vulcan.schema.shard_key(geometry, bucket_bits=4)#

Return the (h12, h11, bucket) key used to organise shards on disk.

bucket is drawn from the top bucket_bits of sha1(geometry_id). Bucketing prevents a single (h11, h12) point from accumulating an unmanageably large directory once many geometries are present.

Parameters:
Returns:

Tuple[int, int, int](h12, h11, bucket).

Return type:

Tuple[int, int, int]

stringforge.vulcan.schema.shard_relpath(geometry, run_id, bucket_bits=4)#

Canonical HF-repo-relative path for a Vulcan shard.

Layout: h12_{H}/h11_{H}/bucket_{B}/run_{run_id}.parquet.

Parameters:
Returns:

str – Forward-slash-separated path inside the repo.

Return type:

str

stringforge.vulcan.schema.validate_cert_status(status)#

Validate a cert_status value against CERT_STATUSES.

Parameters:

status (str) – The candidate status string.

Raises:

ValueError – If status is not a recognised certification status.

Return type:

None

stringforge.vulcan.schema.validate_schema(df, *, warn_unknown=True)#

Validate that df satisfies the Vulcan schema.

Aggregates every problem into a single ValueError – callers see all schema violations, not just the first one. Only cheap structural checks are performed (column presence, run-id slug shape, homogeneity of geometry_id); no row-level numerical validation.

Parameters:

df (DataFrame) – Coerced DataFrame ready to be written as a shard.

Raises:

ValueError – If any structural invariant is violated. The message names every detected problem.

Return type:

None

stringforge.vulcan.schema.vault_floor_compatible(df)#

Test whether df carries every column required by the curated vault layer (stringforge.vacuavault.schema.REQUIRED_COLUMNS).

Used as a precondition for a future Vulcan-to-VacuaWriter promotion step.

Parameters:

df (DataFrame) – A coerced or freshly-loaded Vulcan shard.

Returns:
  • boolTrue if df is loadable by the vault tooling

  • after column projection.

Return type:

bool