stringforge.vacuavault#
stringforge.vacuavault — tooling for the HuggingFace vacua_vault
dataset repository.
This subpackage provides three CLI-driven operations for maintaining the remote vault repo:
validate (on PR branch): parse the git diff, re-validate added parquet files, and post a catalog preview as a PR comment. Implemented in ci.
rebuild_catalog (post-merge on main): scan the repo for all parquet files and regenerate catalog.parquet from their metadata. Implemented in catalog.
curate (maintainer operation): promote a community submission out of community/ up to the model-level directory, stripping the {hf_username}_ prefix and preserving attribution. Implemented in curate.
All three are dispatched by __main__ via argparse subcommands.
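As a rough illustration of subcommand dispatch, the sketch below wires the three operations into an argparse parser. It is not the package's actual `__main__`: the flag names `--base-branch` and `--repo-path` and the positional submission path are taken from the CLI examples in this page, while the defaults, the `submission` argument name, and the string-returning handlers are assumptions made so the sketch runs on its own.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subcommand per vault operation."""
    parser = argparse.ArgumentParser(prog="python -m stringforge.vacuavault")
    sub = parser.add_subparsers(dest="command", required=True)

    validate = sub.add_parser("validate", help="re-validate parquet files added on a PR branch")
    validate.add_argument("--base-branch", default="main")  # default is an assumption

    rebuild = sub.add_parser("rebuild_catalog", help="regenerate catalog.parquet")
    rebuild.add_argument("--repo-path", default=".")  # default is an assumption

    curate = sub.add_parser("curate", help="promote a community submission")
    curate.add_argument("submission")  # hypothetical argument name

    return parser


def main(argv=None) -> str:
    """Parse argv and route to the matching handler."""
    args = build_parser().parse_args(argv)
    # Hypothetical stand-in handlers; the real dispatcher would presumably call
    # validate_pr_diff(), rebuild_catalog(), and curate_submission() instead.
    handlers = {
        "validate": lambda a: f"validate against {a.base_branch}",
        "rebuild_catalog": lambda a: f"rebuild catalog in {a.repo_path}",
        "curate": lambda a: f"curate {a.submission}",
    }
    return handlers[args.command](args)
```

Keeping the handlers in a dictionary keyed by subcommand name means adding a fourth operation only touches the parser setup and one table entry.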
Module layout:
stringforge/vacuavault/
├── __init__.py — this file (re-exports public entry points)
├── __main__.py — CLI dispatcher
├── ci.py — validate_pr_diff()
├── catalog.py — rebuild_catalog()
├── curate.py — curate_submission()
└── schema.py — SCHEMA constants + _validate_parquet()
Server-side tooling for the HuggingFace vacua_vault dataset
repository. This subpackage holds the schema definitions,
validators, and CI helpers that govern community contributions to
the public vacua datasets. It has no downstream-package
dependencies — the moduli-stabilisation, identity-hashing, and
auto-load logic live in jaxvacua (or any sibling package
that consumes the vault).
The corresponding HuggingFace dataset URL is
aschachner/vacua_vault. The Python module name (with no
underscore) is intentionally distinct from the on-disk vault
folder name (vacua_vault/) to avoid PEP 420 namespace-package
shadowing.
Schema constants#
Module-level constants and regexes that define the parquet layout and filename conventions.
SCHEMA_VERSION
RESERVED_NAMES
LABEL_SLUG_RE
Validation#
The user-facing validator runs schema, identity, and (optional)
physics checks against a single parquet file. Pure dependency
injection: callers wanting physics validation pass db= and
model_hash_fn= themselves; stringforge.vacuavault never
imports a downstream package.
validate_parquet_file()
split_by_validation()
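The dependency-injection idea described in this section can be sketched with a toy validator. This is not the implementation of validate_parquet_file(): the function name `validate_with_optional_physics`, the record fields `model_id` and `model_hash`, and the dict-like `db` are all hypothetical. Only the `db=` / `model_hash_fn=` keyword names, the `physics_checks` values "off" and "auto", and the `passed` / `errors` result keys come from this page.

```python
from typing import Any, Callable, Optional


def validate_with_optional_physics(
    record: dict,
    db: Optional[Any] = None,
    model_hash_fn: Optional[Callable[[Any], str]] = None,
    physics_checks: str = "auto",
) -> dict:
    """Toy validator: schema checks always run, physics checks only if injected."""
    errors = []

    # Schema-level check (hypothetical rule: a 'model_hash' field must be present).
    if "model_hash" not in record:
        errors.append("missing field: model_hash")

    # The physics check uses only the injected collaborators; nothing
    # downstream is imported by this function.
    have_physics = db is not None and model_hash_fn is not None
    if physics_checks != "off" and have_physics and not errors:
        model = db.get(record.get("model_id"))
        if model is None or model_hash_fn(model) != record["model_hash"]:
            errors.append("model_hash does not match the database model")

    return {"passed": not errors, "errors": errors}
```

With no `db` or `model_hash_fn` supplied, the call degrades to a schema-only check, which is the behaviour this section attributes to the default configuration.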
Server-side CI helpers#
Run via python -m stringforge.vacuavault {validate | rebuild_catalog | curate}
on the HF dataset repo. Schema-only by default; physics-aware
variants live in downstream-package wrappers (e.g. a
jaxvacua-vault CLI in jaxvacua that injects an
LCSDatabase-backed model loader).
validate_pr_diff()
rebuild_catalog()
curate_submission()
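To make the "parse the git diff" step of the validate operation concrete, here is a minimal sketch that picks out newly added parquet files from `git diff --name-status` output. The helper names `added_parquet_files` and `diff_against` are hypothetical, and the real validate_pr_diff() may collect the diff differently; the sketch only assumes git's standard tab-separated name-status format.

```python
import subprocess


def added_parquet_files(name_status: str) -> list:
    """Return paths of newly added .parquet files from `git diff --name-status` output."""
    added = []
    for line in name_status.splitlines():
        parts = line.split("\t")
        # Added files appear as "A<TAB>path"; renames and copies carry three fields.
        if len(parts) == 2 and parts[0] == "A" and parts[1].endswith(".parquet"):
            added.append(parts[1])
    return added


def diff_against(base_branch: str = "main") -> str:
    """Run git to list changes between the merge base with base_branch and HEAD."""
    out = subprocess.run(
        ["git", "diff", "--name-status", f"{base_branch}...HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return out.stdout
```

Separating the parsing from the subprocess call keeps the parser testable on a plain string, without a git checkout.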
Typical usage#
Schema-only validation from Python:
from stringforge import vacuavault as vv

result = vv.validate_parquet_file(
    "tdf/h12_2/ks_29_tri_0/SUSY_Nmax34.parquet",
    physics_checks="off",
)
assert result["passed"], result["errors"]
Physics-aware validation (caller supplies the database):
from stringforge import vacuavault as vv
from stringforge.lcs_database import LCSDatabase
from stringforge.vacua_writer import _compute_model_hash

# Same file as in the schema-only example
path = "tdf/h12_2/ks_29_tri_0/SUSY_Nmax34.parquet"

db = LCSDatabase(dataset="tdf")
result = vv.validate_parquet_file(
    path,
    db=db,
    model_hash_fn=_compute_model_hash,
    physics_checks="auto",
)
CLI (server-side, run from the HF dataset repo root):
python -m stringforge.vacuavault validate --base-branch main
python -m stringforge.vacuavault rebuild_catalog --repo-path .
python -m stringforge.vacuavault curate community/alice_dS_v2.parquet
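The path rewrite behind the curate command can be sketched as follows. This is an assumption-laden illustration, not curate_submission() itself: the helper `promoted_path` is hypothetical, the username is passed explicitly because inferring it by splitting the filename on the first underscore could be ambiguous, and the attribution-preserving step is not shown.

```python
from pathlib import PurePosixPath


def promoted_path(submission: str, hf_username: str) -> PurePosixPath:
    """Compute where a community submission would land after promotion."""
    path = PurePosixPath(submission)
    if path.parent.name != "community":
        raise ValueError(f"not inside a community/ directory: {submission}")
    prefix = f"{hf_username}_"
    if not path.name.startswith(prefix):
        raise ValueError(f"filename lacks the {prefix!r} prefix: {submission}")
    # Drop the community/ level and strip the username prefix from the filename.
    return path.parent.parent / path.name[len(prefix):]
```

Under these assumptions, `promoted_path("community/alice_dS_v2.parquet", "alice")` yields `dS_v2.parquet` one level above `community/`.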
See also#
stringforge.cy_io: the geometry-database I/O layer that vacuavault reads schema constants from.
The dataset card at vacua_vault/vacua_vault_dataset_card.md, for the public-facing schema and contribution workflow.