reader
Read path for Vulcan: catalogue scan, lazy shard fetch, and query.
The reader builds a lightweight catalogue from parquet kv-metadata
(see stringforge.vulcan.writer.PROVENANCE_METADATA_KEY) and
caches loaded shards in a small LRU. Two backends are supported:
- Local – the staging directory’s synced/ subtree on the current host. No network calls, no authentication. Used by the same workflow that produced the shards.
- HuggingFace – the production repo published by stringforge.vulcan.Vulcan.sync(). Shard manifests are fetched on demand via huggingface_hub.HfApi.list_repo_files() and individual parquet files via huggingface_hub.hf_hub_download().
The reader complements writer /
sync – it is the inverse traversal across
the same on-disk layout.
- stringforge.vulcan.reader.CATALOG_COLUMNS: Tuple[str, ...] = ('path_in_repo', 'run_id', 'geometry_id', 'h11', 'h12', 'ks_id', 'triang_id', 'conifold_id', 'cicy_id', 'n_rows', 'schema_version')
Columns produced by VulcanReader.catalog() (one row per shard). This is the read-side catalogue contract; downstream code can rely on these columns being present whenever the catalogue scan succeeds.
- class stringforge.vulcan.reader.VulcanReader(*, source, cache_size=32, token=None)
Bases: object
Read-side counterpart to stringforge.vulcan.Vulcan.
Builds a Vulcan-shard catalogue by scanning either a local synced/ tree or a HuggingFace dataset repo. Loaded shards are cached in a small in-memory LRU.
A typical local usage pattern:
    reader = VulcanReader.from_local(forge.staging_dir)
    cat = reader.catalog()
    susy = reader.query(h12=2, solver_name="newton")
A HuggingFace usage pattern:
    reader = VulcanReader.from_hf("user/vacua_forge")
    run = reader.fetch_run("2026-06-01-120000_fluxscan_...")
- Parameters:
- catalog(*, refresh=False, strict=False)
Return a one-row-per-shard catalogue of the configured source.
The first call scans every shard’s kv-metadata and caches the result; subsequent calls return the cached view. Pass refresh=True to force a rescan – useful after a sync() adds new shards upstream.
Shard metadata is read in parallel via a thread pool because the per-shard cost is dominated by I/O (opening the parquet file and reading its footer); ordering is irrelevant because the final DataFrame is sorted by path_in_repo.
Unreadable shards (corrupt file, missing footer, etc.) are skipped with a RuntimeWarning; the number of skipped shards is exposed on result.attrs["n_skipped"].
Shards whose embedded schema_version differs from the current stringforge.vulcan.schema.SCHEMA_VERSION emit a RuntimeWarning per shard. When strict=True such a mismatch is promoted to ValueError.
- Parameters:
refresh (bool) – Force a fresh scan even if a cached catalogue exists.
strict (bool) – When True, raise ValueError on any shard whose schema_version differs from stringforge.vulcan.schema.SCHEMA_VERSION.
- Returns:
pd.DataFrame – One row per shard with columns from CATALOG_COLUMNS.
- Raises:
ValueError – When strict=True and a shard’s schema_version does not match stringforge.vulcan.schema.SCHEMA_VERSION.
- Return type:
DataFrame
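The scan behaviour described for catalog() can be sketched with the standard library alone. This is an illustrative sketch, not the library's implementation: read_footer, scan_catalog, CURRENT_SCHEMA_VERSION, and the footers mapping are hypothetical stand-ins for reading parquet kv-metadata, and plain dicts replace the DataFrame. It shows footers read in a thread pool, unreadable shards skipped with a RuntimeWarning and counted, schema mismatches warned about or promoted to ValueError under strict, and the result sorted by path_in_repo so that worker completion order does not matter.

```python
import warnings
from concurrent.futures import ThreadPoolExecutor

CURRENT_SCHEMA_VERSION = "1"  # hypothetical stand-in for schema.SCHEMA_VERSION


def read_footer(path, footers):
    """Hypothetical stand-in for opening a parquet file and reading its kv-metadata."""
    meta = footers[path]
    if meta is None:
        raise OSError(f"unreadable footer: {path}")
    return {"path_in_repo": path, **meta}


def scan_catalog(footers, *, strict=False, max_workers=8):
    """Scan all shards in parallel; return (rows sorted by path_in_repo, n_skipped)."""

    def safe_read(path):
        # Workers only capture failures; warnings are issued on the calling thread.
        try:
            return path, read_footer(path, footers), None
        except OSError as exc:
            return path, None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe_read, footers))

    rows, n_skipped = [], 0
    for path, row, exc in results:
        if exc is not None:
            warnings.warn(f"skipping {path}: {exc}", RuntimeWarning)
            n_skipped += 1
            continue
        if row["schema_version"] != CURRENT_SCHEMA_VERSION:
            msg = f"{path}: schema_version {row['schema_version']!r} differs"
            if strict:
                raise ValueError(msg)
            warnings.warn(msg, RuntimeWarning)
        rows.append(row)

    # Sorting makes the output independent of thread scheduling.
    rows.sort(key=lambda r: r["path_in_repo"])
    return rows, n_skipped
```

Issuing the warnings from the calling thread rather than from the workers is a design choice of this sketch; it keeps warning capture deterministic in tests.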
- fetch_run(run_id)
Load every row produced by a specific run_id.
- Parameters:
run_id (str) – The identifier carried on each row of the target shards.
- Returns:
pd.DataFrame – Concatenated rows. Empty when no shard matches.
- Return type:
DataFrame
- fetch_shard(path_in_repo)
Load (and cache) a single shard by its HF-repo-relative path.
- Parameters:
path_in_repo (str) – The path produced by stringforge.vulcan.schema.shard_relpath().
- Returns:
pd.DataFrame – The shard’s rows.
- Raises:
FileNotFoundError – When the shard is absent both locally and (where applicable) on the Hub.
- Return type:
DataFrame
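The load-and-cache behaviour of fetch_shard() can be pictured as a small LRU keyed by path_in_repo. The sketch below is an assumption about the general technique, not the reader's actual cache: ShardLRU and its loader callable are hypothetical names, and the loader stands in for whatever reads a shard from disk or the Hub.

```python
from collections import OrderedDict


class ShardLRU:
    """Tiny LRU keyed by path_in_repo; `loader` is a hypothetical shard-reading callable."""

    def __init__(self, loader, capacity=32):
        self._loader = loader
        self._capacity = capacity
        self._items = OrderedDict()

    def get(self, path_in_repo):
        if path_in_repo in self._items:
            # Cache hit: mark the entry as most recently used.
            self._items.move_to_end(path_in_repo)
            return self._items[path_in_repo]
        # Cache miss: the loader may raise FileNotFoundError, which propagates.
        shard = self._loader(path_in_repo)
        self._items[path_in_repo] = shard
        if len(self._items) > self._capacity:
            # Evict the least recently used entry.
            self._items.popitem(last=False)
        return shard
```

With capacity=2, reading shards a, b, a, c evicts b rather than a, because the second read of a refreshed its position.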
- classmethod from_hf(repo, *, revision=None, cache_size=32, token=None)
Construct a reader that streams shards from a HuggingFace dataset repo on demand.
- Parameters:
repo (str) – "user/repo" on the Hub.
revision (Optional[str]) – Optional branch, tag, or commit SHA to pin reads to. Pass a frozen-snapshot tag (e.g. "v2026.06", produced by stringforge.vulcan.snapshot.freeze_snapshot()) to read an immutable, citeable revision; None reads the rolling default branch.
cache_size (int) – LRU capacity.
token (Optional[str]) – HuggingFace read token; falls back to HF_TOKEN.
- Returns:
VulcanReader – A reader bound to the remote repo (and revision).
- Return type:
VulcanReader
- classmethod from_local(staging_dir, *, cache_size=32)
Construct a reader that scans staging_dir/synced/ for already-committed shards.
- Parameters:
- Returns:
VulcanReader – A reader bound to the local synced tree.
- Return type:
VulcanReader
- query(*, run_id=None, geometry_id=None, h11=None, h12=None, ks_id=None, triang_id=None, solver_name=None, is_susy=None, is_isd=None, columns=None)
Filter Vulcan rows by catalogue + row-level predicates.
Catalogue predicates (h11, h12, ks_id, triang_id, run_id, geometry_id) narrow the set of shards that need to be read. Row-level predicates (solver_name, is_susy, is_isd) apply post-load. columns projects the result down to a subset of columns before returning.
h11, h12 are interpreted in mirror (jaxvacua / lcs_tree) convention, matching stringforge.lcs.LCSDatabase.
- Parameters:
- Returns:
pd.DataFrame – Concatenated rows from matching shards. Empty when no shard satisfies the catalogue predicates.
- Return type:
DataFrame
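The two-stage filtering described for query() can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: query_sketch, load_shard, catalog_preds, and row_preds are hypothetical names, lists of dicts stand in for DataFrames, and the projection assumes every requested column is present on every row.

```python
def query_sketch(catalog, load_shard, *, catalog_preds=None, row_preds=None, columns=None):
    """catalog: one dict per shard; load_shard: path_in_repo -> list of row dicts."""
    catalog_preds = catalog_preds or {}
    row_preds = row_preds or {}

    # Stage 1: catalogue predicates decide which shards are opened at all.
    paths = [
        entry["path_in_repo"]
        for entry in catalog
        if all(entry.get(key) == value for key, value in catalog_preds.items())
    ]

    # Stage 2: row-level predicates run only on rows of the selected shards.
    rows = [
        row
        for path in paths
        for row in load_shard(path)
        if all(row.get(key) == value for key, value in row_preds.items())
    ]

    # Projection happens last, just before returning.
    if columns is not None:
        rows = [{name: row[name] for name in columns} for row in rows]
    return rows
```

Because shards that fail the catalogue predicates are never loaded, a selective catalogue predicate such as h12 keeps the amount of I/O proportional to the number of matching shards rather than to the size of the repository.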
- read_columns(path_in_repo, columns)
Read only the named columns from one shard, skipping any that the shard does not carry.
A columnar read used by snapshot manifest generation: tallying verifier_id/cert_status over a large dataset must not pay the cost of materialising every column. Does not populate the LRU (it returns a column projection, not the full shard).
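The projection semantics and the tallying use case can be illustrated with a small sketch. It is only a model of the behaviour described above: project and tally are hypothetical helpers, and each shard is represented as a dict mapping column names to lists of values rather than as a parquet file.

```python
from collections import Counter


def project(shard, columns):
    """Keep only the requested columns that this shard actually carries."""
    return {name: shard[name] for name in columns if name in shard}


def tally(shards, column):
    """Count the values of one column across shards, ignoring shards that lack it."""
    counts = Counter()
    for shard in shards:
        # Only the single requested column is touched; other columns are never read.
        counts.update(project(shard, [column]).get(column, []))
    return counts
```

A shard without the requested column contributes nothing to the tally instead of raising, which mirrors the "skipping any that the shard does not carry" rule.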