reader

Read path for Vulcan: catalogue scan, lazy shard fetch, and query.

The reader builds a lightweight catalogue from parquet kv-metadata (see stringforge.vulcan.writer.PROVENANCE_METADATA_KEY) and caches loaded shards in a small LRU. Two backends are supported:

Local – the staging directory’s synced/ subtree on the current host. No network calls, no authentication. Used by the same workflow that produced the shards.
HuggingFace – the production repo published by stringforge.vulcan.Vulcan.sync(). Shard manifests are fetched on demand via huggingface_hub.HfApi.list_repo_files(), and individual parquet files via huggingface_hub.hf_hub_download().

The reader complements writer / sync: it traverses the same on-disk layout in the opposite direction.

stringforge.vulcan.reader.CATALOG_COLUMNS: Tuple[str, ...] = ('path_in_repo', 'run_id', 'geometry_id', 'h11', 'h12', 'ks_id', 'triang_id', 'conifold_id', 'cicy_id', 'n_rows', 'schema_version')

Columns produced by VulcanReader.catalog() (one row per shard). This is the read-side catalogue contract; downstream code can rely on these columns being present whenever the catalogue scan succeeds.

stringforge.vulcan.reader.DEFAULT_CACHE_SIZE: int = 32

Default LRU cache size, in shards.

class stringforge.vulcan.reader.VulcanReader(*, source, cache_size=32, token=None)

Bases: object

Read-side counterpart to stringforge.vulcan.Vulcan.

Builds a Vulcan-shard catalogue by scanning either a local synced/ tree or a HuggingFace dataset repo. Loaded shards are cached in a small in-memory LRU.

A typical local usage pattern:

reader = VulcanReader.from_local(forge.staging_dir)
cat    = reader.catalog()
susy   = reader.query(h12=2, solver_name="newton")

A HuggingFace usage pattern:

reader = VulcanReader.from_hf("user/vacua_forge")
run    = reader.fetch_run("2026-06-01-120000_fluxscan_...")

Parameters:

source (_Source) – Internal source tag; construct readers with the from_local() or from_hf() factories instead of passing this directly.
cache_size (int) – LRU capacity in shards.
token (Optional[str]) – HuggingFace read token; falls back to HF_TOKEN. Ignored for local sources.
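
The shard cache described above can be pictured with a minimal sketch. This is not the reader's actual implementation: ShardLRU, get_or_load, and the loader callable are hypothetical names introduced here, and the only behaviour taken from the text is that capacity is counted in shards and that the cache can be cleared.

```python
from collections import OrderedDict
from typing import Callable, Generic, TypeVar

V = TypeVar("V")


class ShardLRU(Generic[V]):
    """Hypothetical least-recently-used cache keyed by path_in_repo."""

    def __init__(self, capacity: int = 32) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items: "OrderedDict[str, V]" = OrderedDict()

    def get_or_load(self, key: str, loader: Callable[[str], V]) -> V:
        """Return the cached value, or call loader(key) and cache the result."""
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        value = loader(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry
        return value

    def clear(self) -> None:
        """Drop every cached entry."""
        self._items.clear()

    def __contains__(self, key: object) -> bool:
        return key in self._items

    def __len__(self) -> int:
        return len(self._items)
```

With a capacity of 32 shards, the 33rd distinct shard loaded evicts whichever shard was touched least recently; repeated reads of a hot shard never reach the loader.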

catalog(*, refresh=False, strict=False)

Return a one-row-per-shard catalogue of the configured source.

The first call scans every shard’s kv-metadata and caches the result; subsequent calls return the cached view. Pass refresh=True to force a rescan – useful after a sync() adds new shards upstream.

Shard metadata is read in parallel via a thread pool because the per-shard cost is dominated by I/O (opening the parquet file and reading its footer); ordering is irrelevant because the final DataFrame is sorted by path_in_repo.

Unreadable shards (corrupt file, missing footer, etc.) are skipped with a RuntimeWarning; the number of skipped shards is exposed on result.attrs["n_skipped"].
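
The scan described above (parallel footer reads, skipping unreadable shards with a warning, sorting by path_in_repo) might look roughly like the sketch below. The function scan_catalog, the read_footer callable, and the max_workers default are assumptions made for illustration; the real reader returns a DataFrame and stores the skip count in result.attrs["n_skipped"], whereas this sketch returns plain dictionaries and a count so that it needs only the standard library.

```python
import warnings
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Tuple


def scan_catalog(
    paths: Iterable[str],
    read_footer: Callable[[str], Dict[str, object]],
    max_workers: int = 8,
) -> Tuple[List[Dict[str, object]], int]:
    """Read one metadata record per shard in parallel.

    read_footer is a hypothetical callable that returns the catalogue fields
    for one shard and raises an exception when the shard is unreadable.
    """
    path_list = list(paths)

    def _safe_read(path: str):
        # Capture failures instead of letting one bad shard abort the scan.
        try:
            record = dict(read_footer(path))
        except Exception as exc:
            return path, None, exc
        record["path_in_repo"] = path
        return path, record, None

    # Threads suit this job because each task waits on file or network I/O.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(_safe_read, path_list))

    rows: List[Dict[str, object]] = []
    n_skipped = 0
    for path, record, exc in results:
        if record is None:
            warnings.warn(f"skipping unreadable shard {path}: {exc}", RuntimeWarning)
            n_skipped += 1
        else:
            rows.append(record)

    # Completion order does not matter: the result is sorted by path.
    rows.sort(key=lambda r: str(r["path_in_repo"]))
    return rows, n_skipped
```

Warnings are raised from the calling thread after the pool has finished, which keeps them visible to an ordinary warnings filter.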

Shards whose embedded schema_version differs from the current stringforge.vulcan.schema.SCHEMA_VERSION emit a RuntimeWarning per shard. When strict=True such a mismatch is promoted to ValueError.
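
A minimal sketch of the warn-or-raise behaviour for schema mismatches follows. The helper check_schema_versions and the placeholder SCHEMA_VERSION value are assumptions; the real constant lives in stringforge.vulcan.schema and its type is not stated here.

```python
import warnings
from typing import Dict, Iterable

# Placeholder value for illustration only; the real constant is
# stringforge.vulcan.schema.SCHEMA_VERSION.
SCHEMA_VERSION = "1"


def check_schema_versions(
    rows: Iterable[Dict[str, object]],
    expected: object = SCHEMA_VERSION,
    strict: bool = False,
) -> int:
    """Warn once per mismatching shard, or raise when strict is True.

    Returns the number of mismatching shards seen (always 0 or an
    exception in strict mode).
    """
    mismatches = 0
    for row in rows:
        found = row.get("schema_version")
        if found == expected:
            continue
        message = (
            f"shard {row['path_in_repo']} has schema_version {found!r}, "
            f"expected {expected!r}"
        )
        if strict:
            raise ValueError(message)
        warnings.warn(message, RuntimeWarning)
        mismatches += 1
    return mismatches
```

Strict mode suits pipelines that must not mix schema generations silently; the default suits exploratory reads where older shards are still useful.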

Parameters:

refresh (bool) – Force a fresh scan even if a cached catalogue exists.
strict (bool) – When True, raise ValueError on any shard whose schema_version differs from stringforge.vulcan.schema.SCHEMA_VERSION.

Returns:

pd.DataFrame – One row per shard with the columns listed in CATALOG_COLUMNS.

Raises:

ValueError – When strict=True and a shard’s schema_version does not match stringforge.vulcan.schema.SCHEMA_VERSION.

Return type:

DataFrame

clear_cache()

Drop every cached shard from the in-memory LRU.

Return type:

None

fetch_run(run_id)

Load every row produced by a specific run_id.

Parameters:

run_id (str) – The identifier carried on each row of the target shards.

Returns:

pd.DataFrame – Concatenated rows. Empty when no shard matches.

Return type:

DataFrame

fetch_shard(path_in_repo)

Load (and cache) a single shard by its HF-repo-relative path.

Parameters:

path_in_repo (str) – The path produced by stringforge.vulcan.schema.shard_relpath().

Returns:

pd.DataFrame – The shard’s rows.

Raises:

FileNotFoundError – When the shard is absent both locally and (where applicable) on the Hub.

Return type:

DataFrame
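
For the local backend, resolving a shard path and raising FileNotFoundError could be sketched as below. The helper resolve_local_shard is hypothetical, and the assumption that a shard sits at staging_dir/synced/path_in_repo is inferred from the description of the synced/ subtree rather than confirmed by it.

```python
from pathlib import Path
from typing import Union


def resolve_local_shard(staging_dir: Union[str, Path], path_in_repo: str) -> Path:
    """Map a repo-relative shard path onto the local synced/ tree.

    Assumes the layout staging_dir/synced/<path_in_repo>.
    """
    candidate = Path(staging_dir) / "synced" / path_in_repo
    if not candidate.is_file():
        raise FileNotFoundError(f"shard not found under synced/: {path_in_repo}")
    return candidate
```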

classmethod from_hf(repo, *, revision=None, cache_size=32, token=None)

Construct a reader that streams shards from a HuggingFace dataset repo on demand.

Parameters:

repo (str) – "user/repo" on the Hub.
revision (Optional[str]) – Optional branch, tag, or commit SHA to pin reads to. Pass a frozen-snapshot tag (e.g. "v2026.06", produced by stringforge.vulcan.snapshot.freeze_snapshot()) to read an immutable, citeable revision; None reads the rolling default branch.
cache_size (int) – LRU capacity.
token (Optional[str]) – HuggingFace read token; falls back to HF_TOKEN.

Returns:

VulcanReader – A reader bound to the remote repo (and revision).

Return type:

VulcanReader
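
How the revision and token arguments of from_hf might be forwarded when a single shard is fetched can be sketched as follows. The wrapper download_shard and its injectable download parameter are hypothetical; hf_hub_download and its repo_id, filename, repo_type, revision, and token keywords are the real huggingface_hub API, and the HF_TOKEN fallback follows the parameter description above. Whether the reader forwards them exactly this way is an assumption.

```python
import os
from typing import Callable, Optional


def download_shard(
    repo: str,
    path_in_repo: str,
    *,
    revision: Optional[str] = None,
    token: Optional[str] = None,
    download: Optional[Callable[..., str]] = None,
) -> str:
    """Fetch one parquet shard from a dataset repo and return its local path.

    download is injectable so the call can be exercised without network
    access; by default it is huggingface_hub.hf_hub_download.
    """
    # An explicit token wins; otherwise fall back to the HF_TOKEN variable.
    token = token or os.environ.get("HF_TOKEN")
    if download is None:
        # Imported lazily so that local-only use does not need the package.
        from huggingface_hub import hf_hub_download

        download = hf_hub_download
    return download(
        repo_id=repo,
        filename=path_in_repo,
        repo_type="dataset",
        revision=revision,  # None means the default branch
        token=token,
    )
```

Pinning revision to a tag or commit SHA makes repeated reads return the same files even after later syncs, which is what makes a frozen snapshot citeable.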

classmethod from_local(staging_dir, *, cache_size=32)

Construct a reader that scans staging_dir/synced/ for already-committed shards.

Parameters:

staging_dir (Union[str, Path]) – Vulcan staging root.
cache_size (int) – LRU capacity.

Returns:

VulcanReader – A reader bound to the local synced tree.

Return type:

VulcanReader

query(*, run_id=None, geometry_id=None, h11=None, h12=None, ks_id=None, triang_id=None, solver_name=None, is_susy=None, is_isd=None, columns=None)

Filter Vulcan rows by catalogue + row-level predicates.

Catalogue predicates (h11, h12, ks_id, triang_id, run_id, geometry_id) narrow the set of shards that need to be read. Row-level predicates (solver_name, is_susy, is_isd) apply post-load. columns projects the result down to a subset of columns before returning.
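
The two-stage filtering described above can be illustrated with plain dictionaries standing in for DataFrames. The function two_stage_query, the load_shard callable, and the row layout are assumptions for illustration; only the split between catalogue predicates and row-level predicates comes from the text.

```python
from typing import Callable, Dict, Iterable, List, Optional

CATALOG_KEYS = ("run_id", "geometry_id", "h11", "h12", "ks_id", "triang_id")
ROW_KEYS = ("solver_name", "is_susy", "is_isd")

Row = Dict[str, object]


def _matches(record: Row, predicates: Row) -> bool:
    return all(record.get(key) == value for key, value in predicates.items())


def two_stage_query(
    catalog: Iterable[Row],
    load_shard: Callable[[str], Iterable[Row]],
    *,
    columns: Optional[Iterable[str]] = None,
    **predicates: object,
) -> List[Row]:
    """Prune shards with catalogue predicates, then filter rows after loading."""
    unknown = set(predicates) - set(CATALOG_KEYS) - set(ROW_KEYS)
    if unknown:
        raise TypeError(f"unknown predicates: {sorted(unknown)}")

    # None means "no constraint", mirroring the keyword defaults of query().
    active = {k: v for k, v in predicates.items() if v is not None}
    shard_filter = {k: v for k, v in active.items() if k in CATALOG_KEYS}
    row_filter = {k: v for k, v in active.items() if k in ROW_KEYS}
    wanted = list(columns) if columns is not None else None

    result: List[Row] = []
    for entry in catalog:
        if not _matches(entry, shard_filter):
            continue  # shard is never read
        for row in load_shard(str(entry["path_in_repo"])):
            if not _matches(row, row_filter):
                continue
            # Projection happens last, after the row-level filters.
            result.append(dict(row) if wanted is None else {c: row[c] for c in wanted})
    return result
```

The benefit of the split is that catalogue predicates cost nothing beyond the cached catalogue, while row-level predicates require each surviving shard to be loaded.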

h11 and h12 are interpreted in the mirror (jaxvacua / lcs_tree) convention, matching stringforge.lcs.LCSDatabase.

Parameters:

run_id (Optional[str]) – Exact match.
geometry_id (Optional[str]) – Exact match.
h11 (Optional[int]) – Exact match per geometry key.
h12 (Optional[int]) – Exact match per geometry key.
ks_id (Optional[int]) – Exact match per geometry key.
triang_id (Optional[int]) – Exact match per geometry key.
solver_name (Optional[str]) – Row-level filter on the solver_name column.
is_susy (Optional[bool]) – Row-level filter on the is_susy column.
is_isd (Optional[bool]) – Row-level filter on the is_isd column.
columns (Optional[Iterable[str]]) – Optional column projection.

Returns:

pd.DataFrame – Concatenated rows from matching shards. Empty when no shard satisfies the catalogue predicates.

Return type:

DataFrame

read_columns(path_in_repo, columns)

Read only the named columns from one shard, skipping any that the shard does not carry.

A columnar read used by snapshot manifest generation: tallying verifier_id / cert_status over a large dataset must not pay the cost of materialising every column. This method does not populate the LRU, because it returns a column projection rather than the full shard.
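
The "skip missing columns" step can be sketched independently of the parquet library. The helper select_present_columns is hypothetical. With pyarrow, the available names could come from pyarrow.parquet.read_schema(path).names and the result could be passed as columns= to pyarrow.parquet.read_table; that pairing is an assumption about the backend, not something stated here.

```python
from typing import Iterable, List


def select_present_columns(requested: Iterable[str], available: Iterable[str]) -> List[str]:
    """Keep the requested names that the shard carries, in request order."""
    available_set = set(available)
    seen = set()
    kept: List[str] = []
    for name in requested:
        if name in available_set and name not in seen:
            kept.append(name)
            seen.add(name)  # ignore duplicate requests
    return kept
```

Reading the schema first keeps the projection cheap: only the footer is inspected before the column data for the surviving names is fetched.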

Parameters:

path_in_repo (str) – The shard’s repo-relative path.
columns (Iterable[str]) – Column names to read; missing ones are skipped.

Returns:

pd.DataFrame – The projection (empty-column frame if none of the requested columns are present).

Return type:

DataFrame

property revision: str | None

The pinned HuggingFace revision, or None for local / default-branch reads.
