stringforge.vulcan

stringforge.vulcan#

Vulcan: production vacuum-forging pipeline for StringForge.

Vulcan is the high-volume, cluster-side, append-only writer that forges vacuum-search output into a HuggingFace dataset repo. It is the production counterpart to stringforge.vacua_writer.VacuaWriter – VacuaWriter designates curated vacua into vacua_vault, Vulcan forges production runs into a separate, ML-friendly repo.

Architecture summary:

Worker code calls Vulcan.write() to stage validated parquet shards under {staging_dir}/pending/. No HuggingFace I/O.
A separate sync tier (Vulcan.sync(), the vulcan sync CLI, or a cron job on a head node) batches pending shards into one huggingface_hub.HfApi.create_commit() call per max_batch-sized chunk (default 500 files per commit), respecting HF’s 100-commit-per-hour cap via ratelimit.

Vulcan#

Production vacuum-forging pipeline for StringForge. Vulcan is the cluster-side counterpart to stringforge.vacua_writer.VacuaWriter: where VacuaWriter designates curated vacua into the paper-aligned vacua_vault repo, Vulcan forges high-volume production runs into a separate, production HuggingFace repo with a uniform parquet schema and geometry-disjoint VulcanMLView splits.

Worker code never touches HuggingFace. Vulcan.write() lands validated parquet shards under {staging_dir}/pending/ together with a per-file metadata sidecar. A separate sync tier (Vulcan.sync(), the vulcan sync CLI, or a cron job on a head node) batches pending shards into one huggingface_hub.HfApi.create_commit() call per max_batch-sized chunk (default 500 files per commit), respecting HuggingFace’s documented 100-commit-per-hour-per-repo cap (see https://huggingface.co/docs/hub/repositories-recommendations) via the rolling-window budget in ratelimit.

`Vulcan`(*, repo[, staging_dir, ...])	Per-repo handle for the production vacuum-forging pipeline.
`VulcanReader`(*, source[, cache_size, token])	Read-side counterpart to `stringforge.vulcan.Vulcan`.
`VulcanMLView`(reader, *[, fractions, splits, ...])	Train/val/test view of a Vulcan source.

Worker-side: stage vacua#

Vulcan.write()
Vulcan.render_run_id()
Vulcan.list_pending()
Vulcan.remaining_budget()

Sync tier: HuggingFace commits#

Vulcan.sync()
stringforge.vulcan.sync.sync_pending()
stringforge.vulcan.sync.remaining_budget()

Read path: query and fetch#

Vulcan.reader()
Vulcan.query()
Vulcan.fetch_run()
VulcanReader.catalog()
VulcanReader.query()
VulcanReader.fetch_shard()

ML view: train / val / test splits#

Vulcan.ml_view()
VulcanMLView.as_dataframe()
VulcanMLView.as_hf_dataset()
VulcanMLView.split_assignments()
stringforge.vulcan.ml_view.FeatureSpec
stringforge.vulcan.ml_view.assign_split()

Submodules#

`schema`	Production parquet schema for the Vulcan tier.
`runid`	Run-ID template engine for Vulcan.
`writer`	Worker-side shard writer for Vulcan.
`ratelimit`	Rolling-window commit-rate budget for the Vulcan sync tier.
`hf_io`	HuggingFace Hub I/O wrappers for Vulcan.
`sync`	Vulcan sync tier: batch staged shards into HuggingFace commits.
`reader`	Read path for Vulcan: catalogue scan, lazy shard fetch, and query.
`ml_view`	ML-friendly view of a Vulcan repo.

Environment variables#

STRINGFORGE_VULCAN_REPO — target HuggingFace dataset repo. No package-level default; each Vulcan must be told where to write.
STRINGFORGE_VULCAN_STAGING_DIR — local staging root. When unset (and no explicit staging_dir= is passed to Vulcan), stringforge.vulcan.resolve_staging_dir() falls back to <repo_root>/vulcan_staging/ inside a stringforge source checkout, or <stringforge.data_dir>/vulcan_staging/ otherwise.
STRINGFORGE_VULCAN_PROJECT — default {project} substitution for the run-id template.
STRINGFORGE_VULCAN_BUDGET — commits-per-hour ceiling (default 90; the HuggingFace cap is 100).
STRINGFORGE_VULCAN_TOKEN — HuggingFace write token. Falls back to HF_TOKEN.

Public setters in stringforge: stringforge.set_vulcan_repo(), stringforge.set_vulcan_staging_dir(), stringforge.set_vulcan_project(), stringforge.set_vulcan_budget(), stringforge.set_vulcan_token().