stringforge.vulcan#
Vulcan: production vacuum-forging pipeline for StringForge.
Vulcan is the high-volume, cluster-side, append-only writer that
forges vacuum-search output into a HuggingFace dataset repo. It is
the production counterpart to
stringforge.vacua_writer.VacuaWriter – VacuaWriter
designates curated vacua into vacua_vault, Vulcan forges
production runs into a separate, ML-friendly repo.
Architecture summary:
Worker code calls
Vulcan.write()to stage validated parquet shards under{staging_dir}/pending/. No HuggingFace I/O.A separate sync tier (
Vulcan.sync(), thevulcan syncCLI, or a cron job on a head node) batches pending shards into onehuggingface_hub.HfApi.create_commit()call permax_batch-sized chunk (default 500 files per commit), respecting HF’s 100-commit-per-hour cap viaratelimit.
Vulcan#
Production vacuum-forging pipeline for StringForge. Vulcan is the
cluster-side counterpart to stringforge.vacua_writer.VacuaWriter:
where VacuaWriter designates curated vacua into the paper-aligned
vacua_vault repo, Vulcan forges high-volume production runs into a
separate, production HuggingFace repo with a uniform parquet schema and
geometry-disjoint VulcanMLView splits.
Worker code never touches HuggingFace. Vulcan.write() lands validated
parquet shards under {staging_dir}/pending/ together with a per-file
metadata sidecar. A separate sync tier (Vulcan.sync(), the
vulcan sync CLI, or a cron job on a head node) batches pending shards
into one huggingface_hub.HfApi.create_commit() call per
max_batch-sized chunk (default 500 files per commit), respecting
HuggingFace’s documented 100-commit-per-hour-per-repo cap
(see https://huggingface.co/docs/hub/repositories-recommendations) via the
rolling-window budget in ratelimit.
|
Per-repo handle for the production vacuum-forging pipeline. |
|
Read-side counterpart to |
|
Train/val/test view of a Vulcan source. |
Worker-side: stage vacua#
Vulcan.write()Vulcan.render_run_id()Vulcan.list_pending()Vulcan.remaining_budget()
Sync tier: HuggingFace commits#
Read path: query and fetch#
Vulcan.reader()Vulcan.query()Vulcan.fetch_run()VulcanReader.catalog()VulcanReader.query()VulcanReader.fetch_shard()
ML view: train / val / test splits#
Vulcan.ml_view()VulcanMLView.as_dataframe()VulcanMLView.as_hf_dataset()VulcanMLView.split_assignments()
Submodules#
Production parquet schema for the Vulcan tier. |
|
Run-ID template engine for Vulcan. |
|
Worker-side shard writer for Vulcan. |
|
Rolling-window commit-rate budget for the Vulcan sync tier. |
|
HuggingFace Hub I/O wrappers for Vulcan. |
|
Vulcan sync tier: batch staged shards into HuggingFace commits. |
|
Read path for Vulcan: catalogue scan, lazy shard fetch, and query. |
|
ML-friendly view of a Vulcan repo. |
Environment variables#
STRINGFORGE_VULCAN_REPO— target HuggingFace dataset repo. No package-level default; eachVulcanmust be told where to write.STRINGFORGE_VULCAN_STAGING_DIR— local staging root. When unset (and no explicitstaging_dir=is passed toVulcan),stringforge.vulcan.resolve_staging_dir()falls back to<repo_root>/vulcan_staging/inside a stringforge source checkout, or<stringforge.data_dir>/vulcan_staging/otherwise.STRINGFORGE_VULCAN_PROJECT— default{project}substitution for the run-id template.STRINGFORGE_VULCAN_BUDGET— commits-per-hour ceiling (default 90; the HuggingFace cap is 100).STRINGFORGE_VULCAN_TOKEN— HuggingFace write token. Falls back toHF_TOKEN.
Public setters in stringforge:
stringforge.set_vulcan_repo(),
stringforge.set_vulcan_staging_dir(),
stringforge.set_vulcan_project(),
stringforge.set_vulcan_budget(),
stringforge.set_vulcan_token().