stringforge.vulcan#

Vulcan: production vacuum-forging pipeline for StringForge.

Vulcan is the high-volume, cluster-side, append-only writer that forges vacuum-search output into a HuggingFace dataset repo. It is the production counterpart to stringforge.vacua_writer.VacuaWriter – VacuaWriter designates curated vacua into vacua_vault, Vulcan forges production runs into a separate, ML-friendly repo.

Architecture summary:

  • Worker code calls Vulcan.write() to stage validated parquet shards under {staging_dir}/pending/. No HuggingFace I/O.

  • A separate sync tier (Vulcan.sync(), the vulcan sync CLI, or a cron job on a head node) batches pending shards into one huggingface_hub.HfApi.create_commit() call per max_batch-sized chunk (default 500 files per commit), respecting HF’s 100-commit-per-hour cap via ratelimit.

Vulcan#

Production vacuum-forging pipeline for StringForge. Vulcan is the cluster-side counterpart to stringforge.vacua_writer.VacuaWriter: where VacuaWriter designates curated vacua into the paper-aligned vacua_vault repo, Vulcan forges high-volume production runs into a separate, production HuggingFace repo with a uniform parquet schema and geometry-disjoint VulcanMLView splits.

Worker code never touches HuggingFace. Vulcan.write() lands validated parquet shards under {staging_dir}/pending/ together with a per-file metadata sidecar. A separate sync tier (Vulcan.sync(), the vulcan sync CLI, or a cron job on a head node) batches pending shards into one huggingface_hub.HfApi.create_commit() call per max_batch-sized chunk (default 500 files per commit), respecting HuggingFace’s documented 100-commit-per-hour-per-repo cap (see https://huggingface.co/docs/hub/repositories-recommendations) via the rolling-window budget in ratelimit.

Vulcan(*, repo[, staging_dir, ...])

Per-repo handle for the production vacuum-forging pipeline.

VulcanReader(*, source[, cache_size, token])

Read-side counterpart to stringforge.vulcan.Vulcan.

VulcanMLView(reader, *[, fractions, splits, ...])

Train/val/test view of a Vulcan source.

Worker-side: stage vacua#

  • Vulcan.write()

  • Vulcan.render_run_id()

  • Vulcan.list_pending()

  • Vulcan.remaining_budget()

Sync tier: HuggingFace commits#

Read path: query and fetch#

  • Vulcan.reader()

  • Vulcan.query()

  • Vulcan.fetch_run()

  • VulcanReader.catalog()

  • VulcanReader.query()

  • VulcanReader.fetch_shard()

ML view: train / val / test splits#

Submodules#

schema

Production parquet schema for the Vulcan tier.

runid

Run-ID template engine for Vulcan.

writer

Worker-side shard writer for Vulcan.

ratelimit

Rolling-window commit-rate budget for the Vulcan sync tier.

hf_io

HuggingFace Hub I/O wrappers for Vulcan.

sync

Vulcan sync tier: batch staged shards into HuggingFace commits.

reader

Read path for Vulcan: catalogue scan, lazy shard fetch, and query.

ml_view

ML-friendly view of a Vulcan repo.

Environment variables#

  • STRINGFORGE_VULCAN_REPO — target HuggingFace dataset repo. No package-level default; each Vulcan must be told where to write.

  • STRINGFORGE_VULCAN_STAGING_DIR — local staging root. When unset (and no explicit staging_dir= is passed to Vulcan), stringforge.vulcan.resolve_staging_dir() falls back to <repo_root>/vulcan_staging/ inside a stringforge source checkout, or <stringforge.data_dir>/vulcan_staging/ otherwise.

  • STRINGFORGE_VULCAN_PROJECT — default {project} substitution for the run-id template.

  • STRINGFORGE_VULCAN_BUDGET — commits-per-hour ceiling (default 90; the HuggingFace cap is 100).

  • STRINGFORGE_VULCAN_TOKEN — HuggingFace write token. Falls back to HF_TOKEN.

Public setters in stringforge: stringforge.set_vulcan_repo(), stringforge.set_vulcan_staging_dir(), stringforge.set_vulcan_project(), stringforge.set_vulcan_budget(), stringforge.set_vulcan_token().