Vulcan: production vacuum-forging for cluster workloads#
Audience. Researchers running large flux-vacuum scans across many cluster nodes who want to publish the results to a shared HuggingFace dataset repo without tripping over HuggingFace’s commit-rate limits.
What you will learn. How to use stringforge.vulcan.Vulcan to stage vacuum batches from worker nodes, sync them in batched commits, and query / consume the resulting HuggingFace dataset for further analysis or ML training.
Where Vulcan sits in the ecosystem.
stringforge.vacua_writer.VacuaWriter– designates low-volume, paper-aligned vacua into the curatedvacua_vaultrepo.stringforge.vulcan.Vulcan– forges high-volume cluster output into a separate production repo with a fixed shard-level schema (the vault floor columnsflux,moduli_re,moduli_im,tau_re,tau_implus auto-populatedrun_id,geometry_id, geometry keys,tadpole_charge,created_at) and deterministicgeometry_id-hashed train/val/test splits, so downstream ML pipelines can consume the dataset without bespoke per-run plumbing.
The two repositories are deliberately distinct: the curated vault is for papers, Vulcan is for production runs. They share the parquet floor (flux, moduli_re, moduli_im, tau_re, tau_im), so a future promotion step can lift a Vulcan run into the curated vault.
The single architectural fact that drives everything below. HuggingFace caps direct commits at 100 per hour per repo. A cluster running 200 parallel workers, each committing on completion, would blow through that cap in seconds. Vulcan solves this by decoupling worker writes (local, no HF I/O) from sync (head-node-only, batched into one commit per ~500 files).
1. The write / sync split#
cluster worker --> Vulcan.write(...) (local parquet, no HF)
cluster worker --> Vulcan.write(...) (local parquet, no HF)
cluster worker --> Vulcan.write(...) (local parquet, no HF)
|
v
{staging_dir}/pending/
|
v
Vulcan.sync() or `python -m stringforge.vulcan sync`
|
v batched HfApi.create_commit(operations=[...])
HF repo
Worker code stays simple: it imports Vulcan, calls forge.write(...), and exits. No HuggingFace token on worker nodes. No network on worker nodes. No rate-limit concern.
The sync tier is the only part of the system that touches HuggingFace. It runs on a head node (cron, daemon, or manual invocation), batches up to max_batch=500 files into a single create_commit, and respects a rolling-window budget (default 90 commits/hour – a 10-commit safety margin under HF’s 100/hour cap).
Shared-filesystem requirement. On a real cluster,
{staging_dir}must be a path visible to every worker AND the head node (NFS, Lustre, or job-local scratch followed by post-job rsync to a shared location). It is the bus that carries shards fromwriteon the workers tosyncon the head node; if each worker has its own private staging dir, the head-node sync will see an emptypending/. Thetempfile.mkdtemp(...)used in section 2 is for the single-process demo only – do not copy it into a job-submission script.
2. Configure Vulcan for the run#
Most clusters want to set the configuration once via environment variables and use Vulcan.from_env() everywhere:
import os
import tempfile
from pathlib import Path
# In a real cluster run you'd set these in the job submission script
# (or in your shell profile on the head node) and forget about them.
_tmp = Path(tempfile.mkdtemp(prefix='vulcan_demo_')).resolve()
os.environ['STRINGFORGE_VULCAN_REPO'] = 'aschachner/vacua_forge_demo'
os.environ['STRINGFORGE_VULCAN_STAGING_DIR'] = str(_tmp)
os.environ['STRINGFORGE_VULCAN_PROJECT'] = 'demo'
os.environ['STRINGFORGE_VULCAN_BUDGET'] = '90' # commits/hour ceiling
# os.environ['STRINGFORGE_VULCAN_TOKEN'] = ... # set on the head node only
from stringforge.vulcan import Vulcan
forge = Vulcan.from_env()
forge
3. Run-id template: flexible naming with optional keys#
Vulcan generates a run_id per write() call. The default template carries the date, project tag, Hodge numbers, KS/triangulation IDs, and seed; placeholders ending in ? fall back to na when the value is unknown (e.g. a CICY model has no ks_id):
DEFAULT_TEMPLATE = (
"{date}_{project}_h11-{h11?}_h12-{h12?}_"
"ks-{ks_id?}_triang-{triang_id?}_seed-{seed?}"
)
Override the template at construction time to add project-specific keys – for instance a {conifold_id} field for KKLT runs or a {variant} field for ISD scans:
kklt_forge = Vulcan(
repo='aschachner/vacua_forge_demo',
staging_dir=_tmp,
project='kklt-scan',
run_id_template=(
'{date}_{project}_h12-{h12}_ks-{ks_id}_'
'conifold-{conifold_id?}_variant-{variant?}_seed-{seed?}'
),
)
kklt_forge.render_run_id(h12=2, ks_id=384564, conifold_id=7, variant='isd-only', seed=42)
Best practice in a cluster context: put job-identity information (array_id, seed, variant) into extra_kwargs once at the top of the worker script, then never think about run_ids again – every forge.write(...) automatically renders a unique, traceable identifier.
4. Worker-side: stage a batch of vacua#
Inside a worker process (one per cluster node, or one per JAX-vacua-search invocation), call forge.write() once per converged batch of vacua. The DataFrame must carry the vault floor columns (flux, moduli_re, moduli_im, tau_re, tau_im); Vulcan fills in run_id, geometry_id, the geometry keys, tadpole_charge, and created_at itself.
In this notebook, write and sync run in the same Python process for clarity; section 5.1 formalises the real-cluster split.
Each forge.write(...) mints a fresh run_id from the current UTC second (see DEFAULT_TEMPLATE in vulcan/runid.py), so re-executing this cell stages an additional shard rather than overwriting. To start clean, re-run section 2.
import pandas as pd
import numpy as np
# Each cluster node would normally produce this DataFrame from its
# vacuum search (e.g. via JAXVacua's FluxVacuaFinder). Here we
# synthesise something representative.
n_vacua = 8
rng = np.random.default_rng(42)
vacua = pd.DataFrame({
# one row per converged vacuum; in a real search each row has its own flux
'flux': [list(rng.integers(-3, 4, size=6)) for _ in range(n_vacua)],
'moduli_re': [list(rng.normal(0, 0.5, size=2)) for _ in range(n_vacua)],
'moduli_im': [list(2.0 + 0.5 * rng.standard_normal(2)) for _ in range(n_vacua)],
'tau_re': rng.normal(0, 0.1, size=n_vacua),
'tau_im': 2.5 + 1.0 * rng.random(n_vacua),
'residual': 1e-11 * rng.random(n_vacua),
'is_susy': [True] * n_vacua,
'solver_name': ['newton'] * n_vacua,
'n_iterations': rng.integers(5, 30, size=n_vacua),
})
geometry = {'h11': 3, 'h12': 2, 'ks_id': 384564, 'triang_id': 0}
shard = forge.write(
vacua,
geometry=geometry,
tadpole_charge=12,
solver={'name': 'newton', 'config_hash': 'abc123'},
provenance={'git_sha': 'deadbeef', 'seed': 42, 'wall_clock_s': 3.4},
)
print('run_id :', shard.run_id)
print('path_in_repo :', shard.path_in_repo)
print('rows staged :', shard.n_rows)
print('parquet on disk :', shard.parquet_path.name)
The dict passed to solver= now also auto-denormalises into a per-row solver_name column (see vulcan/writer.py); downstream queries can filter on either.
Things to notice:
No HuggingFace I/O happened. This call is safe to make from any cluster node, including ones without network access.
The parquet write is atomic. The body is flushed to
<file>.parquet.tmpandos.replace-d into place. A killed worker leaves at most a stray.tmpfile – never a half-written shard the sync tier could mistakenly upload.A sidecar JSON is written next to the parquet. This is local-only provenance bookkeeping; sidecars are never committed to HuggingFace (the same blob is already embedded in the parquet’s kv-metadata).
5. Best practices for cluster runs#
These are the patterns that have survived contact with real cluster workloads. Skim them once, then bake them into your job-submission template.
5.1 Separate worker nodes from the sync head node#
Worker nodes get the
STRINGFORGE_VULCAN_STAGING_DIRenv var pointing to a shared filesystem path (NFS / Lustre / scratch + post-job rsync). They do not getSTRINGFORGE_VULCAN_TOKEN.A head node (one) gets both
STRINGFORGE_VULCAN_STAGING_DIR(same path) andSTRINGFORGE_VULCAN_TOKEN. It runs the sync tier; nothing else.This guarantees only one HF endpoint per repo per cluster – the rate-budget accounting in
{staging_dir}/_commit_budget.jsonis reliable.
5.2 Stage in batches; do not write one shard per vacuum#
Each
forge.write()produces one shard. With 1000 vacua per geometry, write once per geometry batch, not once per vacuum. The schema validatesgeometry_idhomogeneity per shard – the writer enforces this for you.
5.3 Schedule the sync via cron#
*/15 * * * * /path/to/venv/bin/python -m stringforge.vulcan sync \
--repo aschachner/vacua_forge_prod \
--budget 90 --max-batch 500
Every 15 min the head node drains pending/ into a batched commit. At 500 files / commit and 4 commits / hour you can comfortably absorb up to 2000 distinct shards per hour with 86 commits / hour of headroom for ad-hoc operations (the 90/hour default minus the 4 cron-driven syncs).
5.4 Verify with status before scheduling the sync#
python -m stringforge.vulcan status
prints the staging directory and pending shard count. Run it once before adding the cron entry; you should see pending shards: 0 when the cluster is quiet and a non-zero count between worker bursts.
5.5 If you genuinely exceed 100 commits/hour#
Two escape valves – only enable when the four mitigations above (separated write/sync nodes (5.1), per-geometry batched staging (5.2), cron-scheduled batched sync (5.3), and status verification (5.4)) plus the 90-commit budget prove insufficient:
Sharded commits across multiple repos. Run two or more
Vulcaninstances, each pointing at a different HF repo, with consistent-hashing ongeometry_idto pick which. Aggregate throughput beyond 100 commits/hour at the cost of query-side complexity.Pull-request mode.
forge.sync(create_pr=True)opens PRs instead of pushing to the default branch. Useful when a project wants a review step before promotion.
5.6 Recovery patterns#
If the head node crashes mid-sync, the next
vulcan syncinvocation simply picks up where the previous one stopped – pending shards stay inpending/until a successful commit moves them tosynced/.Files in
failed/are shards that hit persistent429/503responses. Inspect, fix the credential / repo configuration, thenmv staging/failed/* staging/pending/and re-sync.synced/is kept after upload as a local provenance ledger. Trim it on a slow cadence (weekly) if disk pressure is a concern.
(There is a small window between a successful HF commit and the local move to synced/; a crash inside it would cause the next sync to re-upload the same files. The HF commit is a no-op in content terms but still consumes one budget slot. No data loss in either case.)
6. The sync tier in dry-run mode#
Before pointing vulcan sync at a real HuggingFace repo, exercise the full pipeline locally with dry_run=True. No network call is made, and pending shards stay in pending/ so the test is non-destructive.
We deliberately set max_batch=2 so two commits are issued instead of one, exposing both the per-commit grouping and the budget accounting.
# Stage a few more shards across distinct geometries so the batching
# behaviour is visible.
for i, ks in enumerate([12345, 67890, 11111]):
geom_i = {'h11': 3, 'h12': 2, 'ks_id': ks, 'triang_id': 0}
forge.write(
vacua,
geometry=geom_i,
tadpole_charge=12,
provenance={'seed': 100 + i},
)
print('pending before sync :', len(forge.list_pending()))
report = forge.sync(max_batch=2, dry_run=True)
print('commits issued :', report.n_commits)
print('committed (rows) :', report.n_committed)
print('failed :', report.n_failed)
print('remaining :', report.n_remaining)
print('budget left now :', forge.remaining_budget())
Notice that remaining_budget() decreased even though no HF call happened – dry-run still reserves slots so a dry-run-then-real workflow is conservative.
# For the rest of this tutorial we need the shards in ``synced/`` so the
# reader in section 7 has something to scan. The dry-run sync above
# deliberately leaves shards in ``pending/`` (so a dry-run-then-real
# workflow stays conservative), so here we promote them locally with
# the same helper the real sync tier uses on success. No HF I/O occurs.
from stringforge.vulcan.writer import SYNCED_DIRNAME, list_pending, move_to
for shard in list_pending(forge.staging_dir):
move_to(shard, forge.staging_dir, SYNCED_DIRNAME)
print('pending after promotion:', len(forge.list_pending()))
7. Read-side: query the synced shards#
After sync, the shards are mirrored under staging_dir/synced/. VulcanReader.from_local(staging_dir) builds a lightweight catalogue from each shard’s parquet kv-metadata (no row groups are read at catalogue time), and query() filters on either catalogue-level keys (h11, h12, ks_id, …) or row-level predicates (solver_name, is_susy, …).
reader = forge.reader()
catalog = reader.catalog()
print('shards in catalogue :', len(catalog))
catalog[['run_id', 'h11', 'h12', 'ks_id', 'n_rows']].head()
# Catalogue-level filter -- only the shards whose ks_id matches are
# even loaded from disk.
rows = forge.query(ks_id=384564)
print(f'rows matching ks_id=384564: {len(rows)}')
rows[['run_id', 'tau_im', 'is_susy', 'solver_name']].head()
# Combined catalogue + row-level filter.
susy = forge.query(h12=2, solver_name='newton', is_susy=True)
print(f'SUSY newton rows at h12=2: {len(susy)}')
8. ML view: deterministic train / val / test splits#
Vulcan.ml_view() returns a VulcanMLView that hashes each geometry_id to a fixed split. Rows from the same Calabi-Yau geometry always land in the same split – no leakage. The split is deterministic across processes, machines, and Python versions, so re-running training with a refreshed dataset produces the same partitioning.
view = forge.ml_view()
split_table = view.split_assignments()
split_table.head()
from stringforge.vulcan import FeatureSpec
# A typical featurisation: pad variable-length list columns to a fixed
# length so the result is directly tensorisable by torch / JAX.
spec = FeatureSpec(
features=['flux', 'moduli_re', 'moduli_im', 'tau_re', 'tau_im', 'is_susy'],
list_max_len=12,
list_fill=0,
keep_geometry_id=True,
)
train = view.as_dataframe('train', feature_spec=spec)
val = view.as_dataframe('val', feature_spec=spec)
test = view.as_dataframe('test', feature_spec=spec)
for name, df in [('train', train), ('val', val), ('test', test)]:
print(f'{name:5s}: {len(df):5d} rows, {df["geometry_id"].nunique() if not df.empty else 0} distinct geometries')
If you have datasets installed, view.as_hf_dataset('train') returns a streaming datasets.Dataset ready for torch.utils.data.DataLoader or JAX pipelines.
9. Inspecting the staging layout#
Final sanity check – the staging directory after a dry-run sync:
for sub in ('pending', 'synced', 'failed'):
n = sum(1 for _ in (_tmp / sub).rglob('*.parquet'))
print(f'{sub:7s}: {n} parquet file(s)')
print('budget state file:', (_tmp / '_commit_budget.json').exists())
10. Summary – the rules for production cluster runs#
Workers stage, the head node syncs. Worker code calls
forge.write(...); only one head-node process runsforge.sync(...).Use a shared filesystem for staging. All workers and the sync head node must agree on
STRINGFORGE_VULCAN_STAGING_DIR.Set
STRINGFORGE_VULCAN_TOKENonly on the head node. Tokens have no business on worker nodes.One write call per converged batch. Not per vacuum. The schema validates
geometry_idhomogeneity per shard.Use the default budget (90 commits/hour). It leaves a 10-commit safety margin under HuggingFace’s 100/hour cap.
Cron the sync at 5-15 minute cadence. That covers any reasonable cluster throughput without burning commit budget on near-empty batches.
Dry-run-then-real. Use
forge.sync(dry_run=True)on a representative load before the first real run.Monitor
failed/. Persistent failures land here; inspect, fix configuration, move back topending/.Trust the geometry-hash splits.
VulcanMLViewkeeps the samegeometry_idin the same split across every training run – it’s safe to re-derive splits from scratch each time.