Vulcan: production vacuum-forging for cluster workloads

Vulcan: production vacuum-forging for cluster workloads#

Audience. Researchers running large flux-vacuum scans across many cluster nodes who want to publish the results to a shared HuggingFace dataset repo without tripping over HuggingFace’s commit-rate limits.

What you will learn. How to use stringforge.vulcan.Vulcan to stage vacuum batches from worker nodes, sync them in batched commits, and query / consume the resulting HuggingFace dataset for further analysis or ML training.

Where Vulcan sits in the ecosystem.

stringforge.vacua_writer.VacuaWriter – designates low-volume, paper-aligned vacua into the curated vacua_vault repo.
stringforge.vulcan.Vulcan – forges high-volume cluster output into a separate production repo with a fixed shard-level schema (the vault floor columns flux, moduli_re, moduli_im, tau_re, tau_im plus auto-populated run_id, geometry_id, geometry keys, tadpole_charge, created_at) and deterministic geometry_id-hashed train/val/test splits, so downstream ML pipelines can consume the dataset without bespoke per-run plumbing.

The two repositories are deliberately distinct: the curated vault is for papers, Vulcan is for production runs. They share the parquet floor (flux, moduli_re, moduli_im, tau_re, tau_im), so a future promotion step can lift a Vulcan run into the curated vault.

The single architectural fact that drives everything below. HuggingFace caps direct commits at 100 per hour per repo. A cluster running 200 parallel workers, each committing on completion, would blow through that cap in seconds. Vulcan solves this by decoupling worker writes (local, no HF I/O) from sync (head-node-only, batched into one commit per ~500 files).

1. The write / sync split#

cluster worker  --> Vulcan.write(...)         (local parquet, no HF)
cluster worker  --> Vulcan.write(...)         (local parquet, no HF)
cluster worker  --> Vulcan.write(...)         (local parquet, no HF)
                          |
                          v
                  {staging_dir}/pending/
                          |
                          v
                Vulcan.sync()  or  `python -m stringforge.vulcan sync`
                          |
                          v   batched HfApi.create_commit(operations=[...])
                       HF repo

Worker code stays simple: it imports Vulcan, calls forge.write(...), and exits. No HuggingFace token on worker nodes. No network on worker nodes. No rate-limit concern.

The sync tier is the only part of the system that touches HuggingFace. It runs on a head node (cron, daemon, or manual invocation), batches up to max_batch=500 files into a single create_commit, and respects a rolling-window budget (default 90 commits/hour – a 10-commit safety margin under HF’s 100/hour cap).

Shared-filesystem requirement. On a real cluster, {staging_dir} must be a path visible to every worker AND the head node (NFS, Lustre, or job-local scratch followed by post-job rsync to a shared location). It is the bus that carries shards from write on the workers to sync on the head node; if each worker has its own private staging dir, the head-node sync will see an empty pending/. The tempfile.mkdtemp(...) used in section 2 is for the single-process demo only – do not copy it into a job-submission script.

2. Configure Vulcan for the run#

Most clusters want to set the configuration once via environment variables and use Vulcan.from_env() everywhere:

import os
import tempfile
from pathlib import Path

# In a real cluster run you'd set these in the job submission script
# (or in your shell profile on the head node) and forget about them.
_tmp = Path(tempfile.mkdtemp(prefix='vulcan_demo_')).resolve()
os.environ['STRINGFORGE_VULCAN_REPO']         = 'aschachner/vacua_forge_demo'
os.environ['STRINGFORGE_VULCAN_STAGING_DIR']  = str(_tmp)
os.environ['STRINGFORGE_VULCAN_PROJECT']      = 'demo'
os.environ['STRINGFORGE_VULCAN_BUDGET']       = '90'   # commits/hour ceiling
# os.environ['STRINGFORGE_VULCAN_TOKEN']      = ...    # set on the head node only

from stringforge.vulcan import Vulcan
forge = Vulcan.from_env()
forge

3. Run-id template: flexible naming with optional keys#

Vulcan generates a run_id per write() call. The default template carries the date, project tag, Hodge numbers, KS/triangulation IDs, and seed; placeholders ending in ? fall back to na when the value is unknown (e.g. a CICY model has no ks_id):

DEFAULT_TEMPLATE = (
    "{date}_{project}_h11-{h11?}_h12-{h12?}_"
    "ks-{ks_id?}_triang-{triang_id?}_seed-{seed?}"
)

Override the template at construction time to add project-specific keys – for instance a {conifold_id} field for KKLT runs or a {variant} field for ISD scans:

kklt_forge = Vulcan(
    repo='aschachner/vacua_forge_demo',
    staging_dir=_tmp,
    project='kklt-scan',
    run_id_template=(
        '{date}_{project}_h12-{h12}_ks-{ks_id}_'
        'conifold-{conifold_id?}_variant-{variant?}_seed-{seed?}'
    ),
)
kklt_forge.render_run_id(h12=2, ks_id=384564, conifold_id=7, variant='isd-only', seed=42)

Best practice in a cluster context: put job-identity information (array_id, seed, variant) into extra_kwargs once at the top of the worker script, then never think about run_ids again – every forge.write(...) automatically renders a unique, traceable identifier.

4. Worker-side: stage a batch of vacua#

Inside a worker process (one per cluster node, or one per JAX-vacua-search invocation), call forge.write() once per converged batch of vacua. The DataFrame must carry the vault floor columns (flux, moduli_re, moduli_im, tau_re, tau_im); Vulcan fills in run_id, geometry_id, the geometry keys, tadpole_charge, and created_at itself.

In this notebook, write and sync run in the same Python process for clarity; section 5.1 formalises the real-cluster split.

Each forge.write(...) mints a fresh run_id from the current UTC second (see DEFAULT_TEMPLATE in vulcan/runid.py), so re-executing this cell stages an additional shard rather than overwriting. To start clean, re-run section 2.

import pandas as pd
import numpy as np

# Each cluster node would normally produce this DataFrame from its
# vacuum search (e.g. via JAXVacua's FluxVacuaFinder).  Here we
# synthesise something representative.
n_vacua = 8
rng = np.random.default_rng(42)
vacua = pd.DataFrame({
    # one row per converged vacuum; in a real search each row has its own flux
    'flux': [list(rng.integers(-3, 4, size=6)) for _ in range(n_vacua)],
    'moduli_re':  [list(rng.normal(0, 0.5, size=2))          for _ in range(n_vacua)],
    'moduli_im':  [list(2.0 + 0.5 * rng.standard_normal(2)) for _ in range(n_vacua)],
    'tau_re':     rng.normal(0, 0.1, size=n_vacua),
    'tau_im':     2.5 + 1.0 * rng.random(n_vacua),
    'residual':   1e-11 * rng.random(n_vacua),
    'is_susy':    [True] * n_vacua,
    'solver_name': ['newton'] * n_vacua,
    'n_iterations': rng.integers(5, 30, size=n_vacua),
})

geometry = {'h11': 3, 'h12': 2, 'ks_id': 384564, 'triang_id': 0}

shard = forge.write(
    vacua,
    geometry=geometry,
    tadpole_charge=12,
    solver={'name': 'newton', 'config_hash': 'abc123'},
    provenance={'git_sha': 'deadbeef', 'seed': 42, 'wall_clock_s': 3.4},
)
print('run_id          :', shard.run_id)
print('path_in_repo    :', shard.path_in_repo)
print('rows staged     :', shard.n_rows)
print('parquet on disk :', shard.parquet_path.name)

The dict passed to solver= now also auto-denormalises into a per-row solver_name column (see vulcan/writer.py); downstream queries can filter on either.

Things to notice:

No HuggingFace I/O happened. This call is safe to make from any cluster node, including ones without network access.
The parquet write is atomic. The body is flushed to <file>.parquet.tmp and os.replace-d into place. A killed worker leaves at most a stray .tmp file – never a half-written shard the sync tier could mistakenly upload.
A sidecar JSON is written next to the parquet. This is local-only provenance bookkeeping; sidecars are never committed to HuggingFace (the same blob is already embedded in the parquet’s kv-metadata).

5. Best practices for cluster runs#

These are the patterns that have survived contact with real cluster workloads. Skim them once, then bake them into your job-submission template.

5.1 Separate worker nodes from the sync head node#

Worker nodes get the STRINGFORGE_VULCAN_STAGING_DIR env var pointing to a shared filesystem path (NFS / Lustre / scratch + post-job rsync). They do not get STRINGFORGE_VULCAN_TOKEN.
A head node (one) gets both STRINGFORGE_VULCAN_STAGING_DIR (same path) and STRINGFORGE_VULCAN_TOKEN. It runs the sync tier; nothing else.
This guarantees only one HF endpoint per repo per cluster – the rate-budget accounting in {staging_dir}/_commit_budget.json is reliable.

5.2 Stage in batches; do not write one shard per vacuum#

Each forge.write() produces one shard. With 1000 vacua per geometry, write once per geometry batch, not once per vacuum. The schema validates geometry_id homogeneity per shard – the writer enforces this for you.

5.3 Schedule the sync via cron#

*/15 * * * * /path/to/venv/bin/python -m stringforge.vulcan sync \
    --repo aschachner/vacua_forge_prod \
    --budget 90 --max-batch 500

Every 15 min the head node drains pending/ into a batched commit. At 500 files / commit and 4 commits / hour you can comfortably absorb up to 2000 distinct shards per hour with 86 commits / hour of headroom for ad-hoc operations (the 90/hour default minus the 4 cron-driven syncs).

5.4 Verify with `status` before scheduling the sync#

python -m stringforge.vulcan status

prints the staging directory and pending shard count. Run it once before adding the cron entry; you should see pending shards: 0 when the cluster is quiet and a non-zero count between worker bursts.

5.5 If you genuinely exceed 100 commits/hour#

Two escape valves – only enable when the four mitigations above (separated write/sync nodes (5.1), per-geometry batched staging (5.2), cron-scheduled batched sync (5.3), and status verification (5.4)) plus the 90-commit budget prove insufficient:

Sharded commits across multiple repos. Run two or more Vulcan instances, each pointing at a different HF repo, with consistent-hashing on geometry_id to pick which. Aggregate throughput beyond 100 commits/hour at the cost of query-side complexity.
Pull-request mode. forge.sync(create_pr=True) opens PRs instead of pushing to the default branch. Useful when a project wants a review step before promotion.

5.6 Recovery patterns#

If the head node crashes mid-sync, the next vulcan sync invocation simply picks up where the previous one stopped – pending shards stay in pending/ until a successful commit moves them to synced/.
Files in failed/ are shards that hit persistent 429/503 responses. Inspect, fix the credential / repo configuration, then mv staging/failed/* staging/pending/ and re-sync.
synced/ is kept after upload as a local provenance ledger. Trim it on a slow cadence (weekly) if disk pressure is a concern.

(There is a small window between a successful HF commit and the local move to synced/; a crash inside it would cause the next sync to re-upload the same files. The HF commit is a no-op in content terms but still consumes one budget slot. No data loss in either case.)

6. The sync tier in dry-run mode#

Before pointing vulcan sync at a real HuggingFace repo, exercise the full pipeline locally with dry_run=True. No network call is made, and pending shards stay in pending/ so the test is non-destructive.

We deliberately set max_batch=2 so two commits are issued instead of one, exposing both the per-commit grouping and the budget accounting.

# Stage a few more shards across distinct geometries so the batching
# behaviour is visible.
for i, ks in enumerate([12345, 67890, 11111]):
    geom_i = {'h11': 3, 'h12': 2, 'ks_id': ks, 'triang_id': 0}
    forge.write(
        vacua,
        geometry=geom_i,
        tadpole_charge=12,
        provenance={'seed': 100 + i},
    )

print('pending before sync :', len(forge.list_pending()))

report = forge.sync(max_batch=2, dry_run=True)
print('commits issued      :', report.n_commits)
print('committed (rows)    :', report.n_committed)
print('failed              :', report.n_failed)
print('remaining           :', report.n_remaining)
print('budget left now     :', forge.remaining_budget())

Notice that remaining_budget() decreased even though no HF call happened – dry-run still reserves slots so a dry-run-then-real workflow is conservative.

# For the rest of this tutorial we need the shards in ``synced/`` so the
# reader in section 7 has something to scan.  The dry-run sync above
# deliberately leaves shards in ``pending/`` (so a dry-run-then-real
# workflow stays conservative), so here we promote them locally with
# the same helper the real sync tier uses on success.  No HF I/O occurs.
from stringforge.vulcan.writer import SYNCED_DIRNAME, list_pending, move_to

for shard in list_pending(forge.staging_dir):
    move_to(shard, forge.staging_dir, SYNCED_DIRNAME)

print('pending after promotion:', len(forge.list_pending()))

7. Read-side: query the synced shards#

After sync, the shards are mirrored under staging_dir/synced/. VulcanReader.from_local(staging_dir) builds a lightweight catalogue from each shard’s parquet kv-metadata (no row groups are read at catalogue time), and query() filters on either catalogue-level keys (h11, h12, ks_id, …) or row-level predicates (solver_name, is_susy, …).

reader = forge.reader()
catalog = reader.catalog()
print('shards in catalogue :', len(catalog))
catalog[['run_id', 'h11', 'h12', 'ks_id', 'n_rows']].head()

# Catalogue-level filter -- only the shards whose ks_id matches are
# even loaded from disk.
rows = forge.query(ks_id=384564)
print(f'rows matching ks_id=384564: {len(rows)}')
rows[['run_id', 'tau_im', 'is_susy', 'solver_name']].head()

# Combined catalogue + row-level filter.
susy = forge.query(h12=2, solver_name='newton', is_susy=True)
print(f'SUSY newton rows at h12=2: {len(susy)}')

8. ML view: deterministic train / val / test splits#

Vulcan.ml_view() returns a VulcanMLView that hashes each geometry_id to a fixed split. Rows from the same Calabi-Yau geometry always land in the same split – no leakage. The split is deterministic across processes, machines, and Python versions, so re-running training with a refreshed dataset produces the same partitioning.

view = forge.ml_view()
split_table = view.split_assignments()
split_table.head()

from stringforge.vulcan import FeatureSpec

# A typical featurisation: pad variable-length list columns to a fixed
# length so the result is directly tensorisable by torch / JAX.
spec = FeatureSpec(
    features=['flux', 'moduli_re', 'moduli_im', 'tau_re', 'tau_im', 'is_susy'],
    list_max_len=12,
    list_fill=0,
    keep_geometry_id=True,
)

train = view.as_dataframe('train', feature_spec=spec)
val   = view.as_dataframe('val',   feature_spec=spec)
test  = view.as_dataframe('test',  feature_spec=spec)
for name, df in [('train', train), ('val', val), ('test', test)]:
    print(f'{name:5s}: {len(df):5d} rows, {df["geometry_id"].nunique() if not df.empty else 0} distinct geometries')

If you have datasets installed, view.as_hf_dataset('train') returns a streaming datasets.Dataset ready for torch.utils.data.DataLoader or JAX pipelines.

9. Inspecting the staging layout#

Final sanity check – the staging directory after a dry-run sync:

for sub in ('pending', 'synced', 'failed'):
    n = sum(1 for _ in (_tmp / sub).rglob('*.parquet'))
    print(f'{sub:7s}: {n} parquet file(s)')
print('budget state file:', (_tmp / '_commit_budget.json').exists())

10. Summary – the rules for production cluster runs#

Workers stage, the head node syncs. Worker code calls forge.write(...); only one head-node process runs forge.sync(...).
Use a shared filesystem for staging. All workers and the sync head node must agree on STRINGFORGE_VULCAN_STAGING_DIR.
Set STRINGFORGE_VULCAN_TOKEN only on the head node. Tokens have no business on worker nodes.
One write call per converged batch. Not per vacuum. The schema validates geometry_id homogeneity per shard.
Use the default budget (90 commits/hour). It leaves a 10-commit safety margin under HuggingFace’s 100/hour cap.
Cron the sync at 5-15 minute cadence. That covers any reasonable cluster throughput without burning commit budget on near-empty batches.
Dry-run-then-real. Use forge.sync(dry_run=True) on a representative load before the first real run.
Monitor failed/. Persistent failures land here; inspect, fix configuration, move back to pending/.
Trust the geometry-hash splits. VulcanMLView keeps the same geometry_id in the same split across every training run – it’s safe to re-derive splits from scratch each time.