sync

Vulcan sync tier: batch staged shards into HuggingFace commits.

A single sync_pending() call moves the contents of {staging_dir}/pending/ into synced/ (or failed/) by issuing batched HfApi.create_commit operations. At most DEFAULT_BUDGET commits are issued per rolling one-hour window (subject to the persisted budget state), and each commit can carry up to max_batch files.
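As a rough illustration of how batch size and commit budget interact, the sketch below splits a list of shard names into batches of at most max_batch files and commits only as many batches as there are free slots. The helper name plan_batches and its slots parameter are hypothetical and not part of stringforge; the defaults simply mirror the values documented on this page.

```python
from typing import List, Sequence, Tuple


def plan_batches(
    shards: Sequence[str], max_batch: int = 500, slots: int = 90
) -> Tuple[List[List[str]], List[str]]:
    """Split shards into commit-sized batches, limited by available slots.

    Hypothetical helper: returns (batches_to_commit, shards_left_pending).
    """
    if max_batch < 1:
        raise ValueError("max_batch must be >= 1")
    slots = max(slots, 0)
    # One batch per commit, each holding at most max_batch files.
    batches = [list(shards[i:i + max_batch]) for i in range(0, len(shards), max_batch)]
    to_commit = batches[:slots]
    # Anything beyond the available slots stays pending for a later run.
    left = [name for batch in batches[slots:] for name in batch]
    return to_commit, left
```

Under these assumptions, a full window with the defaults could carry up to 90 × 500 = 45,000 files.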

On a 429 from HuggingFace the offending batch is retried with exponential backoff that honours the server’s Retry-After; persistent failures land in failed/ for manual inspection.
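The retry delay described above might be computed along the following lines. This is a minimal sketch, not the module's actual logic: the function name backoff_delay, the base and cap values, and the choice to treat Retry-After as a lower bound on the exponential delay are all assumptions.

```python
from typing import Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base_s: float = 1.0,
    cap_s: float = 300.0,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Assumption: the exponential delay is base_s * 2**attempt, capped at cap_s,
    and a numeric Retry-After header, when present, acts as a lower bound.
    """
    delay = min(cap_s, base_s * (2 ** attempt))
    if retry_after is not None:
        try:
            # Honour the server's hint even when it exceeds the local cap.
            delay = max(delay, float(retry_after))
        except ValueError:
            # Retry-After may also be an HTTP date; this sketch ignores that form.
            pass
    return delay
```

Treating the header as a floor rather than a replacement means the client never retries sooner than the server asked, while still backing off further on repeated failures.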

stringforge.vulcan.sync.BUDGET_STATE_FILENAME: str = '_commit_budget.json'

Filename of the persisted commit-window state under the staging directory. There is one file per Vulcan staging dir: the budget is tracked per staging dir, not per repo, so a single sync process targeting multiple repos through one staging area shares one budget across those repos. (For multi-repo writes, use one staging dir per repo.)

stringforge.vulcan.sync.DEFAULT_MAX_BATCH: int = 500

Default maximum files per commit. HuggingFace places no strict cap, but very large commits slow down the Hub UI and the diff viewer. A limit of 500 keeps each batch of shards usable there.

stringforge.vulcan.sync.DEFAULT_MAX_RETRIES: int = 3

Default maximum retries for a single batch that hits a rate-limit response. Three retries with exponential backoff cover most transient outages; beyond that, the batch is moved to failed/.

class stringforge.vulcan.sync.SyncReport(n_committed=0, n_failed=0, n_remaining=0, n_commits=0, commit_oids=<factory>, errors=<factory>)

Bases: object

Summary of a single sync_pending() invocation.

n_committed

Total files that landed in successful commits.

n_failed

Files moved into failed/ after exhausting retries.

n_remaining

Files still in pending/ (budget-exhausted).

n_commits

Number of HF commits issued.

commit_oids

Per-commit OIDs (None entries for dry-run).

errors

Strings describing per-batch failures, when any.
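The constructor signature above (with <factory> defaults for the two list fields) is consistent with a dataclass. The stand-in below mirrors that shape so the fields can be experimented with in isolation. SyncReportSketch and summarize are hypothetical names, and the element types of commit_oids and errors are assumptions inferred from the field descriptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SyncReportSketch:
    """Stand-in with the same field names as SyncReport (types assumed)."""

    n_committed: int = 0
    n_failed: int = 0
    n_remaining: int = 0
    n_commits: int = 0
    # default_factory gives every instance its own fresh list.
    commit_oids: List[Optional[str]] = field(default_factory=list)
    errors: List[str] = field(default_factory=list)


def summarize(report: SyncReportSketch) -> str:
    """Render a one-line status from the report counters."""
    clean = not report.errors and report.n_failed == 0
    status = "ok" if clean else "needs attention"
    return (
        f"{status}: {report.n_committed} files in {report.n_commits} commits, "
        f"{report.n_failed} failed, {report.n_remaining} still pending"
    )
```

A caller would typically check n_failed and errors first, then use n_remaining to decide whether another invocation is needed.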

stringforge.vulcan.sync.remaining_budget(staging_dir, *, budget=90, window_s=3600.0)

Report how many commit slots are currently available for the given staging directory.

Parameters:
  • staging_dir (Path) – Vulcan staging root.

  • budget (int) – Ceiling to evaluate against (defaults to the module default).

  • window_s (float) – Window length in seconds.

Returns:

int – Non-negative slot count.

Return type:

int
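To make the rolling-window idea concrete, the sketch below counts free slots from a list of past commit timestamps. It does not read the persisted state file, whose format is not described here; slots_available and its commit_times argument are hypothetical, and only the budget and window_s defaults come from the signature above.

```python
import time
from typing import Iterable, Optional


def slots_available(
    commit_times: Iterable[float],
    budget: int = 90,
    window_s: float = 3600.0,
    now: Optional[float] = None,
) -> int:
    """Count free slots given timestamps (epoch seconds) of past commits."""
    now = time.time() if now is None else now
    cutoff = now - window_s
    # Only commits newer than the cutoff still occupy a slot.
    used = sum(1 for t in commit_times if t > cutoff)
    # Clamp at zero so the result is never negative.
    return max(0, budget - used)
```

Because the window rolls, slots free up one at a time as individual commits age past window_s, rather than all at once on the hour.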

stringforge.vulcan.sync.sync_pending(staging_dir, *, repo_id, token=None, budget=90, window_s=3600.0, max_batch=500, max_retries=3, repo_type='dataset', create_pr=False, revision=None, dry_run=False, commit_prefix='vulcan: forge')

Drain {staging_dir}/pending/ into HuggingFace commits.

Shards are processed in batches of at most max_batch files. Each batch consumes one slot from the rolling-window budget (advisory: see CommitBudget); once the budget is exhausted, remaining batches stay in pending/ for a later invocation to pick up. Successful batches move their shards to synced/; persistent failures (after max_retries rate-limit retries with exponential backoff) move to failed/ for manual inspection.

Parameters:
  • staging_dir (Path) – Vulcan staging root.

  • repo_id (str) – HuggingFace dataset repo target.

  • token (Optional[str]) – HF write token; falls back to HF_TOKEN.

  • budget (int) – Maximum commits permitted per rolling window_s.

  • window_s (float) – Window length (seconds). Defaults to 1 hour.

  • max_batch (int) – Maximum files per HuggingFace commit.

  • max_retries (int) – Per-batch retry budget on 429/503 responses.

  • repo_type (str) – HF repo type; default "dataset".

  • create_pr (bool) – Open PRs instead of pushing to the default branch.

  • revision (Optional[str]) – HF revision name; defaults to the repo’s default branch.

  • dry_run (bool) – Skip the network call. Shards stay in pending/; budget slots are still reserved.

  • commit_prefix (str) – Prefix for the auto-generated commit message.

Returns:

SyncReport – Summary of what landed, what remains, and what failed.

Return type:

SyncReport
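Since budget-exhausted shards stay in pending/ for a later invocation, a caller may want to re-invoke the sync until n_remaining reaches zero. The loop below is one possible driver, written against any zero-argument callable that returns an object with an n_remaining attribute. The name drain, its parameters, and the one-hour pause are assumptions for illustration, not part of stringforge.

```python
import time
from typing import Any, Callable, List


def drain(
    sync_once: Callable[[], Any],
    max_rounds: int = 24,
    pause_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Any]:
    """Call sync_once until its report shows n_remaining == 0 or rounds run out."""
    reports: List[Any] = []
    for round_no in range(max_rounds):
        report = sync_once()
        reports.append(report)
        if report.n_remaining == 0:
            break
        # Wait for budget slots to free up, but not after the final round.
        if round_no < max_rounds - 1:
            sleep(pause_s)
    return reports
```

In practice sync_once could be built with functools.partial over sync_pending, binding the staging directory and repo_id. Injecting sleep keeps the loop easy to exercise without real waiting.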