sync
Vulcan sync tier: batch staged shards into HuggingFace commits.
A single sync_pending() call moves the contents of
{staging_dir}/pending/ into synced/ (or failed/) by
issuing batched HfApi.create_commit operations. At most
DEFAULT_BUDGET commits are
issued per rolling-window hour (subject to the persisted budget
state); each commit can carry up to max_batch files.
On a 429 from HuggingFace the offending batch is retried with
exponential backoff that honours the server’s Retry-After;
persistent failures land in failed/ for manual inspection.
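The flow described above can be sketched without touching the network. The following is a minimal illustration, not the module's implementation: `drain_pending`, its `commit` callable (a stand-in for `HfApi.create_commit`) and the `slots` argument are hypothetical names, and it assumes that a failed batch still consumes a slot and that only successful batches count as commits.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def drain_pending(
    staging_dir: Path,
    commit: Callable[[List[Path]], None],  # hypothetical stand-in for the HF commit call
    *,
    slots: int,
    max_batch: int = 500,
) -> Dict[str, int]:
    """Move shards out of pending/ one batch (one commit slot) at a time."""
    pending = staging_dir / "pending"
    synced = staging_dir / "synced"
    failed = staging_dir / "failed"
    for directory in (pending, synced, failed):
        directory.mkdir(parents=True, exist_ok=True)

    shards = sorted(p for p in pending.iterdir() if p.is_file())
    report = {"n_committed": 0, "n_failed": 0, "n_remaining": 0, "n_commits": 0}

    for start in range(0, len(shards), max_batch):
        batch = shards[start:start + max_batch]
        if slots <= 0:
            # Budget exhausted: leave the batch in pending/ for a later call.
            report["n_remaining"] += len(batch)
            continue
        slots -= 1  # assumption: every attempted batch uses one slot
        try:
            commit(batch)
        except Exception:
            dest = failed
            report["n_failed"] += len(batch)
        else:
            dest = synced
            report["n_committed"] += len(batch)
            report["n_commits"] += 1
        for shard in batch:
            shutil.move(str(shard), str(dest / shard.name))
    return report


# Five shards, two files per commit, two slots left in the window.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pending").mkdir()
    for i in range(5):
        (root / "pending" / f"shard-{i:03d}.bin").write_bytes(b"x")
    demo_report = drain_pending(root, lambda batch: None, slots=2, max_batch=2)
    demo_left = sorted(p.name for p in (root / "pending").iterdir())
```

With two slots and batches of two, four shards are committed and the fifth stays in `pending/` until the window frees a slot.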
- stringforge.vulcan.sync.BUDGET_STATE_FILENAME: str = '_commit_budget.json'
Filename of the persisted commit-window state under the staging directory. One file per Vulcan staging dir; the budget is per-staging-dir, not per-repo, so a single sync process targeting multiple repos through one staging area shares the budget across repos. (For multi-repo writes you’d use multiple staging dirs.)
- stringforge.vulcan.sync.DEFAULT_MAX_BATCH: int = 500
Default maximum files per commit. HuggingFace places no strict cap, but very large commits slow down the Hub UI and the diff viewer. 500 keeps shard-batches usable.
- stringforge.vulcan.sync.DEFAULT_MAX_RETRIES: int = 3
Default maximum retries on a single batch hitting a rate-limit response. Three retries with exponential backoff covers most transient outages; beyond that we relegate the batch to
failed/.
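One way to combine exponential backoff with a server-supplied Retry-After is to wait for whichever is longer. The sketch below is an assumption about the policy, not the module's code: `backoff_delay`, `base_s` and `cap_s` are hypothetical names, and production code would often add random jitter.

```python
from typing import Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    *,
    base_s: float = 1.0,   # assumed first delay
    cap_s: float = 300.0,  # assumed upper bound on the exponential term
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    exponential = min(cap_s, base_s * (2 ** attempt))
    if retry_after is None:
        return exponential
    # Never retry sooner than the server asked for.
    return max(exponential, retry_after)


# Delays for the default three retries when no Retry-After is sent.
delays = [backoff_delay(n) for n in range(3)]
# A Retry-After of 30 s overrides the shorter exponential delay.
honoured = backoff_delay(0, retry_after=30.0)
```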
- class stringforge.vulcan.sync.SyncReport(n_committed=0, n_failed=0, n_remaining=0, n_commits=0, commit_oids=<factory>, errors=<factory>)
Bases: object
Summary of a single sync_pending() invocation.
- n_committed
Total files that landed in successful commits.
- n_failed
Files moved into failed/ after exhausting retries.
- n_remaining
Files still in pending/ (budget-exhausted).
- n_commits
Number of HF commits issued.
- commit_oids
Per-commit OIDs (None entries for dry-run).
- errors
Strings describing per-batch failures, when any.
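The `<factory>` defaults in the signature suggest a dataclass whose list fields use `default_factory`. A rough equivalent, assuming that shape, is shown below; `SyncReportSketch` is a hypothetical name and the field types are inferred from the attribute descriptions rather than taken from the source.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SyncReportSketch:
    """Illustrative stand-in for SyncReport; types are assumptions."""

    n_committed: int = 0
    n_failed: int = 0
    n_remaining: int = 0
    n_commits: int = 0
    # default_factory gives each instance its own list.
    commit_oids: List[Optional[str]] = field(default_factory=list)
    errors: List[str] = field(default_factory=list)


first = SyncReportSketch()
second = SyncReportSketch(n_committed=500, n_commits=1, commit_oids=["abc123"])
first.errors.append("batch 0: rate limited")
```

Because each instance gets a fresh list, appending to `first.errors` leaves `second.errors` empty.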
- stringforge.vulcan.sync.remaining_budget(staging_dir, *, budget=90, window_s=3600.0)
Report how many commit slots are currently available for the given staging directory.
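A rolling-window budget of this kind can be computed from the timestamps of recent commits. The sketch below is illustrative only: the on-disk layout (`{"commit_times": [...]}`) is an assumption, since the actual schema of `_commit_budget.json` is not documented here, and `slots_left` with its `now` argument is a hypothetical helper.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import List, Optional

STATE_FILENAME = "_commit_budget.json"  # same value as BUDGET_STATE_FILENAME


def slots_left(
    staging_dir: Path,
    *,
    budget: int = 90,
    window_s: float = 3600.0,
    now: Optional[float] = None,  # injectable clock for testing
) -> int:
    """Count commit slots not used within the last `window_s` seconds."""
    now = time.time() if now is None else now
    state_path = staging_dir / STATE_FILENAME
    if not state_path.exists():
        return budget  # nothing recorded yet: the full budget is available
    # Assumed schema: a JSON object holding a list of UNIX timestamps.
    stamps: List[float] = json.loads(state_path.read_text())["commit_times"]
    recent = [t for t in stamps if now - t < window_s]
    return max(0, budget - len(recent))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    fresh = slots_left(root, now=10_000.0)
    state = {"commit_times": [1_000.0, 9_000.0, 9_500.0]}
    (root / STATE_FILENAME).write_text(json.dumps(state))
    after = slots_left(root, now=10_000.0)
```

Here the commit at t=1000 has aged out of the one-hour window, so only two of the 90 slots are in use.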
- stringforge.vulcan.sync.sync_pending(staging_dir, *, repo_id, token=None, budget=90, window_s=3600.0, max_batch=500, max_retries=3, repo_type='dataset', create_pr=False, revision=None, dry_run=False, commit_prefix='vulcan: forge')
Drain {staging_dir}/pending/ into HuggingFace commits.

Shards are processed in batches of at most max_batch files. Each batch consumes one slot from the rolling-window budget (advisory: see CommitBudget); once the budget is exhausted, remaining batches stay in pending/ for a later invocation to pick up. Successful batches move their shards to synced/; persistent failures (after max_retries rate-limit retries with exponential backoff) move to failed/ for manual inspection.
- Parameters:
  - staging_dir (Path) – Vulcan staging root.
  - repo_id (str) – HuggingFace dataset repo target.
  - token (Optional[str]) – HF write token; falls back to HF_TOKEN.
  - budget (int) – Maximum commits permitted per rolling window_s.
  - window_s (float) – Window length (seconds). Defaults to 1 hour.
  - max_batch (int) – Maximum files per HuggingFace commit.
  - max_retries (int) – Per-batch retry budget on 429/503 responses.
  - repo_type (str) – HF repo type; default "dataset".
  - create_pr (bool) – Open PRs instead of pushing to the default branch.
  - revision (Optional[str]) – HF revision name; defaults to the repo’s default branch.
  - dry_run (bool) – Skip the network call. Shards stay in pending/; budget slots are still reserved.
  - commit_prefix (str) – Prefix for the auto-generated commit message.
- Returns:
SyncReport – Summary of what landed, what remains, and what
failed.
- Return type: