Database Interface#
What’s in this notebook?
This notebook introduces the stringforge.cy_io module, which provides a
unified interface for loading Calabi-Yau geometry data from a hosted
HuggingFace dataset. The database is organised into sub-datasets:
- tdf: toric models from the Kreuzer-Skarke (KS) list, identified by (ks_id, triang_id).
- cicy: Complete Intersection Calabi-Yau models, identified by cicy_id.
A lightweight catalog is downloaded once and cached locally. Individual geometry records, Gopakumar-Vafa invariants, conifold data, and additional model properties are fetched on demand — only the files you actually need are downloaded.
Prerequisites: pip install pandas pyarrow huggingface-hub
The database is hosted at aschachner/cy-database on HuggingFace. You can override the repository with the STRINGFORGE_HF_REPO environment variable.
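A minimal sketch of how such an override could be resolved, assuming the library simply reads the environment variable and falls back to the default repository. The helper name resolve_repo is hypothetical and not part of stringforge; the actual lookup may differ.

```python
import os

DEFAULT_REPO = "aschachner/cy-database"  # repository named in the text above


def resolve_repo(env=None):
    """Return the repository id, preferring STRINGFORGE_HF_REPO if it is set.

    Hypothetical helper for illustration; not the library's own code.
    """
    env = os.environ if env is None else env
    # In this sketch an empty string counts as "not set".
    return env.get("STRINGFORGE_HF_REPO") or DEFAULT_REPO


print(resolve_repo({}))
print(resolve_repo({"STRINGFORGE_HF_REPO": "my-org/my-cy-fork"}))
```

If the library follows this pattern, the variable has to be set before the first database handle is created, for example in the shell that launches the notebook.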
Imports#
# General imports
import warnings
import numpy as np
# JAX imports
import jax
import jax.numpy as jnp
jax.config.update("jax_enable_x64", True)
# StringForge model-loading bridge
import stringforge as sf
from stringforge import LCSDatabase
from stringforge.lcs_database import load_tdf_model, load_cicy_model
from stringforge.cy_io import query_models

# Convenience wrappers that pin the sub-dataset
def TDFDatabase(**kwargs):
    return LCSDatabase(dataset="tdf", **kwargs)

def CICYDatabase(**kwargs):
    return LCSDatabase(dataset="cicy", **kwargs)

# Silence all warnings to keep the notebook output clean
warnings.filterwarnings('ignore')
Setup#
All model loading goes through an LCSDatabase instance (or the
convenience wrappers TDFDatabase / CICYDatabase defined in the Imports cell). The constructor takes three
optional arguments:
| Argument | Default | Description |
|---|---|---|
| dataset | | Sub-dataset identifier |
| | | Local cache for downloaded files |
| offline | | Serve only from cache; raise on any miss |
The catalog is downloaded automatically on first use and then served from disk on all subsequent calls.
# Create a database handle for the tdf sub-dataset
db = TDFDatabase()
print(db)
# For CICY models, use CICYDatabase (or LCSDatabase(dataset="cicy"))
db_cicy = CICYDatabase()
print(db_cicy)
Discovering models#
Overview#
db.info() prints a summary of the sub-dataset: total model count,
Hodge-number ranges, and the fraction of models with GV invariants and
conifold data available. This method operates entirely on the in-memory
catalog — no geometry data is downloaded.
db.info()
# Expected output (illustrative):
# CYDatabase — sub-dataset: 'tdf'
# Total models : 1,247,832
# h11 range : 1 – 491
# h12 range : 1 – 252
# With GV data : 85,421 (6.8%)
# With conifolds: 43,209 (3.5%)
# Cache dir : /path/to/.stringforge_cache/tdf
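To make the summary concrete, here is a small sketch of how such figures could be derived from catalog rows alone. The function summarize, the column names has_gv and n_conifolds, and the toy rows are assumptions for illustration; they are not taken from the actual catalog schema.

```python
def summarize(rows):
    """Compute an info()-style summary from a list of catalog rows (dicts)."""
    total = len(rows)
    h11 = [row["h11"] for row in rows]
    h12 = [row["h12"] for row in rows]
    n_gv = sum(1 for row in rows if row["has_gv"])
    n_cf = sum(1 for row in rows if row["n_conifolds"] > 0)
    return {
        "total": total,
        "h11_range": (min(h11), max(h11)),
        "h12_range": (min(h12), max(h12)),
        "gv_fraction": n_gv / total,
        "conifold_fraction": n_cf / total,
    }


# Made-up toy catalog, for illustration only
toy_catalog = [
    {"h11": 3, "h12": 2, "has_gv": True, "n_conifolds": 1},
    {"h11": 5, "h12": 2, "has_gv": False, "n_conifolds": 0},
    {"h11": 7, "h12": 3, "has_gv": True, "n_conifolds": 0},
    {"h11": 2, "h12": 4, "has_gv": False, "n_conifolds": 2},
]
summary = summarize(toy_catalog)
print(summary)
```

Nothing in this computation touches geometry records, which is consistent with info() working purely on the cached catalog.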
Querying the catalog#
db.query() filters the in-memory catalog and returns a pandas.DataFrame
of matching rows. Supported filters:
- h11, h12: exact Hodge number match
- has_conifolds: restrict to models with at least one conifold limit
- has_gv: restrict to models with stored GV invariants
- D3_tadpole_max: upper bound on the D3-tadpole \(\chi/24\)
- Any other catalog column as a keyword argument (exact match)
# All two-modulus models (h12 = 2)
df = db.query(h12=2)
print(f"Models with h12=2: {len(df)}")
df.head()
# Two-modulus models with a conifold limit and D3-tadpole <= 48
df_cf = db.query(h12=2, has_conifolds=True, D3_tadpole_max=48)
print(f"Models matching filter: {len(df_cf)}")
df_cf.head()
# Module-level shortcut (no explicit database object needed)
df_all = query_models(dataset="tdf", h12=2)
print(f"Total h12=2 models: {len(df_all)}")
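The filter semantics listed in this section can be sketched in plain Python. This is only an illustration under stated assumptions: filter_catalog is a hypothetical stand-in for db.query(), the catalog is modelled as a list of dicts rather than a pandas.DataFrame, and the column names has_gv, n_conifolds and D3_tadpole as well as the toy rows are made up.

```python
def filter_catalog(rows, has_conifolds=None, has_gv=None,
                   D3_tadpole_max=None, **exact):
    """Return the rows that pass every requested filter."""
    matches = []
    for row in rows:
        # has_conifolds=True keeps models with at least one conifold limit
        if has_conifolds and row["n_conifolds"] < 1:
            continue
        # has_gv=True keeps models with stored GV invariants
        if has_gv and not row["has_gv"]:
            continue
        # D3_tadpole_max is an inclusive upper bound
        if D3_tadpole_max is not None and row["D3_tadpole"] > D3_tadpole_max:
            continue
        # Remaining keywords (h11, h12, any other column) are exact matches
        if any(row[col] != value for col, value in exact.items()):
            continue
        matches.append(row)
    return matches


# Made-up toy catalog, for illustration only
toy_catalog = [
    {"ks_id": 1, "triang_id": 0, "h11": 3, "h12": 2,
     "has_gv": True, "n_conifolds": 1, "D3_tadpole": 24},
    {"ks_id": 2, "triang_id": 0, "h11": 4, "h12": 2,
     "has_gv": False, "n_conifolds": 0, "D3_tadpole": 36},
    {"ks_id": 3, "triang_id": 1, "h11": 5, "h12": 2,
     "has_gv": True, "n_conifolds": 2, "D3_tadpole": 60},
    {"ks_id": 4, "triang_id": 0, "h11": 6, "h12": 3,
     "has_gv": True, "n_conifolds": 1, "D3_tadpole": 30},
]
hits = filter_catalog(toy_catalog, h12=2, has_conifolds=True, D3_tadpole_max=48)
print([row["ks_id"] for row in hits])  # → [1]
```

All filters combine with a logical AND, so adding a keyword can only shrink the result.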
Loading a single model#
db.load() fetches geometry data for a single model and returns an
lcs_tree object. For tdf models both ks_id and triang_id are
required; for cicy models only cicy_id is needed.
Optional arguments control which additional data splits are fetched:
| Argument | Default | Effect |
|---|---|---|
| include_gv | | Attach GV / GW invariants |
| include_conifolds | | Attach conifold data (…) |
| | | Populate … |
| extra_fields | | Restrict … |
| maximum_degree | | Truncate GV invariants to this degree |
# Load a model by its KS identifier — topology data only
ks_id, triang_id = int(df.iloc[0]["ks_id"]), int(df.iloc[0]["triang_id"])
tree = db.load(ks_id=ks_id, triang_id=triang_id)
print(tree)
print(f"h11 = {tree.h11}, h12 = {tree.h12}, chi = {tree.chi}")
print(f"Intersection numbers (COO):\n{tree.intnums_coo}")
# Inspect the extra_data dict — always populated with the primary key and
# any additional precomputed properties stored in the 'extra' split
print(tree.extra_data)
# Load the same model together with GV invariants (truncated to degree 3)
tree_gv = db.load(ks_id=ks_id, triang_id=triang_id,
                  include_gv=True, maximum_degree=3)
print(f"GV charges shape : {tree_gv.gvs['charges'].shape}")
print(f"GV invariants : {tree_gv.gvs['invariants'][:5]}")
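The effect of maximum_degree can be illustrated with a short sketch. It assumes that the degree of a curve class is the sum of its charge entries, which may not be the grading the library uses, and it works on plain lists rather than arrays. The function truncate_gvs is hypothetical. The toy input uses the well-known genus-zero GV invariants of the quintic threefold (h11 = 1).

```python
def truncate_gvs(gvs, maximum_degree):
    """Keep only curve classes whose degree is at most maximum_degree.

    Assumption: the degree of a class is the sum of its charge entries.
    """
    kept = [
        (charge, inv)
        for charge, inv in zip(gvs["charges"], gvs["invariants"])
        if sum(charge) <= maximum_degree
    ]
    return {
        "charges": [charge for charge, _ in kept],
        "invariants": [inv for _, inv in kept],
    }


# Genus-zero GV invariants of the quintic threefold, degrees 1 to 4
quintic = {
    "charges": [[1], [2], [3], [4]],
    "invariants": [2875, 609250, 317206375, 242467530000],
}
truncated = truncate_gvs(quintic, maximum_degree=3)
print(truncated["invariants"])  # → [2875, 609250, 317206375]
```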
# Load a model that has a conifold limit
row_cf = df_cf.iloc[0]
ks_id_cf, triang_id_cf = int(row_cf["ks_id"]), int(row_cf["triang_id"])
tree_cf = db.load(ks_id=ks_id_cf, triang_id=triang_id_cf,
                  include_conifolds=True)
print(f"ncf = {tree_cf.ncf}")
print(f"conifold_curve = {tree_cf.conifold_curve}")
# If a model has multiple conifold limits, pass include_conifolds="all"
# to obtain a list of trees — one per conifold
trees_all_cf = db.load(ks_id=ks_id_cf, triang_id=triang_id_cf,
                       include_conifolds="all")
print(f"Number of conifold trees: {len(trees_all_cf)}")
for t in trees_all_cf:
    print(f" ncf={t.ncf}, conifold_curve={t.conifold_curve}")
Loading a FluxVacuaFinder model#
db.load_model() is a convenience wrapper that calls db.load() and
immediately constructs a JAXVacua FluxVacuaFinder object. Additional
keyword arguments are forwarded to FluxVacuaFinder.__init__, so you can
pass Q, limit, gauge_choice, etc. directly.
# Load directly as a FluxVacuaFinder — ready to use for vacuum search
model = db.load_model(ks_id=ks_id, triang_id=triang_id, Q=24)
print(model)
print(f"D3_tadpole = {model.D3_tadpole}")
print(f"n_fluxes = {model.n_fluxes}")
# Load with GV corrections enabled
model_gv = db.load_model(ks_id=ks_id, triang_id=triang_id,
                         include_gv=True, maximum_degree=3, Q=24)
print(f"use_gvs = {model_gv.use_gvs}")
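The keyword forwarding described in this section follows a common wrapper pattern, sketched here with stand-ins. StubFinder, stub_load, the free function load_model and the LOAD_KEYS set are all hypothetical; how the real db.load_model() decides which keywords belong to db.load() and which to FluxVacuaFinder.__init__ is an assumption.

```python
# Keywords that this sketch treats as belonging to the loader (assumption)
LOAD_KEYS = {"ks_id", "triang_id", "cicy_id", "include_gv",
             "include_conifolds", "maximum_degree", "extra_fields"}


class StubFinder:
    """Stand-in for FluxVacuaFinder; it only records what it was given."""

    def __init__(self, tree, **options):
        self.tree = tree
        self.options = options


def stub_load(**load_kwargs):
    """Stand-in for db.load(); returns the request instead of a geometry."""
    return dict(load_kwargs)


def load_model(load, finder_cls, **kwargs):
    """Split kwargs: loader keys go to load(), the rest to the finder."""
    load_kwargs = {k: v for k, v in kwargs.items() if k in LOAD_KEYS}
    finder_kwargs = {k: v for k, v in kwargs.items() if k not in LOAD_KEYS}
    tree = load(**load_kwargs)
    return finder_cls(tree, **finder_kwargs)


model = load_model(stub_load, StubFinder,
                   ks_id=1, triang_id=0, include_gv=True, Q=24)
print(model.tree)     # loader received the identifiers and include_gv
print(model.options)  # finder received only Q
```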
Batch loading and random sampling#
For large-scale studies, db.load_batch() and db.sample() return lists
of lcs_tree objects. load_batch() accepts a pandas.DataFrame (e.g.
the output of db.query()) or a list of tuples.
# Load the first 5 models from a query result
trees = db.load_batch(df.head(5))
print(f"Loaded {len(trees)} trees")
for t in trees:
    print(f" h11={t.h11}, h12={t.h12}, chi={t.chi}")
# Load from an explicit list of (ks_id, triang_id) tuples
ids = [(int(df.iloc[i]["ks_id"]), int(df.iloc[i]["triang_id"])) for i in range(3)]
trees_explicit = db.load_batch(ids)
print(f"Loaded {len(trees_explicit)} trees from explicit id list")
# Load all models matching a query — no separate query() call needed
trees_h12_2 = db.load_batch(h12=2, include_gv=True, maximum_degree=3)
print(f"Loaded {len(trees_h12_2)} h12=2 models with GV data")
# Draw 10 random h12=2 models — reproducible with seed
sampled = db.sample(n=10, h12=2, seed=42)
print(f"Sampled {len(sampled)} trees")
print(f"h12 values: {[t.h12 for t in sampled]}")
# Sample with conifold data attached
sampled_cf = db.sample(n=5, h12=2, has_conifolds=True,
                       include_conifolds=True, seed=0)
for t in sampled_cf:
    print(f" ncf={t.ncf}, conifold_curve={t.conifold_curve}")
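Reproducible sampling with a seed can be sketched as follows, assuming the sample is drawn without replacement from the identifiers that pass the filters. The helper sample_ids is hypothetical, and the random number generator that stringforge actually uses is not specified in this notebook.

```python
import random


def sample_ids(ids, n, seed=None):
    """Draw up to n distinct identifiers, reproducibly for a fixed seed."""
    rng = random.Random(seed)  # private generator; global state is untouched
    return rng.sample(ids, min(n, len(ids)))


# Made-up (ks_id, triang_id) pairs standing in for a query result
ids = [(ks_id, 0) for ks_id in range(100)]
first = sample_ids(ids, n=10, seed=42)
second = sample_ids(ids, n=10, seed=42)
print(first == second)  # → True
```

Using a private generator keeps the draw independent of any other code that touches the global random state.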
Working with extra_data#
Every loaded lcs_tree carries an extra_data dict populated with the
primary-key identifiers and any additional precomputed properties stored in
the extra split of the dataset. You can merge in your own computed results
using lcs_tree.update().
# Inspect extra_data — always contains the primary key at minimum
print(tree.extra_data)
# e.g. {'ks_id': 12345, 'triang_id': 0, 'h11': 3, 'h12': 2,
# 'chi': -12, 'D3_tadpole': 24, 'n_conifolds': 1, ...}
# Load only specific extra columns to keep memory usage low
tree_slim = db.load(ks_id=ks_id, triang_id=triang_id,
                    extra_fields=["ks_id", "triang_id", "D3_tadpole"])
print(tree_slim.extra_data)
# Annotate a tree with your own computed result
my_result = {"W0_min": 5.547e-5, "source": "arXiv:2501.03984"}
tree_annotated = tree.update(extra_data={**tree.extra_data, **my_result})
print(tree_annotated.extra_data["W0_min"])
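The annotation cell assigns the result of tree.update() to a new name, which suggests that update() returns a modified copy instead of mutating the tree in place. The sketch below illustrates that pattern with a frozen dataclass; the Tree class is a hypothetical stand-in, and whether lcs_tree is actually implemented this way is an assumption.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Tree:
    """Minimal stand-in for lcs_tree with only the fields used here."""
    h11: int
    h12: int
    extra_data: dict = field(default_factory=dict)

    def update(self, **changes):
        # Return a copy with the given fields replaced; self stays unchanged
        return replace(self, **changes)


tree = Tree(h11=3, h12=2, extra_data={"ks_id": 12345, "triang_id": 0})
# Later keys win in a dict merge, so new results override stored values
annotated = tree.update(extra_data={**tree.extra_data, "W0_min": 5.547e-5})
print(tree.extra_data)
print(annotated.extra_data)
```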
Offline mode#
On HPC clusters or systems without internet access, pass offline=True.
The database will serve all requests from the local cache and raise a
FileNotFoundError if a required file has not been downloaded yet.
The recommended workflow is:
1. Run once with offline=False (on a machine with internet) to warm the cache for the models and splits you need.
2. Copy ./.stringforge_cache/ to the HPC node, or point STRINGFORGE_DATA_DIR at a shared read-only cache.
3. Use offline=True on the HPC node.
# Create an offline database handle pointing at a pre-populated cache
db_offline = TDFDatabase(offline=True)
# This succeeds if the catalog is already cached
df_offline = db_offline.query(h12=2)
print(f"Found {len(df_offline)} h12=2 models in local cache")
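The cache-then-offline behaviour described in this section can be sketched with the standard library. The function fetch, the fake downloader and the file names are hypothetical; only the documented behaviour (serve from cache, raise FileNotFoundError on a miss when offline) is modelled.

```python
import tempfile
from pathlib import Path


def fetch(name, cache_dir, download, offline=False):
    """Return the cached path of a file, downloading it only if allowed."""
    path = Path(cache_dir) / name
    if path.exists():
        return path  # cache hit: no network access needed
    if offline:
        raise FileNotFoundError(f"{name} is not cached and offline=True")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(download(name))
    return path


calls = []


def fake_download(name):
    """Stand-in for a network download; records which files were requested."""
    calls.append(name)
    return b"placeholder-bytes"


with tempfile.TemporaryDirectory() as cache_dir:
    fetch("catalog.parquet", cache_dir, fake_download)                # warms the cache
    fetch("catalog.parquet", cache_dir, fake_download, offline=True)  # served from disk
    try:
        fetch("shard-000.parquet", cache_dir, fake_download, offline=True)
        missed = False
    except FileNotFoundError:
        missed = True

print(calls, missed)  # → ['catalog.parquet'] True
```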
Cache management#
clear_cache() deletes all cached shard files from disk. By default,
the catalog and any stored vacuum solutions are preserved.
To scan many models without filling the disk, use cache_mode="none":
each shard is downloaded, the needed row is extracted, and the file is
deleted immediately.
# Delete all cached shards (catalog is preserved)
# db.clear_cache()
# For scanning millions of models without disk accumulation:
# db_lean = TDFDatabase(cache_mode="none")
# tree = db_lean.load(ks_id=ks_id, triang_id=triang_id) # shard deleted after read
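The read-then-delete behaviour of cache_mode="none" can be illustrated with a small sketch. The helper read_row_and_discard is hypothetical, and a JSON file stands in for a real shard, whose on-disk format is not described in this notebook.

```python
import json
import tempfile
from pathlib import Path


def read_row_and_discard(shard_path, key):
    """Extract one row from a shard file, then delete the file."""
    shard_path = Path(shard_path)
    try:
        with shard_path.open() as fh:
            rows = json.load(fh)
        return next(row for row in rows if row["ks_id"] == key)
    finally:
        # The shard is removed whether or not the row was found
        shard_path.unlink()


with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "shard.json"
    # Made-up shard contents, for illustration only
    shard.write_text(json.dumps([{"ks_id": 1, "h12": 2}, {"ks_id": 2, "h12": 3}]))
    row = read_row_and_discard(shard, key=2)
    still_there = shard.exists()

print(row, still_there)  # → {'ks_id': 2, 'h12': 3} False
```

The trade-off is that every load pays the download cost again, so this mode suits one-pass scans better than repeated access to the same models.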
Module-level convenience functions#
For one-off loads without creating an LCSDatabase object, three module-level
functions are available:
# Load a single tdf model by its KS identifiers
tree_quick = load_tdf_model(ks_id=ks_id, triang_id=triang_id)
print(f"h11={tree_quick.h11}, h12={tree_quick.h12}")
# Load a single CICY model
tree_cicy = load_cicy_model(cicy_id=7890)
print(f"h11={tree_cicy.h11}, h12={tree_cicy.h12}")
# Query without instantiating a database object
df_quick = query_models(dataset="tdf", h12=2, has_conifolds=True)
print(f"h12=2 models with conifolds: {len(df_quick)}")