Database Interface#

What’s in this notebook?

This notebook introduces the stringforge.cy_io module, which provides a unified interface for loading Calabi-Yau geometry data from a hosted HuggingFace dataset. The database is organised into sub-datasets:

tdf — toric models from the Kreuzer-Skarke (KS) list, identified by (ks_id, triang_id).
cicy — Complete Intersection Calabi-Yau models, identified by cicy_id.

A lightweight catalog is downloaded once and cached locally. Individual geometry records, Gopakumar-Vafa invariants, conifold data, and additional model properties are fetched on demand — only the files you actually need are downloaded.

Prerequisites: pip install pandas pyarrow huggingface-hub

The database is hosted at aschachner/cy-database on HuggingFace. You can override the repository with the STRINGFORGE_HF_REPO environment variable.
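
As a minimal sketch of the override pattern, the snippet below resolves the repository name from the environment and falls back to the default. How stringforge reads the variable internally is not shown in this notebook, so the helper `resolve_repo` and the repository name `my-org/my-cy-fork` are hypothetical and serve only as illustration.

```python
import os

DEFAULT_REPO = "aschachner/cy-database"  # default repository named above


def resolve_repo(environ=None):
    """Hypothetical helper: return the override if set, else the default."""
    environ = os.environ if environ is None else environ
    # Assumption of this sketch: an empty string counts as "not set".
    return environ.get("STRINGFORGE_HF_REPO") or DEFAULT_REPO


# Passing a plain dict keeps the example free of side effects.
print(resolve_repo({}))
print(resolve_repo({"STRINGFORGE_HF_REPO": "my-org/my-cy-fork"}))
```

In practice the variable would most likely need to be set before the database handle is created, for example with `export STRINGFORGE_HF_REPO=...` in the shell.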

Imports#

# General imports
import warnings
import numpy as np

# JAX imports
import jax
import jax.numpy as jnp
jax.config.update("jax_enable_x64", True)

# StringForge model-loading bridge
import stringforge as sf
from stringforge import LCSDatabase
from stringforge.lcs_database import load_tdf_model, load_cicy_model
from stringforge.cy_io import query_models

# Convenience wrappers that bind LCSDatabase to one sub-dataset
def TDFDatabase(**kwargs):
    return LCSDatabase(dataset="tdf", **kwargs)

def CICYDatabase(**kwargs):
    return LCSDatabase(dataset="cicy", **kwargs)

warnings.filterwarnings('ignore')

Setup#

All database access for model loading goes through an LCSDatabase instance (or the convenience wrappers TDFDatabase / CICYDatabase used in this notebook). The constructor takes four optional arguments:

Argument	Default	Description
`dataset`	`"tdf"`	Sub-dataset identifier
`cache_dir`	`./.stringforge_cache`	Local cache for downloaded files
`offline`	`False`	Serve only from cache; raise on any miss
`cache_mode`	`"persistent"`	`"persistent"` keeps shards on disk; `"none"` deletes after each read

The catalog is downloaded automatically on first use and then served from disk on all subsequent calls.

# Create a database handle for the tdf sub-dataset
db = TDFDatabase()
print(db)

# For CICY models, use CICYDatabase (or LCSDatabase(dataset="cicy"))
db_cicy = CICYDatabase()
print(db_cicy)

Discovering models#

Overview#

db.info() prints a summary of the sub-dataset: total model count, Hodge-number ranges, and the fraction of models with GV invariants and conifold data available. This method operates entirely on the in-memory catalog — no geometry data is downloaded.

db.info()
# Expected output (illustrative):
# CYDatabase — sub-dataset: 'tdf'
#   Total models : 1,247,832
#   h11 range    : 1 – 491
#   h12 range    : 1 – 252
#   With GV data : 85,421  (6.8%)
#   With conifolds: 43,209  (3.5%)
#   Cache dir    : /path/to/.stringforge_cache/tdf

Querying the catalog#

db.query() filters the in-memory catalog and returns a pandas.DataFrame of matching rows. Supported filters:

h11, h12 — exact Hodge number match
has_conifolds — restrict to models with at least one conifold limit
has_gv — restrict to models with stored GV invariants
D3_tadpole_max — upper bound on the D3-tadpole \(\chi/24\)
Any other catalog column as a keyword argument (exact match)
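
The way these filters combine can be pictured with a plain-Python stand-in. This is a sketch of the semantics only, not stringforge code: the real catalog is a pandas.DataFrame, the function `filter_catalog` is hypothetical, and the rows and the column names `n_conifolds`, `has_gv`, and `D3_tadpole` are made up for illustration. All filters are assumed to be combined with logical AND, and the tadpole bound is taken to be inclusive.

```python
def filter_catalog(rows, has_conifolds=None, has_gv=None,
                   D3_tadpole_max=None, **exact):
    """Toy stand-in for the filter semantics; not stringforge code."""
    selected = []
    for row in rows:
        if has_conifolds and row["n_conifolds"] < 1:
            continue
        if has_gv and not row["has_gv"]:
            continue
        if D3_tadpole_max is not None and row["D3_tadpole"] > D3_tadpole_max:
            continue
        # Every remaining keyword is an exact match on a catalog column.
        if any(row[col] != val for col, val in exact.items()):
            continue
        selected.append(row)
    return selected


# Made-up rows with illustrative column names.
catalog = [
    {"ks_id": 1, "triang_id": 0, "h12": 2, "n_conifolds": 1, "has_gv": True, "D3_tadpole": 24},
    {"ks_id": 2, "triang_id": 0, "h12": 2, "n_conifolds": 0, "has_gv": True, "D3_tadpole": 36},
    {"ks_id": 3, "triang_id": 1, "h12": 3, "n_conifolds": 2, "has_gv": False, "D3_tadpole": 60},
]

hits = filter_catalog(catalog, h12=2, has_conifolds=True, D3_tadpole_max=48)
print([r["ks_id"] for r in hits])  # -> [1]
```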

# All two-modulus models (h12 = 2)
df = db.query(h12=2)
print(f"Models with h12=2: {len(df)}")
df.head()

# Two-modulus models with a conifold limit and D3-tadpole <= 48
df_cf = db.query(h12=2, has_conifolds=True, D3_tadpole_max=48)
print(f"Models matching filter: {len(df_cf)}")
df_cf.head()

# Module-level shortcut (no explicit database object needed)
df_all = query_models(dataset="tdf", h12=2)
print(f"Total h12=2 models: {len(df_all)}")

Loading a single model#

db.load() fetches geometry data for a single model and returns an lcs_tree object. For tdf models both ks_id and triang_id are required; for cicy models only cicy_id is needed.

Optional arguments control which additional data splits are fetched:

Argument	Default	Effect
`include_gv`	`False`	Attach GV / GW invariants
`include_conifolds`	`False`	Attach conifold data (`True` = first conifold, `"all"` = list of trees)
`include_extra_data`	`True`	Populate `lcs_tree.extra_data`
`extra_fields`	`None`	Restrict `extra_data` to these columns
`maximum_degree`	`None`	Truncate GV invariants to this degree

# Load a model by its KS identifier — topology data only
ks_id, triang_id = int(df.iloc[0]["ks_id"]), int(df.iloc[0]["triang_id"])

tree = db.load(ks_id=ks_id, triang_id=triang_id)
print(tree)
print(f"h11 = {tree.h11},  h12 = {tree.h12},  chi = {tree.chi}")
print(f"Intersection numbers (COO):\n{tree.intnums_coo}")

# Inspect the extra_data dict — always populated with the primary key and
# any additional precomputed properties stored in the 'extra' split
print(tree.extra_data)

# Load the same model together with GV invariants (truncated to degree 3)
tree_gv = db.load(ks_id=ks_id, triang_id=triang_id,
                  include_gv=True, maximum_degree=3)
print(f"GV charges shape : {tree_gv.gvs['charges'].shape}")
print(f"GV invariants    : {tree_gv.gvs['invariants'][:5]}")

# Load a model that has a conifold limit
row_cf = df_cf.iloc[0]
ks_id_cf, triang_id_cf = int(row_cf["ks_id"]), int(row_cf["triang_id"])

tree_cf = db.load(ks_id=ks_id_cf, triang_id=triang_id_cf,
                  include_conifolds=True)
print(f"ncf              = {tree_cf.ncf}")
print(f"conifold_curve   = {tree_cf.conifold_curve}")

# If a model has multiple conifold limits, pass include_conifolds="all"
# to obtain a list of trees — one per conifold
trees_all_cf = db.load(ks_id=ks_id_cf, triang_id=triang_id_cf,
                       include_conifolds="all")
print(f"Number of conifold trees: {len(trees_all_cf)}")
for t in trees_all_cf:
    print(f"  ncf={t.ncf},  conifold_curve={t.conifold_curve}")
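
Because include_conifolds=True yields a single tree while include_conifolds="all" yields a list, downstream code may want one uniform shape. The helper below is a hypothetical convenience, not part of stringforge, and uses strings in place of lcs_tree objects so that it runs on its own.

```python
def as_tree_list(result):
    """Hypothetical helper: wrap a single tree in a list, pass lists through."""
    if isinstance(result, (list, tuple)):
        return list(result)
    return [result]


# Strings stand in for lcs_tree objects so the sketch runs without stringforge.
single = "tree_cf"
several = ["tree_cf_0", "tree_cf_1"]

for t in as_tree_list(single) + as_tree_list(several):
    print(t)
```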

Loading a `FluxVacuaFinder` model#

db.load_model() is a convenience wrapper that calls db.load() and immediately constructs a JAXVacua FluxVacuaFinder object. Additional keyword arguments are forwarded to FluxVacuaFinder.__init__, so you can pass Q, limit, gauge_choice, etc. directly.

# Load directly as a FluxVacuaFinder — ready to use for vacuum search
model = db.load_model(ks_id=ks_id, triang_id=triang_id, Q=24)
print(model)
print(f"D3_tadpole = {model.D3_tadpole}")
print(f"n_fluxes   = {model.n_fluxes}")

# Load with GV corrections enabled
model_gv = db.load_model(ks_id=ks_id, triang_id=triang_id,
                         include_gv=True, maximum_degree=3, Q=24)
print(f"use_gvs = {model_gv.use_gvs}")

Batch loading and random sampling#

For large-scale studies, db.load_batch() and db.sample() return lists of lcs_tree objects. load_batch() accepts a pandas.DataFrame (e.g. the output of db.query()), a list of identifier tuples, or query filters passed directly as keyword arguments.

# Load the first 5 models from a query result
trees = db.load_batch(df.head(5))
print(f"Loaded {len(trees)} trees")
for t in trees:
    print(f"  h11={t.h11}, h12={t.h12}, chi={t.chi}")

# Load from an explicit list of (ks_id, triang_id) tuples
# head(3) avoids an IndexError if the query returned fewer than three rows
ids = [(int(r.ks_id), int(r.triang_id)) for r in df.head(3).itertuples()]
trees_explicit = db.load_batch(ids)
print(f"Loaded {len(trees_explicit)} trees from explicit id list")

# Load all models matching a query — no separate query() call needed
trees_h12_2 = db.load_batch(h12=2, include_gv=True, maximum_degree=3)
print(f"Loaded {len(trees_h12_2)} h12=2 models with GV data")

# Draw 10 random h12=2 models — reproducible with seed
sampled = db.sample(n=10, h12=2, seed=42)
print(f"Sampled {len(sampled)} trees")
print(f"h12 values: {[t.h12 for t in sampled]}")

# Sample with conifold data attached
sampled_cf = db.sample(n=5, h12=2, has_conifolds=True,
                       include_conifolds=True, seed=0)
for t in sampled_cf:
    print(f"  ncf={t.ncf},  conifold_curve={t.conifold_curve}")
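
The reproducibility that seed provides can be illustrated with the standard library alone. This is a sketch under assumptions: the notebook does not say which random number generator db.sample() uses internally, and the function `sample_ids` and the identifier pairs below are made up.

```python
import random


def sample_ids(ids, n, seed=None):
    """Toy sampler: draw up to n distinct ids; same seed gives same draw."""
    rng = random.Random(seed)   # private generator, global state untouched
    n = min(n, len(ids))        # assumption: cap at the number available
    return rng.sample(ids, n)


# Made-up (ks_id, triang_id) pairs standing in for a query result.
ids = [(k, t) for k in range(100, 110) for t in range(2)]

first = sample_ids(ids, 5, seed=42)
second = sample_ids(ids, 5, seed=42)
print(first == second)  # True: identical seed, identical selection
```

Using a private `random.Random` instance rather than the module-level functions keeps the draw independent of any other code that touches the global generator.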

Working with `extra_data`#

Every loaded lcs_tree carries an extra_data dict populated with the primary-key identifiers and any additional precomputed properties stored in the extra split of the dataset. You can merge in your own computed results using lcs_tree.update().

# Inspect extra_data — always contains the primary key at minimum
print(tree.extra_data)
# e.g. {'ks_id': 12345, 'triang_id': 0, 'h11': 3, 'h12': 2,
#        'chi': -12, 'D3_tadpole': 24, 'n_conifolds': 1, ...}

# Load only specific extra columns to keep memory usage low
tree_slim = db.load(ks_id=ks_id, triang_id=triang_id,
                    extra_fields=["ks_id", "triang_id", "D3_tadpole"])
print(tree_slim.extra_data)

# Annotate a tree with your own computed result
my_result = {"W0_min": 5.547e-5, "source": "arXiv:2501.03984"}
tree_annotated = tree.update(extra_data={**tree.extra_data, **my_result})
print(tree_annotated.extra_data["W0_min"])
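
The merge pattern used above can be tried without the database. In the sketch below, `ToyTree` is a hypothetical stand-in for lcs_tree, and its update() is assumed to return a modified copy (consistent with the result being assigned to a new name above, but not confirmed by this notebook).

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class ToyTree:
    """Stand-in for lcs_tree; only the fields needed for the sketch."""
    h12: int
    extra_data: dict = field(default_factory=dict)

    def update(self, **changes):
        # Return a modified copy and leave the original untouched.
        return replace(self, **changes)


toy = ToyTree(h12=2, extra_data={"ks_id": 12345, "triang_id": 0})
my_result = {"W0_min": 5.547e-5}

# Later keys win in a dict merge, so my_result would override any stored
# value that shares a name.
annotated = toy.update(extra_data={**toy.extra_data, **my_result})

print(sorted(annotated.extra_data))
print("W0_min" in toy.extra_data)  # False: the original is unchanged
```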

Offline mode#

On HPC clusters or systems without internet access, pass offline=True. The database will serve all requests from the local cache and raise a FileNotFoundError if a required file has not been downloaded yet.

The recommended workflow is:

Run once with offline=False (on a machine with internet) to warm the cache for the models and splits you need.
Copy ./.stringforge_cache/ to the HPC node, or point STRINGFORGE_DATA_DIR at a shared read-only cache.
Use offline=True on the HPC node.
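
The workflow above relies on a simple cache-first lookup, which can be sketched as follows. This is not the stringforge implementation: the function `fetch`, the stand-in `fake_download`, and the file name `tdf/catalog.parquet` are hypothetical.

```python
import tempfile
from pathlib import Path


def fetch(relpath, cache_dir, offline=False, download=None):
    """Toy cache lookup: return the cached path, downloading on a miss."""
    target = Path(cache_dir) / relpath
    if target.exists():
        return target                      # cache hit, no network needed
    if offline:
        raise FileNotFoundError(f"{relpath} not in cache and offline=True")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(download(relpath))  # warm the cache for next time
    return target


def fake_download(relpath):
    # Stand-in for a network request.
    return b"catalog bytes"


with tempfile.TemporaryDirectory() as cache:
    try:
        fetch("tdf/catalog.parquet", cache, offline=True)
    except FileNotFoundError as err:
        print("miss:", err)
    fetch("tdf/catalog.parquet", cache, download=fake_download)  # warm the cache
    path = fetch("tdf/catalog.parquet", cache, offline=True)     # reuse offline
    cached_bytes = path.read_bytes()

print(cached_bytes)
```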

# Create an offline database handle pointing at a pre-populated cache
db_offline = TDFDatabase(offline=True)

# This succeeds if the catalog is already cached
df_offline = db_offline.query(h12=2)
print(f"Found {len(df_offline)} h12=2 models in local cache")

Cache management#

clear_cache() deletes all cached shard files from disk. By default, the catalog and any stored vacuum solutions are preserved.

For disk-efficient scanning of many models, use cache_mode="none": each shard is downloaded, the needed row is extracted, and the file is deleted immediately.

# Delete all cached shards (catalog is preserved)
# db.clear_cache()

# For scanning millions of models without disk accumulation:
# db_lean = TDFDatabase(cache_mode="none")
# tree = db_lean.load(ks_id=ks_id, triang_id=triang_id)  # shard deleted after read
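
The download, extract, delete cycle behind cache_mode="none" can be pictured with a small context manager. This is a sketch of the idea under assumptions, not the library's code: `transient_shard`, `fake_download`, the shard name, and the placeholder content are all hypothetical.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def transient_shard(download, cache_dir, name):
    """Toy context manager: materialise a shard, delete it after use."""
    path = Path(cache_dir) / name
    path.write_bytes(download(name))
    try:
        yield path
    finally:
        path.unlink(missing_ok=True)  # nothing accumulates on disk


def fake_download(name):
    # Stand-in for fetching one shard; the content is a placeholder.
    return b"row-0\nrow-1\nrow-2\n"


with tempfile.TemporaryDirectory() as cache:
    with transient_shard(fake_download, cache, "shard_000.bin") as shard:
        wanted = shard.read_bytes().splitlines()[1]  # extract the needed row
    leftover = list(Path(cache).iterdir())

print(wanted, leftover)
```

The trade-off is that a shard holding several models of interest would be downloaded again for each one, so this mode suits wide scans better than repeated access to the same region of the dataset.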

Module-level convenience functions#

For one-off loads without creating an LCSDatabase object, three module-level functions are available:

# Load a single tdf model by its KS identifiers
tree_quick = load_tdf_model(ks_id=ks_id, triang_id=triang_id)
print(f"h11={tree_quick.h11}, h12={tree_quick.h12}")

# Load a single CICY model
tree_cicy = load_cicy_model(cicy_id=7890)
print(f"h11={tree_cicy.h11}, h12={tree_cicy.h12}")

# Query without instantiating a database object
df_quick = query_models(dataset="tdf", h12=2, has_conifolds=True)
print(f"h12=2 models with conifolds: {len(df_quick)}")