ml_view

ML-friendly view of a Vulcan repo.

ML training jobs need a stable train/validation/test split that never leaks the same Calabi-Yau geometry across splits. Vulcan rows carry a geometry_id() – a content-addressed hash of the model – which we use as the natural shard key: every row sharing a geometry_id lands in exactly one split.

This module exposes VulcanMLView, a thin layer over stringforge.vulcan.VulcanReader that:

  • assigns each geometry to a fixed split deterministically;

  • projects rows down to a feature list and optionally pads variable-length columns (flux, moduli_re, moduli_im, F_terms_*) to fixed-length tensors;

  • hands the result back as either a pandas.DataFrame or, when datasets is installed, a streaming datasets.Dataset.

stringforge.vulcan.ml_view.DEFAULT_FRACTIONS: Tuple[float, ...] = (0.8, 0.1, 0.1)

Default split fractions, in the order (train, val, test). The values sum to 1.0.

stringforge.vulcan.ml_view.DEFAULT_SPLITS: Tuple[str, ...] = ('train', 'val', 'test')

Recognised split labels. Two- or three-way slicing is the standard pattern; downstream code that wants a different scheme can call assign_split() directly with custom fractions and matching splits.

class stringforge.vulcan.ml_view.FeatureSpec(features=None, list_max_len=None, list_fill=0, keep_geometry_id=True)

Bases: object

Spec for shaping Vulcan rows into ML features.

features

Column projection. Optional; when None every non-geometry column is kept.

list_max_len

Fixed length for variable-length list columns. None leaves them as Python lists. Lists longer than list_max_len are silently right-truncated. Callers that need a strict length check should compare max(len(x) for x in series) against list_max_len before shaping.

list_fill

Value used to pad lists shorter than list_max_len. Only relevant when list_max_len is set.
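The padding and truncation behaviour described for list_max_len and list_fill can be sketched in plain Python. This is a minimal illustration, not the library's implementation: pad_or_truncate and check_max_len are hypothetical helper names, and right-padding is an assumption (the text above only states that truncation happens on the right).

```python
from typing import List, Optional, Sequence


def pad_or_truncate(
    values: Sequence[float], max_len: Optional[int], fill: float = 0
) -> List[float]:
    """Return `values` with exactly `max_len` items, or unchanged if max_len is None."""
    items = list(values)
    if max_len is None:
        return items  # leave variable-length lists as they are
    # Right-truncate long lists, then right-pad short ones with `fill`
    # (right-padding is an assumption of this sketch).
    items = items[:max_len]
    return items + [fill] * (max_len - len(items))


def check_max_len(column: Sequence[Sequence[float]], max_len: int) -> None:
    """Strict pre-check: raise instead of letting long lists be truncated silently."""
    longest = max((len(x) for x in column), default=0)
    if longest > max_len:
        raise ValueError(
            f"longest list has {longest} items, which exceeds list_max_len={max_len}"
        )


flux = [[1, 2, 3], [4], [5, 6, 7, 8, 9]]
check_max_len(flux, max_len=5)  # passes: the longest list has 5 items
padded = [pad_or_truncate(row, max_len=4, fill=0) for row in flux]
print(padded)  # [[1, 2, 3, 0], [4, 0, 0, 0], [5, 6, 7, 8]]
```

Running the pre-check with the same max_len that is later passed as list_max_len turns the silent truncation of the last row into an explicit error.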

keep_geometry_id

Whether to retain geometry_id in the returned DataFrame – typically wanted for monitoring held-out performance per geometry.

class stringforge.vulcan.ml_view.VulcanMLView(reader, *, fractions=(0.8, 0.1, 0.1), splits=('train', 'val', 'test'), salt='', feature_spec=None)

Bases: object

Train/val/test view of a Vulcan source.

Wraps a stringforge.vulcan.VulcanReader with deterministic, geometry-disjoint splits and a configurable FeatureSpec. Two consumption paths:

  • as_dataframe() – materialise a pandas DataFrame for light workflows.

  • as_hf_dataset() – materialise a streaming datasets.Dataset (requires the datasets package).

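Parameters:
  • reader (VulcanReader) – The stringforge.vulcan.VulcanReader to wrap.

  • fractions (Sequence[float]) – Per-split fractions. Must sum to 1.0; order must match splits.

  • splits (Sequence[str]) – Split labels.

  • salt (str) – Optional salt that perturbs the split assignment without changing geometry ids.

  • feature_spec (Optional[FeatureSpec]) – Default feature spec, used when a call does not override it.
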
as_dataframe(split, *, feature_spec=None, query_filters=None)

Materialise the rows of a specific split as a pandas DataFrame.

The split is determined by hashing geometry_id; rows from the same geometry are guaranteed to land in the same split.
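The same-geometry guarantee is easy to verify downstream. The following sketch assumes each materialised row has been reduced to a (geometry_id, split) pair; find_leaks is a hypothetical helper, not part of the module.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_leaks(rows: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Return {geometry_id: splits} for every geometry seen in more than one split."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for geometry_id, split in rows:
        seen[geometry_id].add(split)
    # A geometry leaks when it appears under two or more split labels.
    return {gid: splits for gid, splits in seen.items() if len(splits) > 1}


clean = [("g1", "train"), ("g1", "train"), ("g2", "val"), ("g3", "test")]
assert find_leaks(clean) == {}

leaky = clean + [("g2", "train")]  # g2 now appears in both val and train
assert find_leaks(leaky) == {"g2": {"val", "train"}}
```

An empty result means every geometry is confined to a single split, which is the property the hashing scheme is meant to provide.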

Parameters:
  • split (str) – Label to materialise (must be in splits).

  • feature_spec (Optional[FeatureSpec]) – Override the default feature spec for this call.

  • query_filters (Optional[Mapping[str, Any]]) – Optional extra filters forwarded to VulcanReader.query() to scope the input rows.

Returns:

pd.DataFrame – The shaped feature table.

Raises:

ValueError – When split is not a recognised label.

Return type:

DataFrame

as_hf_dataset(split, *, feature_spec=None, query_filters=None)

Materialise the rows of a specific split as a HuggingFace datasets.Dataset for streaming consumption by an ML framework.

datasets is an optional dependency; this method imports it lazily and raises a clear error when it is not installed.
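A common way to implement this kind of lazy, optional import is sketched below. It is an illustration of the pattern only; require_optional is a hypothetical helper and the error wording is an assumption, not the message the module actually raises.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, feature: str) -> ModuleType:
    """Import an optional dependency on first use, with an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with a message that names the feature and the missing package.
        raise ImportError(
            f"{feature} requires the optional '{module_name}' package; "
            f"install it with `pip install {module_name}`."
        ) from exc


# Inside a method, one would call for example:
#     datasets = require_optional("datasets", "as_hf_dataset()")
json_mod = require_optional("json", "demo")  # stdlib module, always importable
print(json_mod.dumps({"ok": True}))
```

Because the import happens inside the call rather than at module load time, users who only need as_dataframe() never pay for, or trip over, the missing package.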

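Parameters:
  • split (str) – Label to materialise (must be in splits).

  • feature_spec (Optional[FeatureSpec]) – Override the default feature spec for this call.

  • query_filters (Optional[Mapping[str, Any]]) – Optional extra filters forwarded to VulcanReader.query() to scope the input rows.
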
Returns:

datasets.Dataset – A HuggingFace dataset.

Raises:

ImportError – When the optional datasets package is not installed.

Return type:

Any

split_assignments()

Return a two-column table of (geometry_id, split) for every geometry the reader can see.

Returns:

pd.DataFrame – One row per distinct geometry.

Return type:

DataFrame

split_of(geometry_id)

Return the split label assigned to a specific geometry.

Parameters:

geometry_id (str) – The model identifier to look up.

Returns:

str – One of splits.

Return type:

str

stringforge.vulcan.ml_view.assign_split(geometry_id, *, fractions=(0.8, 0.1, 0.1), splits=('train', 'val', 'test'), salt='')

Map a geometry_id to one of the supplied splits deterministically.

The function hashes salt + geometry_id to a uniform point in [0, 1) and uses the cumulative-fraction breakpoints to pick a split. The same (geometry_id, salt, fractions, splits) inputs always return the same split.
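The hash-then-breakpoint scheme can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the hash function (SHA-256 here), the byte encoding, and the way the digest is scaled to [0, 1) are all guesses, so assign_split_sketch will not reproduce the assignments made by assign_split().

```python
import hashlib
from itertools import accumulate
from typing import Sequence


def assign_split_sketch(
    geometry_id: str,
    *,
    fractions: Sequence[float] = (0.8, 0.1, 0.1),
    splits: Sequence[str] = ("train", "val", "test"),
    salt: str = "",
) -> str:
    # Assumption: SHA-256 over the UTF-8 bytes of salt + geometry_id.
    digest = hashlib.sha256((salt + geometry_id).encode("utf-8")).digest()
    # Keep the top 53 bits of the first 8 bytes so that the float is exact
    # and strictly less than 1.0.
    u = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    # Cumulative breakpoints, e.g. 0.8, 0.9, 1.0 for the default fractions.
    for label, upper in zip(splits, accumulate(fractions)):
        if u < upper:
            return label
    return splits[-1]  # guards against rounding in the final breakpoint


ids = [f"geom-{i}" for i in range(1000)]
first = [assign_split_sketch(g) for g in ids]
assert first == [assign_split_sketch(g) for g in ids]  # deterministic

# A different salt gives an independent assignment, e.g. one per CV fold.
fold_1 = [assign_split_sketch(g, salt="fold-1") for g in ids]
print(first.count("train"), fold_1.count("train"))
```

Because the split depends only on the hashed identifier, adding new geometries to the repo never moves existing ones between splits, as long as fractions, splits and salt are left unchanged.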

Parameters:
  • geometry_id (str) – The Vulcan content-addressed model identifier.

  • fractions (Sequence[float]) – Per-split fractions. Must sum to 1.0; order must match splits.

  • splits (Sequence[str]) – Split labels.

  • salt (str) – Optional salt to perturb the assignment without changing geometry ids – useful when running cross-validation folds.

Returns:

str – One of the labels in splits.

Raises:

ValueError – When fractions does not match splits in length, contains a negative value, or sums to something other than 1.0.

Return type:

str
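The three conditions listed under Raises for assign_split() can be checked up front, as in the sketch below. validate_fractions is a hypothetical helper, and the floating-point tolerance is an assumption: the documentation does not say how strictly "sums to 1.0" is enforced.

```python
import math
from typing import Sequence


def validate_fractions(fractions: Sequence[float], splits: Sequence[str]) -> None:
    """Raise ValueError if fractions cannot be used with the given splits."""
    if len(fractions) != len(splits):
        raise ValueError(
            f"got {len(fractions)} fractions for {len(splits)} splits"
        )
    if any(f < 0 for f in fractions):
        raise ValueError("fractions must be non-negative")
    # Assumption: a small absolute tolerance absorbs float rounding.
    if not math.isclose(math.fsum(fractions), 1.0, abs_tol=1e-9):
        raise ValueError(f"fractions sum to {math.fsum(fractions)}, expected 1.0")


validate_fractions((0.8, 0.1, 0.1), ("train", "val", "test"))  # passes silently
validate_fractions((0.9, 0.1), ("train", "test"))  # two-way split also passes
```

Comparing with a tolerance rather than == avoids rejecting fractions such as (0.7, 0.2, 0.1) if their floating-point sum is not exactly 1.0.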

stringforge.vulcan.ml_view.shape_features(df, spec)

Apply a FeatureSpec to a Vulcan-shaped DataFrame.

The returned DataFrame contains exactly the projected feature columns (plus geometry_id when spec.keep_geometry_id is True) with list columns padded to spec.list_max_len where applicable.

Parameters:
  • df (DataFrame) – Source rows (typically the output of stringforge.vulcan.VulcanReader.query()).

  • spec (FeatureSpec) – The feature spec.

Returns:

pd.DataFrame – Shaped feature table.

Raises:

KeyError – When spec.features references a column missing from df.

Return type:

DataFrame
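The projection and padding steps that shape_features() describes can be illustrated without pandas. In this sketch a list of dicts stands in for the DataFrame, and shape_rows is a hypothetical function whose keyword arguments mirror the FeatureSpec fields; the real function's internals are not shown in this reference, so treat the details (including right-padding) as assumptions.

```python
from typing import Any, Dict, List, Optional, Sequence

Row = Dict[str, Any]


def shape_rows(
    rows: List[Row],
    features: Optional[Sequence[str]] = None,
    list_max_len: Optional[int] = None,
    list_fill: Any = 0,
    keep_geometry_id: bool = True,
) -> List[Row]:
    """Project rows to the requested columns and pad list-valued ones."""
    if not rows:
        return []
    columns = list(rows[0])
    if features is None:
        # Default projection: every non-geometry column.
        features = [c for c in columns if c != "geometry_id"]
    missing = [c for c in features if c not in columns]
    if missing:
        raise KeyError(f"columns missing from input: {missing}")
    wanted = list(features)
    if keep_geometry_id and "geometry_id" in columns and "geometry_id" not in wanted:
        wanted.insert(0, "geometry_id")
    shaped = []
    for row in rows:
        out: Row = {}
        for name in wanted:
            value = row[name]
            if list_max_len is not None and isinstance(value, list):
                value = value[:list_max_len]  # right-truncate
                value = value + [list_fill] * (list_max_len - len(value))
            out[name] = value
        shaped.append(out)
    return shaped


rows = [{"geometry_id": "g1", "flux": [1, 2, 3], "moduli_re": [0.5], "note": "x"}]
print(shape_rows(rows, features=["flux", "moduli_re"], list_max_len=4))
# [{'geometry_id': 'g1', 'flux': [1, 2, 3, 0], 'moduli_re': [0.5, 0, 0, 0]}]
```

Checking for missing columns before shaping any row means a typo in features fails immediately rather than after part of the table has been processed.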