ml_view#
ML-friendly view of a Vulcan repo.
ML training jobs need a stable train/validation/test split that
never leaks the same Calabi-Yau geometry across splits. Vulcan rows
carry a geometry_id() – a
content-addressed hash of the model – which we use as the natural
shard key: every row sharing a geometry_id lands in exactly one
split.
This module exposes VulcanMLView, a thin layer over
stringforge.vulcan.VulcanReader that:
- assigns each geometry to a fixed split deterministically;
- projects rows down to a feature list and optionally pads variable-length columns (flux, moduli_re, moduli_im, F_terms_*) to fixed-length tensors;
- hands the result back as either a pandas.DataFrame or, when datasets is installed, a streaming datasets.Dataset.
- stringforge.vulcan.ml_view.DEFAULT_FRACTIONS: Tuple[float, ...] = (0.8, 0.1, 0.1)#
Default split fractions (train, val, test). Sums to 1.0.
- stringforge.vulcan.ml_view.DEFAULT_SPLITS: Tuple[str, ...] = ('train', 'val', 'test')#
Recognised splits. Two- or three-way slicing is the standard pattern; downstream code that wants a different scheme can call assign_split() directly with custom fractions.
- class stringforge.vulcan.ml_view.FeatureSpec(features=None, list_max_len=None, list_fill=0, keep_geometry_id=True)#
Bases: object
Spec for shaping Vulcan rows into ML features.
- features#
Column projection. Optional; when None, every non-geometry column is kept.
- list_max_len#
Fixed length for variable-length list columns. None leaves them as Python lists. Lists longer than list_max_len are silently right-truncated. Callers needing strict-length checks should pre-check max(len(x) for x in series) against list_max_len.
- list_fill#
Pad value when list_max_len is set.
- keep_geometry_id#
Whether to retain geometry_id in the returned DataFrame – typically wanted for monitoring held-out performance per geometry.
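Because over-long lists are truncated silently, a strict-length pre-check along the lines suggested for list_max_len could look like the sketch below. The helper name check_list_lengths and the sample flux data are hypothetical and not part of stringforge; a plain list of lists stands in for a pandas column.

```python
from typing import Iterable, Sequence


def check_list_lengths(column: Iterable[Sequence], list_max_len: int) -> int:
    """Return the longest list length, raising if any list would be truncated.

    Hypothetical helper for illustration; not part of stringforge.
    """
    # default=0 keeps max() from failing on an empty column.
    longest = max((len(x) for x in column), default=0)
    if longest > list_max_len:
        raise ValueError(
            f"longest list has {longest} entries, "
            f"exceeding list_max_len={list_max_len}"
        )
    return longest


# Made-up stand-in for a variable-length column such as flux.
flux_column = [[1, -2, 0], [3, 1], [0, 0, 4, 1]]
longest_seen = check_list_lengths(flux_column, list_max_len=4)
```

Running the check before building a FeatureSpec turns silent truncation into an explicit error at the point where list_max_len is chosen.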
- class stringforge.vulcan.ml_view.VulcanMLView(reader, *, fractions=(0.8, 0.1, 0.1), splits=('train', 'val', 'test'), salt='', feature_spec=None)#
Bases: object
Train/val/test view of a Vulcan source.
Wraps a stringforge.vulcan.VulcanReader with deterministic, geometry-disjoint splits and a configurable FeatureSpec. Two consumption paths:
- as_dataframe() – materialise a pandas DataFrame for light workflows.
- as_hf_dataset() – materialise a streaming datasets.Dataset (requires the datasets package).
- Parameters:
reader (VulcanReader) – Source reader.
fractions (Sequence[float]) – Per-split fractions (must sum to 1.0).
salt (str) – Optional salt for cross-validation folds.
feature_spec (Optional[FeatureSpec]) – Default FeatureSpec; can be overridden per call.
- as_dataframe(split, *, feature_spec=None, query_filters=None)#
Materialise the rows of a specific split as a pandas DataFrame.
The split is determined by hashing geometry_id; rows from the same geometry are guaranteed to land in the same split.
- Parameters:
- Returns:
pd.DataFrame – The shaped feature table.
- Raises:
ValueError – When split is not a recognised label.
- Return type:
DataFrame
- as_hf_dataset(split, *, feature_spec=None, query_filters=None)#
Materialise the rows of a specific split as a Hugging Face datasets.Dataset for streaming consumption by an ML framework. datasets is an optional dependency; this method imports it lazily and raises a clear error when it is not installed.
- Parameters:
- Returns:
datasets.Dataset – A Hugging Face dataset.
- Raises:
ImportError – When the optional datasets package is not installed.
- Return type:
- split_assignments()#
Return a two-column table of (geometry_id, split) for every geometry the reader can see.
- Returns:
pd.DataFrame – One row per distinct geometry.
- Return type:
DataFrame
- stringforge.vulcan.ml_view.assign_split(geometry_id, *, fractions=(0.8, 0.1, 0.1), splits=('train', 'val', 'test'), salt='')#
Map a geometry_id to one of the supplied splits deterministically.
The function hashes salt + geometry_id to a uniform point in [0, 1) and uses the cumulative-fraction breakpoints to pick a split. The same (geometry_id, salt, fractions, splits) always returns the same split.
- Parameters:
- Returns:
str – One of the labels in splits.
- Raises:
ValueError – When fractions does not match splits in length, contains a negative value, or sums to something other than 1.0.
- Return type:
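The hash-then-breakpoint scheme described for assign_split() can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the choice of SHA-256, the 53-bit mapping to [0, 1), the summation tolerance, and the name assign_split_sketch are all assumptions, so the labels it produces need not match what stringforge returns.

```python
import hashlib
from typing import Sequence


def assign_split_sketch(
    geometry_id: str,
    *,
    fractions: Sequence[float] = (0.8, 0.1, 0.1),
    splits: Sequence[str] = ("train", "val", "test"),
    salt: str = "",
) -> str:
    """Illustrative stand-in for assign_split(); the hash choice is an assumption."""
    if len(fractions) != len(splits):
        raise ValueError("fractions and splits must have the same length")
    if any(f < 0 for f in fractions):
        raise ValueError("fractions must be non-negative")
    if abs(sum(fractions) - 1.0) > 1e-9:  # tolerance is an assumption
        raise ValueError("fractions must sum to 1.0")

    # Hash salt + geometry_id, keep the top 53 bits of the first 8 bytes, and
    # scale them so the point lies exactly in [0, 1).
    digest = hashlib.sha256((salt + geometry_id).encode("utf-8")).digest()
    point = (int.from_bytes(digest[:8], "big") >> 11) / 2**53

    # Walk the cumulative-fraction breakpoints; the first one above the
    # point selects the split.
    cumulative = 0.0
    for label, fraction in zip(splits, fractions):
        cumulative += fraction
        if point < cumulative:
            return label
    # Floating-point round-off can leave the final breakpoint just below 1.0.
    return splits[-1]
```

Because the split depends only on the hashed string, every row sharing a geometry_id maps to the same label. Passing a different salt (for example one value per cross-validation fold) reshuffles the assignment without changing the fractions.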
- stringforge.vulcan.ml_view.shape_features(df, spec)#
Apply a FeatureSpec to a Vulcan-shaped DataFrame.
The returned DataFrame contains exactly the projected feature columns (plus geometry_id when spec.keep_geometry_id is True) with list columns padded to spec.list_max_len where applicable.
- Parameters:
df (DataFrame) – Source rows (typically the output of stringforge.vulcan.VulcanReader.query()).
spec (FeatureSpec) – The feature spec.
- Returns:
pd.DataFrame – Shaped feature table.
- Raises:
KeyError – When spec.features references a column missing from df.
- Return type:
DataFrame
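The per-list shaping that list_max_len and list_fill describe can be sketched in plain Python. The helper pad_or_truncate is hypothetical, and right-padding is an assumption: the reference states that over-long lists are right-truncated but does not say on which side the fill value is added.

```python
from typing import Any, List, Sequence


def pad_or_truncate(values: Sequence[Any], max_len: int, fill: Any = 0) -> List[Any]:
    """Return a list of exactly max_len entries.

    Hypothetical helper for illustration; not part of stringforge.
    Over-long inputs lose their trailing entries (right-truncation);
    short inputs are padded on the right with `fill` (an assumption).
    """
    clipped = list(values[:max_len])
    return clipped + [fill] * (max_len - len(clipped))


# Made-up lists standing in for one variable-length column.
short_row = pad_or_truncate([1.5, -0.5], max_len=4)
long_row = pad_or_truncate([1, 2, 3, 4, 5], max_len=3, fill=-1)
```

On a pandas column the same function could be applied element-wise with Series.apply, after which every entry has the same length and can be stacked into a fixed-shape array.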