Build a tabular ML project with scikit-learn Pipelines, MLflow tracking, model artifacts, and a marimo demo notebook. Use when starting any new tabular classification or regression bundle so all bundles share the same plumbing.
The reference layout every tabular bundle in ManagerPack copies. The goal is consistency: same project structure, same MLflow conventions, same load/predict path, same notebook style. Whatever the actual problem, the plumbing is identical.
The worked example is "is this coin fair?" — a logistic regression on (flip_index, outcome). The model is trivial; the plumbing is the point.
```
<bundle>/
├── README.md          # what this bundle does + how to run it
├── SKILL.md           # this file (or specialized for the bundle)
├── src/
│   ├── train.py       # train + log to MLflow
│   ├── predict.py     # load model from MLflow and predict
│   └── plots.py       # plot helpers, logged as MLflow artifacts
├── notebooks/
│   └── <name>_demo.py # marimo notebook with mo.ui.slider
└── mlruns/            # MLflow tracking store (gitignored)
```
The data lives outside the bundle, in `studio/data/<problem>.parquet`, generated by `datagen <problem>`. Bundles never carry their own data — they consume parquet from the studio's shared data directory.
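As a sketch of that contract (the column names are assumptions from the fair-coin example; a real bundle would call `pd.read_parquet` on the shared path):

```python
import numpy as np
import pandas as pd

# In a real bundle: df = pd.read_parquet("studio/data/<problem>.parquet")
# In-memory stand-in with the fair-coin example's assumed schema:
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "flip_index": np.arange(200),        # which flip in the sequence
    "outcome": rng.integers(0, 2, 200),  # 1 = heads, 0 = tails
})
X, y = df[["flip_index"]], df["outcome"]
```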
Always wrap preprocessing inside the sklearn Pipeline so it travels with the model on save/load:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

pipeline = Pipeline([
    ("preprocess", ColumnTransformer([
        ("scaled", StandardScaler(), numeric_cols),
        ("encoded", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])
```
Never separate the preprocessing step from the model — the loaded artifact must work standalone with raw input.
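A minimal end-to-end sketch of why this matters (synthetic data, hypothetical column names): the fitted Pipeline accepts raw rows directly, so the saved artifact needs no external preprocessing code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "flip_index": np.arange(100),
    "rig": rng.choice(["a", "b"], 100),  # hypothetical categorical column
})
y = rng.integers(0, 2, 100)

pipeline = Pipeline([
    ("preprocess", ColumnTransformer([
        ("scaled", StandardScaler(), ["flip_index"]),
        ("encoded", OneHotEncoder(handle_unknown="ignore"), ["rig"]),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)

# Raw, unscaled input goes straight in: preprocessing travels with the model
probs = pipeline.predict_proba(X.head(3))
```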
Every run logs:
| Kind | What |
|---|---|
| params | data path, `n_rows`, seed, `test_size`, `cv_folds`, model name, hyperparameters |
| metrics | CV mean & std of the held-out metric, test-set score, recovery error when ground truth is known |
| tags | `data_hash` (sha256 prefix), `true_*` ground-truth values from the sidecar |
| artifacts | model (via `mlflow.sklearn.log_model`), `plots/`, `data/sidecar.json` |
Recovery error against ground truth is the most important metric for template runs because it answers "did the model recover what we know to be true?"
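For the fair-coin case, a minimal sketch of what "recovery error" means (the metric name and the illustrative values are assumptions, not the bundle's real API):

```python
import math

true_p_heads = 0.5      # ground truth, read from the sidecar in a real run
intercept_logit = 0.04  # e.g. clf.intercept_[0] from the fitted model

# Map the logit back to a probability and compare against ground truth
p_hat = 1.0 / (1.0 + math.exp(-intercept_logit))
recovery_error = abs(p_hat - true_p_heads)
# mlflow.log_metric("recovery_error_p_heads", recovery_error)
```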
Use `mlflow.sklearn.log_model(sk_model=pipeline, name="model", input_example=X_train.head(5))`. Never use bare `joblib.dump()` or `pickle.dump()`. For internal templates MLflow's pickle-based format is fine; for security-sensitive deployments, use the skops format instead.

The MLflow load path:

```python
import mlflow
import mlflow.sklearn

mlflow.set_tracking_uri(f"file:{template_dir / 'mlruns'}")
model = mlflow.sklearn.load_model(f"runs:/{run_id}/model")
predictions = model.predict_proba(new_data)
```
The Pipeline (preprocessing + classifier) comes back as one object. No need to re-import the training code.
When the model is linear, log the interpretable quantities as metrics so they show up in the MLflow UI:

```python
clf = pipeline.named_steps["clf"]
mlflow.log_metric("intercept_logit", float(clf.intercept_[0]))
mlflow.log_metric("coef_some_feature", float(clf.coef_[0][feature_idx]))
```
For the fair-coin template, the intercept (in logit space) maps directly
to $P(\text{heads})$ at the mean of the standardized index, and the slope
on flip_index is a non-stationarity detector. Always interpret what
the coefficients mean in domain terms.
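A runnable sketch of that interpretation on synthetic fair-coin data (seed and sample size are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = np.arange(2000).reshape(-1, 1)  # flip_index
y = rng.integers(0, 2, size=2000)   # a genuinely fair coin

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X, y)

clf = pipe.named_steps["clf"]
p_heads = 1.0 / (1.0 + np.exp(-clf.intercept_[0]))  # sigmoid of the intercept
slope = clf.coef_[0][0]  # near 0 when the coin's bias is stationary over time
```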
Generate matplotlib figures in src/plots.py, save them to a temp
directory, and log them as MLflow artifacts under plots/. Always