Name: Binary Classification
Author: brojonat


<project>/
├── data/                # input parquet/csv
├── src/
│   ├── train.py         # ibis read → Pipeline + XGBClassifier → MLflow log
│   ├── predict.py       # reload model, apply tuned threshold
│   └── plots.py         # ROC, PR, calibration, threshold sweep, SHAP
├── notebooks/
│   └── demo.py          # marimo walkthrough
└── mlruns/              # MLflow tracking store (gitignored)

import ibis

table = ibis.duckdb.connect().read_parquet("data/train.parquet")
feature_cols = [c for c in table.columns if c.startswith("feature_")]

# Class balance via an ibis aggregation (pushed down to DuckDB)
class_stats = (
    table
    .aggregate(
        n_pos=table.target.sum().cast("int64"),
        n_total=table.count(),
    )
    .execute()
    .iloc[0]
)
n_pos = int(class_stats["n_pos"])
scale_pos_weight = (int(class_stats["n_total"]) - n_pos) / n_pos

# Materialize features + target — the ibis → pandas boundary
data = (
    table
    .select(*feature_cols, "target")
    .execute()
)
X = data[feature_cols]
y = data["target"].astype(int)
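The later snippets refer to `X_train`, `X_val`, `y_val`, and `y_test` without showing where they come from. The following is a minimal sketch of one way to produce them, assuming a stratified 60/20/20 split with scikit-learn's `train_test_split`. The helper name `three_way_split`, the split fractions, and the synthetic data are illustrative assumptions, not part of the original.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(X, y, val_size=0.2, test_size=0.2, seed=0):
    """Stratified train/validation/test split (hypothetical helper)."""
    # First carve off the test set, keeping the class ratio intact.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # Then split the remainder; rescale val_size so it stays a fraction of the whole.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest,
        y_rest,
        test_size=val_size / (1.0 - test_size),
        stratify=y_rest,
        random_state=seed,
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in for the materialized features and target: 10% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

X_train, X_val, X_test, y_train, y_val, y_test = three_way_split(X_demo, y_demo)
```

Stratifying both splits keeps the positive rate roughly equal across the three sets, which matters when the threshold is tuned on validation data and calibration is checked on test data.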

# GOOD — single chain, reads as a recipe
data = (
    table
    .filter(table.target.notnull())
    .select(*feature_cols, "target")
    .execute()
)

# BAD — fragmented across mutations
table = table.filter(table.target.notnull())
table = table.select(*feature_cols, "target")
data = table.execute()

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from xgboost import XGBClassifier

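# numeric_cols and categorical_cols are lists of column names, assumed to be
# defined at module level before build_pipeline is called.
def build_pipeline(scale_pos_weight: float, seed: int) -> Pipeline: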
    return Pipeline([
        ("preprocess", ColumnTransformer([
            ("num", StandardScaler(), numeric_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ])),
        ("clf", XGBClassifier(
            n_estimators=300,
            max_depth=4,
            learning_rate=0.05,
            subsample=0.8,
            colsample_bytree=0.8,
            reg_lambda=1.0,
            scale_pos_weight=scale_pos_weight,
            objective="binary:logistic",
            eval_metric="logloss",
            random_state=seed,
            n_jobs=-1,
        )),
    ])

n_pos = int(y_train.sum())
n_neg = int(len(y_train) - n_pos)
scale_pos_weight = n_neg / n_pos
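The ratio above divides by the positive count, so it raises `ZeroDivisionError` on a split that happens to contain no positives. The sketch below wraps the same arithmetic with a guard; `safe_scale_pos_weight` is a hypothetical helper name, and the counts in the usage line are made up.

```python
def safe_scale_pos_weight(n_pos: int, n_total: int) -> float:
    """Return n_neg / n_pos, the same ratio the snippet above computes.

    Hypothetical helper: raises a descriptive error instead of dividing by
    zero when a split contains no positive rows.
    """
    if n_total <= 0 or not 0 <= n_pos <= n_total:
        raise ValueError("expected n_total > 0 and 0 <= n_pos <= n_total")
    if n_pos == 0:
        raise ValueError("no positive examples; scale_pos_weight is undefined")
    return (n_total - n_pos) / n_pos


# 50 positives out of 1,000 rows: 950 negatives / 50 positives.
weight = safe_scale_pos_weight(n_pos=50, n_total=1000)
```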

import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

proba = pipeline.predict_proba(X_val)[:, 1]
thresholds = np.linspace(0.01, 0.99, 99)

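# Optimize for F1 (balanced precision and recall).
# zero_division=0 scores thresholds that predict no positives as 0 without a warning.
f1s = [f1_score(y_val, (proba >= t).astype(int), zero_division=0) for t in thresholds]
best_threshold = thresholds[int(np.argmax(f1s))]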

# Alternative to F1: minimize expected cost, e.g. fraud detection where
# each FN costs $100 and each FP costs $5
def expected_cost(y_true, y_pred, fp_cost, fn_cost):
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    return fp * fp_cost + fn * fn_cost

best_threshold = min(
    thresholds,
    key=lambda t: expected_cost(y_val, (proba >= t).astype(int), 5, 100),
)
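The two sweeps above can pick different thresholds from the same scores. Here is a small worked example on hand-made data, using only NumPy. The labels, scores, and threshold grid are invented for illustration; the costs reuse the $5 and $100 figures from the fraud comment.

```python
import numpy as np

# Toy validation labels and scores (made up for illustration).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.90])
grid = np.array([0.12, 0.25, 0.50, 0.65])

FP_COST, FN_COST = 5, 100  # same costs as the fraud example above


def confusion_counts(y, pred):
    """Return (tp, fp, fn) for arrays of 0/1 labels and predictions."""
    tp = int(((pred == 1) & (y == 1)).sum())
    fp = int(((pred == 1) & (y == 0)).sum())
    fn = int(((pred == 0) & (y == 1)).sum())
    return tp, fp, fn


rows = []
for t in grid:
    tp, fp, fn = confusion_counts(y_true, (scores >= t).astype(int))
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    cost = fp * FP_COST + fn * FN_COST
    rows.append((float(t), f1, cost))

f1_best = max(rows, key=lambda r: r[1])[0]    # threshold with the highest F1
cost_best = min(rows, key=lambda r: r[2])[0]  # threshold with the lowest total cost
```

On this toy data the F1 criterion favors the highest threshold in the grid, while the cost criterion favors a lower one, because a single missed positive costs as much as twenty false alarms.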

from sklearn.metrics import brier_score_loss
from sklearn.calibration import calibration_curve

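# Score the held-out test set; the earlier `proba` was computed on X_val.
proba_test = pipeline.predict_proba(X_test)[:, 1]
brier = brier_score_loss(y_test, proba_test)
frac_pos, mean_pred = calibration_curve(y_test, proba_test, n_bins=10, strategy="quantile")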
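To make the two quantities concrete, the sketch below computes them by hand in plain Python. The Brier score is the mean squared gap between predicted probability and outcome. The binning is a simplified equal-count scheme that only approximates `strategy="quantile"`; it is not scikit-learn's exact bin-edge logic. Both function names and the four-row data set are hypothetical.

```python
def brier_score(y_true, proba):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, proba)) / len(y_true)


def reliability_bins(y_true, proba, n_bins=2):
    """Equal-count bins over sorted predictions (hypothetical helper).

    Returns one (mean predicted probability, observed positive rate) pair
    per bin; a well-calibrated model has the two values close in every bin.
    """
    pairs = sorted(zip(proba, y_true))
    size = len(pairs) // n_bins
    if size == 0:
        raise ValueError("need at least one row per bin")
    out = []
    for b in range(n_bins):
        # The last bin absorbs any remainder rows.
        end = len(pairs) if b == n_bins - 1 else (b + 1) * size
        chunk = pairs[b * size:end]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        frac_pos = sum(y for _, y in chunk) / len(chunk)
        out.append((mean_pred, frac_pos))
    return out


y_demo = [0, 0, 1, 1]
p_demo = [0.1, 0.4, 0.6, 0.9]

brier_demo = brier_score(y_demo, p_demo)
bins_demo = reliability_bins(y_demo, p_demo, n_bins=2)
```

Plotting the mean prediction against the observed positive rate for each bin gives the reliability diagram; points far from the diagonal indicate over- or under-confident probabilities.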

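from sklearn.calibration import CalibratedClassifierCV
# cv="prefit" calibrates the already-fitted pipeline on held-out data; scikit-learn 1.6+
# deprecates it in favor of wrapping the estimator in sklearn.frozen.FrozenEstimator.
calibrated = CalibratedClassifierCV(pipeline, method="isotonic", cv="prefit")
calibrated.fit(X_val, y_val)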

Binary Classification with XGBoost (Done Right)

When to use this skill

When NOT to use this skill

Project layout

Data access — ibis at the source, pandas at the sklearn boundary

The pipeline

The four things that separate this from a tutorial

1. Class imbalance — `scale_pos_weight`, not resampling

2. Threshold tuning — 0.5 is rarely the right cutoff

3. Calibration verification — Brier score + reliability diagram

4. SHAP for feature importance
