Model validation: cross-validation, holdout, residual analysis, comparison with known solutions, and hypothesis testing. Trigger words: 模型验证、交叉验证、残差分析、model validation、留出法、误差分析、假设检验.
Execution description: $ARGUMENTS
MODEL_VALIDATION_REPORT.md — main output file, consumed by paper-write and model-review.
figures/validation/ — directory for validation figures.
artifacts/ — directory for intermediate structured files.
5 — default number of cross-validation folds.
0.20 — default holdout test-set ratio.
42 — random seed, for reproducibility.
0.05 — significance level for the residual normality test.
0.05 — significance level for the residual homoscedasticity test.
0.05 — significance level for the residual independence test (Durbin-Watson).
5 — maximum number of benchmarks when comparing against known solutions.
gpt-5.4 — Codex MCP cross-check model.
high — reasoning effort used for the model-validation review.
0.70 — an R-squared below 0.70 must be flagged as "underfit" in the report.
20 — upper time limit for a single validation computation.
Input: $ARGUMENTS, existing model files and data.
Output: artifacts/validation_scope.json (model information and validation plan).
If $ARGUMENTS is a file path, read the model code or results file. Also read: MODEL_REPORT.md, SOLVE_PLAN.md, FINAL_PROPOSAL.md; model implementation code in the scripts/ and src/ directories; solver results in the results/ directory; cleaned data in the data/cleaned/ directory; artifacts/validation_scope.json.
import json
from pathlib import Path
def determine_validation_strategy(model_type: str, n_samples: int, n_features: int) -> dict:
    """Determine appropriate validation methods based on model type and data size."""
    strategy = {
        "model_type": model_type,
        "n_samples": n_samples,
        "n_features": n_features,
        "methods": [],
    }
    if model_type in ["regression", "classification"]:
        if n_samples >= 500:
            strategy["methods"].append({"name": "k_fold_cv", "k": 5})
        if n_samples >= 100:
            strategy["methods"].append({"name": "holdout", "test_ratio": 0.20})
        strategy["methods"].append({"name": "residual_analysis"})
    elif model_type == "optimization":
        strategy["methods"].append({"name": "benchmark_comparison"})
        strategy["methods"].append({"name": "convergence_analysis"})
        strategy["methods"].append({"name": "relaxation_bound"})
    elif model_type == "prediction":
        strategy["methods"].append({"name": "temporal_holdout", "test_ratio": 0.20})
        strategy["methods"].append({"name": "sliding_window", "window_size": "auto"})
        strategy["methods"].append({"name": "prediction_interval"})
    elif model_type == "simulation":
        strategy["methods"].append({"name": "real_data_comparison"})
        strategy["methods"].append({"name": "extreme_scenario_test"})
    # Universal methods
    strategy["methods"].append({"name": "hypothesis_tests"})
    return strategy

Path("artifacts").mkdir(exist_ok=True)
# strategy = determine_validation_strategy("regression", n_samples=1000, n_features=10)
# Path("artifacts/validation_scope.json").write_text(
#     json.dumps(strategy, ensure_ascii=False, indent=2), encoding="utf-8"
# )
Input: model, data, validation plan.
Output: artifacts/cv_results.json, summary of performance metrics.
Holdout: split into training and test sets with DEFAULT_TEST_RATIO (20%); DEFAULT_RANDOM_SEED ensures reproducibility.
K-fold cross-validation: DEFAULT_K_FOLDS (5) folds.
Time-series validation (where applicable).
Performance metrics are chosen according to the model type.
If R-squared < ACCEPTABLE_R2_THRESHOLD, flag it in red in the report.
Compare training-set and test-set performance; a gap > 10% is flagged as an overfitting risk.
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def holdout_validation(model, X, y, test_ratio=0.20, seed=42):
    """Run holdout validation."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_ratio, random_state=seed)
    model.fit(X_train, y_train)
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    return {
        "train": {
            "r2": round(float(r2_score(y_train, y_pred_train)), 4),
            "rmse": round(float(np.sqrt(mean_squared_error(y_train, y_pred_train))), 4),
            "mae": round(float(mean_absolute_error(y_train, y_pred_train)), 4),
        },
        "test": {
            "r2": round(float(r2_score(y_test, y_pred_test)), 4),
            "rmse": round(float(np.sqrt(mean_squared_error(y_test, y_pred_test))), 4),
            "mae": round(float(mean_absolute_error(y_test, y_pred_test)), 4),
        },
        "n_train": len(X_train),
        "n_test": len(X_test),
        "overfit_risk": abs(r2_score(y_train, y_pred_train) - r2_score(y_test, y_pred_test)) > 0.10,
    }

def kfold_validation(model, X, y, k=5, seed=42):
    """Run K-fold cross-validation."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    fold_results = []
    for fold_idx, (train_idx, test_idx) in enumerate(kf.split(X)):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        fold_results.append({
            "fold": fold_idx + 1,
            "r2": round(float(r2_score(y_test, y_pred)), 4),
            "rmse": round(float(np.sqrt(mean_squared_error(y_test, y_pred))), 4),
            "mae": round(float(mean_absolute_error(y_test, y_pred)), 4),
        })
    r2_values = [f["r2"] for f in fold_results]
    return {
        "k": k,
        "folds": fold_results,
        "r2_mean": round(float(np.mean(r2_values)), 4),
        "r2_std": round(float(np.std(r2_values)), 4),
        "r2_cv": round(float(np.std(r2_values) / np.mean(r2_values)), 4) if np.mean(r2_values) != 0 else None,
        "stable": float(np.std(r2_values) / max(abs(np.mean(r2_values)), 1e-10)) < 0.20,
    }
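The strategy table also lists temporal_holdout and sliding_window for prediction models, but no implementation appears above. A minimal sketch using scikit-learn's TimeSeriesSplit (an expanding-window scheme; the function name temporal_validation and its return shape are assumptions, mirroring kfold_validation):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import r2_score

def temporal_validation(model, X, y, n_splits=5):
    """Expanding-window validation: each fold trains on the past and tests on the future."""
    X, y = np.asarray(X), np.asarray(y)
    tscv = TimeSeriesSplit(n_splits=n_splits)
    fold_r2 = []
    for train_idx, test_idx in tscv.split(X):
        model.fit(X[train_idx], y[train_idx])
        fold_r2.append(round(float(r2_score(y[test_idx], model.predict(X[test_idx]))), 4))
    return {
        "n_splits": n_splits,
        "r2_per_fold": fold_r2,
        "r2_mean": round(float(np.mean(fold_r2)), 4),
    }
```

Unlike KFold, TimeSeriesSplit never shuffles, so test folds always come after their training data, which is what a forecasting claim requires.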
Input: model predictions and actual values.
Output: artifacts/residual_analysis.json, residual diagnostic plots.
residuals = y_actual - y_predicted. If the normality-test p-value < RESIDUAL_NORMALITY_ALPHA, the residuals do not follow a normal distribution; if the p-value < HOMOSCEDASTICITY_ALPHA, heteroscedasticity is present.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from scipy import stats

def residual_analysis(y_actual, y_predicted, fig_dir="figures/validation"):
    """Comprehensive residual analysis."""
    residuals = np.array(y_actual) - np.array(y_predicted)
    fitted = np.array(y_predicted)
    n = len(residuals)
    # Normality test: Shapiro-Wilk for small samples, Kolmogorov-Smirnov for large ones
    if n < 5000:
        stat_norm, p_norm = stats.shapiro(residuals)
        norm_test = "Shapiro-Wilk"
    else:
        stat_norm, p_norm = stats.kstest(residuals, "norm", args=(np.mean(residuals), np.std(residuals)))
        norm_test = "Kolmogorov-Smirnov"
    # Independence test (Durbin-Watson)
    diff = np.diff(residuals)
    dw = float(np.sum(diff ** 2) / np.sum(residuals ** 2))
    # Standardized residuals
    std_residuals = (residuals - np.mean(residuals)) / np.std(residuals)
    report = {
        "n_observations": n,
        "residual_mean": round(float(np.mean(residuals)), 6),
        "residual_std": round(float(np.std(residuals)), 6),
        "normality": {
            "test": norm_test,
            "statistic": round(float(stat_norm), 6),
            "p_value": round(float(p_norm), 6),
            "is_normal": bool(p_norm >= 0.05),
        },
        "independence": {
            "durbin_watson": round(dw, 4),
            "interpretation": "no autocorrelation" if 1.5 < dw < 2.5 else "possible autocorrelation",
        },
    }
    # Generate diagnostic plots
    Path(fig_dir).mkdir(parents=True, exist_ok=True)
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    # 1. Residuals vs Fitted
    axes[0, 0].scatter(fitted, residuals, alpha=0.5, s=10, color="#2196F3")
    axes[0, 0].axhline(y=0, color="red", linestyle="--", linewidth=1)
    axes[0, 0].set_xlabel("Fitted Values")
    axes[0, 0].set_ylabel("Residuals")
    axes[0, 0].set_title("Residuals vs Fitted")
    # 2. Q-Q Plot
    stats.probplot(residuals, dist="norm", plot=axes[0, 1])
    axes[0, 1].set_title("Normal Q-Q Plot")
    # 3. Scale-Location
    axes[1, 0].scatter(fitted, np.sqrt(np.abs(std_residuals)), alpha=0.5, s=10, color="#4CAF50")
    axes[1, 0].set_xlabel("Fitted Values")
    axes[1, 0].set_ylabel("sqrt(|Standardized Residuals|)")
    axes[1, 0].set_title("Scale-Location Plot")
    # 4. Residual Histogram
    axes[1, 1].hist(residuals, bins=30, density=True, alpha=0.7, color="#FF9800", edgecolor="white")
    x_range = np.linspace(residuals.min(), residuals.max(), 100)
    axes[1, 1].plot(x_range, stats.norm.pdf(x_range, np.mean(residuals), np.std(residuals)),
                    color="red", linewidth=2, label="Normal fit")
    axes[1, 1].set_title("Residual Distribution")
    axes[1, 1].legend()
    fig.suptitle("Residual Diagnostic Plots", fontsize=14, fontweight="bold")
    fig.tight_layout()
    fig.savefig(f"{fig_dir}/residual_diagnostics.png", dpi=150)
    plt.close(fig)
    return report
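The summary table reports a Breusch-Pagan p-value, but residual_analysis above does not compute it. A minimal sketch of the test (auxiliary regression of squared residuals on the fitted values; the function name breusch_pagan_test is an assumption, and statsmodels' het_breuschpagan is a ready-made alternative):

```python
import numpy as np
from scipy import stats

def breusch_pagan_test(residuals, fitted, alpha=0.05):
    """Breusch-Pagan LM test: regress squared residuals on the fitted values."""
    residuals = np.asarray(residuals, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    n = len(residuals)
    u2 = residuals ** 2
    # Auxiliary regression of u^2 on [1, fitted]
    X_aux = np.column_stack([np.ones(n), fitted])
    beta, *_ = np.linalg.lstsq(X_aux, u2, rcond=None)
    u2_hat = X_aux @ beta
    ss_res = float(np.sum((u2 - u2_hat) ** 2))
    ss_tot = float(np.sum((u2 - np.mean(u2)) ** 2))
    r2_aux = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    lm_stat = n * r2_aux  # LM statistic ~ chi2(k); k = 1 regressor here
    p_value = float(stats.chi2.sf(lm_stat, df=1))
    return {
        "statistic": round(float(lm_stat), 6),
        "p_value": round(p_value, 6),
        "homoscedastic": bool(p_value >= alpha),
    }
```

A p-value below HOMOSCEDASTICITY_ALPHA rejects the null of constant residual variance, matching the WARN row in the summary table.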
Input: model results, known solutions or benchmark-method results.
Output: artifacts/benchmark_comparison.json.
def benchmark_comparison(model_results: dict, benchmarks: list) -> dict:
    """Compare model results against benchmark solutions."""
    comparisons = []
    for bench in benchmarks:
        comparison = {
            "benchmark_name": bench["name"],
            "benchmark_value": bench["value"],
            "model_value": model_results.get("objective", None),
        }
        if comparison["model_value"] is not None and comparison["benchmark_value"] is not None:
            abs_diff = abs(comparison["model_value"] - comparison["benchmark_value"])
            rel_diff = abs_diff / abs(comparison["benchmark_value"]) if comparison["benchmark_value"] != 0 else float("inf")
            comparison["absolute_error"] = round(float(abs_diff), 6)
            comparison["relative_error"] = round(float(rel_diff), 6)
            comparison["improvement_pct"] = round(
                float((comparison["model_value"] - comparison["benchmark_value"]) / abs(comparison["benchmark_value"]) * 100),
                2,
            ) if comparison["benchmark_value"] != 0 else None
        comparisons.append(comparison)
    return {"comparisons": comparisons}
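The optimization branch of the validation strategy also names convergence_analysis, with no code given anywhere. A minimal sketch (the signature, tolerance, and window size are all assumptions) that checks whether the objective trajectory has flattened:

```python
import numpy as np

def convergence_analysis(objective_history, tol=1e-6, window=10):
    """Check whether the tail of the objective trajectory has flattened out."""
    hist = np.asarray(objective_history, dtype=float)
    if len(hist) < window + 1:
        return {"converged": False, "reason": "history shorter than window"}
    tail_change = np.abs(np.diff(hist[-window:]))
    return {
        "converged": bool(np.all(tail_change < tol)),
        "final_objective": float(hist[-1]),
        "max_tail_change": float(tail_change.max()),
    }
```

This only detects a flat tail, not optimality; pair it with the relaxation_bound check when a bound is available.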
Input: all statistical test results from Phases 2-4.
Output: artifacts/hypothesis_tests_summary.json.
def summarize_hypothesis_tests(test_results: list) -> dict:
    """Summarize all hypothesis tests performed during validation."""
    summary = {"tests": [], "pass_count": 0, "fail_count": 0, "warnings": []}
    for test in test_results:
        passed = test.get("p_value", 1.0) >= test.get("alpha", 0.05)
        entry = {
            "test_name": test["name"],
            "null_hypothesis": test["h0"],
            "statistic": test["statistic"],
            "p_value": test["p_value"],
            "alpha": test.get("alpha", 0.05),
            "conclusion": "fail to reject H0" if passed else "reject H0",
            "passed": passed,
        }
        summary["tests"].append(entry)
        if passed:
            summary["pass_count"] += 1
        else:
            summary["fail_count"] += 1
            if test.get("critical", False):
                summary["warnings"].append(
                    f"Critical test '{test['name']}' failed (p={test['p_value']:.4f}). "
                    f"Model assumption may be violated."
                )
    return summary
Input: summary of all validation results. Output: revision suggestions.
Run an independent review with gpt-5.4 at high reasoning effort. Keep the threadId so specific improvement directions can be followed up later. mcp__codex__codex:
model: gpt-5.4
config: {"model_reasoning_effort": "high"}
prompt: |
  You are a model-validation expert for mathematical modeling competitions. Review the following validation results.
  Model overview:
  [paste model type and basic information]
  Validation summary:
  [paste CV results, residual analysis, benchmark comparison, hypothesis tests]
  Key metrics:
  - R-squared: [value]
  - RMSE: [value]
  - Overfitting risk: [yes/no]
  - Failed hypothesis tests: [list]
  Answer each of the following:
  1. Are the current validation methods sufficient? What additional validation is needed?
  2. Are the statistical test results interpreted correctly?
  3. What are the model's main limitations, and how can they be stated honestly yet tactfully in the paper?
  4. Is the comparison against benchmark methods fair and reasonable?
  5. What concrete suggestions do you have for improving the model's credibility?
Write the final results to MODEL_VALIDATION_REPORT.md.
| Validation method | Key metric | Result | Pass/Warn |
|---|---|---|---|
| Holdout | R-squared (test) | 0.85 | PASS |
| 5-fold CV | R-squared (mean +/- std) | 0.83 +/- 0.04 | PASS |
| Residual normality | Shapiro-Wilk p-value | 0.12 | PASS |
| Residual homoscedasticity | Breusch-Pagan p-value | 0.03 | WARN |
| Residual independence | Durbin-Watson | 1.89 | PASS |
| Benchmark comparison | vs linear regression | +15% R-squared | PASS |
Output files:
| Path | Required | Content | Notes |
|---|---|---|---|
| MODEL_VALIDATION_REPORT.md | Yes | Main report | Feeds paper-write and model-review |
| figures/validation/residual_diagnostics.png | Yes | Four-panel residual diagnostics | Usable directly in the paper |
| figures/validation/cv_performance.png | Recommended | Per-fold CV performance | Shows model stability |
| figures/validation/predicted_vs_actual.png | Recommended | Predicted vs. actual values | Visualizes goodness of fit |
| figures/validation/benchmark_comparison.png | Recommended | Benchmark comparison chart | Shows model advantages |
| artifacts/validation_scope.json | Recommended | Validation plan | Machine-readable |
| artifacts/cv_results.json | Recommended | CV results | Supports re-review |
| artifacts/residual_analysis.json | Recommended | Residual analysis results | Supports re-review |
| artifacts/benchmark_comparison.json | Recommended | Benchmark comparison results | Supports re-review |
| artifacts/hypothesis_tests_summary.json | Recommended | Hypothesis test summary | Supports re-review |
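figures/validation/cv_performance.png is listed above but no plotting code for it appears earlier. A minimal sketch consistent with the kfold_validation output (the function name plot_cv_performance is an assumption):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from pathlib import Path

def plot_cv_performance(fold_r2, fig_dir="figures/validation"):
    """Bar chart of per-fold R-squared with the mean as a dashed reference line."""
    Path(fig_dir).mkdir(parents=True, exist_ok=True)
    folds = list(range(1, len(fold_r2) + 1))
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.bar(folds, fold_r2, color="#2196F3", alpha=0.8)
    ax.axhline(sum(fold_r2) / len(fold_r2), color="red", linestyle="--", linewidth=1, label="mean")
    ax.set_xticks(folds)
    ax.set_xlabel("Fold")
    ax.set_ylabel("R-squared")
    ax.set_title("Cross-Validation Performance by Fold")
    ax.legend()
    fig.tight_layout()
    out_path = f"{fig_dir}/cv_performance.png"
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
    return out_path
```

Call it with the r2 values from kfold_validation's folds, e.g. plot_cv_performance([f["r2"] for f in cv["folds"]]).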
Use a heredoc (cat << 'EOF' > file) to write in chunks. Do NOT ask the user for permission — just do it silently. Fix the random seed (DEFAULT_RANDOM_SEED) to ensure reproducibility. Record Open Questions to give direction for later improvements. model-creator
-> solve-plan
-> run-solver
-> analyze-results
-> model-validation
-> sensitivity-analysis
-> paper-write
model-validation
-> paper-figure (provides validation figures)
-> paper-write (provides material for the model-validation section)
-> model-review (judges focus on validation sufficiency)
-> sensitivity-analysis (run sensitivity analysis after validation)