Evaluates Lingliang primary school math exam AI grading results against ground truth raw scores. Computes Spearman ρ, Pearson r, MAE, RMSE, and ±5-point accuracy for 20 students.
Inputs:
- ../grading_workspace/output/lingliang/ — final_scores.json, student_01.json … student_26.json (20 grading students, excluding reference students 07, 08, 11, 17, 23)
- data/groundtruth/lingliang/groundtruth_mapping.json — total_score, max_score, score_percentage per student (NOT levels)

The full evaluation workflow follows the unified skill adapted for this subject. The complete content is reproduced below for reference.
This skill evaluates the accuracy of AI-generated grades for the Lingliang primary school math exam by comparing them against ground truth raw scores (total out of 100). It computes statistical metrics including correlation coefficients and error measures, generates visualization figures, and produces comprehensive DOCX evaluation reports.
Prerequisites: This skill requires completed grading output from the lingliang grading
skill. Grading output is accessed from ../grading_workspace/output/lingliang/. Ground
truth is local at data/groundtruth/lingliang/groundtruth_mapping.json.
- ../grading_workspace/output/lingliang/student_XX.json — Per-student grading results
- ../grading_workspace/output/lingliang/final_scores.json — Aggregated scores with statistics
- data/groundtruth/lingliang/groundtruth_mapping.json — Ground truth mapping

Ground truth is raw scores (total_score out of 100), NOT ordinal levels. Use both rank-based (Spearman) and absolute-error (MAE, RMSE) metrics.
Primary metrics include both rank-order (Spearman ρ) and error metrics (Pearson r, MAE, RMSE, ±5-point accuracy). These together measure whether the AI grading produces scores that correlate with and closely approximate ground truth scores.
No level division data — evaluate directly against raw score ground truth. There are no DSE levels (1–5) in this subject.
Model restrictions:
- Do not pass a model parameter override to sub-agents — let them inherit the main agent's model.
- Use the designated VLM (kimi-k2.5) exclusively. NO OTHER VLM MODELS ARE ALLOWED.
- No LLM/VLM API calls for generating analysis or report text. See WARNING.md.
- All commentary and analysis must be generated by sub-agents.
Paths are fixed — no year variable needed. All paths use lingliang directly
without year nesting (e.g., output/lingliang/, data/groundtruth/lingliang/).
All report and commentary text must be written in Traditional Chinese (繁體中文). This applies to all narrative, feedback, analysis, and section content in generated DOCX reports. English is only permitted for: variable names, file paths, technical identifiers, chart axis labels, and JSON field names.
lingliang-grading-workspaces/evaluation_workspace/
├── .github/skills/lingliang-evaluation/SKILL.md
├── START.md
├── WARNING.md
├── env.txt
├── env.txt.example
├── pyproject.toml
├── uv.lock
├── start.sh
├── data/ # Local data (pre-copied into workspace)
│ └── groundtruth/lingliang/groundtruth_mapping.json
├── evaluation/lingliang/ # Evaluation output (this skill generates these)
│ ├── metrics.json # Machine-readable metrics
│ ├── rubric_quality_benchmark.json # Agent-facing rubric quality benchmark
│ ├── grading_quality_audit.json # Per-student grading quality audit
│ ├── grading_consistency_audit.json # Cross-student same-question consistency audit
│ ├── grading_quality_report.md # Diagnostic markdown summary
│ ├── eval_report.docx # Class-level evaluation report
│ ├── figures/ # Evaluation charts (PNG)
│ │ ├── score_scatter.png
│ │ ├── error_distribution.png
│ │ └── per_student_table.png
│ └── students/ # Individual student evaluation reports
│ └── student_XX.docx
└── scripts/ # Python helper scripts
├── config.py # Path configuration
├── validate_extraction.py
├── validate_grading_output.py
└── validate_reports.py
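The contents of scripts/config.py are not shown in this document; the sketch below is an illustrative guess at what a pathlib-based path configuration for this layout might look like. All constant and function names are assumptions, not the real API.

```python
from pathlib import Path

# Illustrative path constants mirroring the workspace tree above;
# the real scripts/config.py may be organised differently.
GRADING_OUTPUT_DIR = Path("../grading_workspace/output/lingliang")
GROUNDTRUTH_PATH = Path("data/groundtruth/lingliang/groundtruth_mapping.json")
EVAL_DIR = Path("evaluation/lingliang")
FIGURES_DIR = EVAL_DIR / "figures"
METRICS_PATH = EVAL_DIR / "metrics.json"

# Reference students excluded from grading/evaluation (per the input description)
REFERENCE_STUDENT_IDS = {7, 8, 11, 17, 23}

def student_json_path(student_id: int) -> Path:
    """Path to one per-student grading result, e.g. student_01.json."""
    return GRADING_OUTPUT_DIR / f"student_{student_id:02d}.json"
```

Centralising paths this way keeps the validation and evaluation scripts consistent with the fixed (no year nesting) layout.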
Per-student JSON (output/lingliang/student_XX.json):
```json
{
  "student_id": 1,
  "questions": [...],
  "total_raw_score": 77,
  "total_max_score": 100,
  "percentage": 77.0
}
```
Final scores (output/lingliang/final_scores.json):
```json
{
  "subject": "lingliang",
  "total_students": 20,
  "students": [
    {"student_id": 1, "total_raw_score": 77, "percentage": 77.0}
  ]
}
```
Ground truth mapping (data/groundtruth/lingliang/groundtruth_mapping.json):
```json
{
  "mappings": [
    {
      "student_id": 1,
      "student_label": "student_01",
      "student_name": "陳喜楓",
      "filename": "student_01.pdf",
      "total_score": 77,
      "max_score": 100,
      "score_percentage": 77.0
    }
  ]
}
```
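The grading output and the ground truth mapping share student_id, so the two score sources can be joined on it. A minimal sketch, assuming both files are already parsed into dicts (the function name is illustrative):

```python
def load_pairs(groundtruth: dict, final_scores: dict) -> list[tuple[int, int, int]]:
    """Pair ground-truth and predicted total scores by student_id.

    groundtruth: parsed groundtruth_mapping.json
    final_scores: parsed final_scores.json
    Returns a list of (student_id, gt_score, predicted_score).
    """
    gt_by_id = {m["student_id"]: m["total_score"] for m in groundtruth["mappings"]}
    pairs = []
    for s in final_scores["students"]:
        sid = s["student_id"]
        if sid in gt_by_id:  # skip any student without ground truth
            pairs.append((sid, gt_by_id[sid], s["total_raw_score"]))
    return pairs
```

Joining by student_id rather than list position guards against files being ordered differently.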
Setup:
```bash
source env.txt
uv sync
```
Verify all required inputs exist:
- ../grading_workspace/output/lingliang/final_scores.json
- ../grading_workspace/output/lingliang/student_01.json etc. (20 grading students)
- data/groundtruth/lingliang/groundtruth_mapping.json

Load ground truth as student_id → total_score, score_percentage pairs and grading output as student_id → total_raw_score, percentage. Compute the following metrics:
| Metric | Description | Formula |
|---|---|---|
| Spearman ρ | Rank correlation between GT score and predicted score | `scipy.stats.spearmanr` |
| Pearson r | Linear correlation between GT score and predicted score | `scipy.stats.pearsonr` |
| MAE | Mean absolute error between GT and predicted scores | `mean(abs(predicted - gt))` |
| RMSE | Root mean squared error | `sqrt(mean((predicted - gt)²))` |
| ±5-point accuracy | % of students with `abs(predicted - gt) ≤ 5` | `count(abs(pred - gt) ≤ 5) / N × 100` |
| ±10-point accuracy | % of students with `abs(predicted - gt) ≤ 10` | `count(abs(pred - gt) ≤ 10) / N × 100` |
| Max error | Largest absolute error across all students | `max(abs(predicted - gt))` |
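The absolute-error rows of the table can be computed with the standard library alone; a minimal sketch (the correlation rows are left to `scipy.stats.spearmanr` and `scipy.stats.pearsonr`, which also return the p-values):

```python
from math import sqrt

def error_metrics(pred: list[float], gt: list[float]) -> dict:
    """MAE, RMSE, ±5/±10-point accuracy, and max error from the table above."""
    diffs = [p - g for p, g in zip(pred, gt)]
    abs_diffs = [abs(d) for d in diffs]
    n = len(diffs)
    return {
        "mae": sum(abs_diffs) / n,
        "rmse": sqrt(sum(d * d for d in diffs) / n),
        "within_5_points_pct": 100.0 * sum(a <= 5 for a in abs_diffs) / n,
        "within_10_points_pct": 100.0 * sum(a <= 10 for a in abs_diffs) / n,
        "max_error": max(abs_diffs),
    }
```

For example, `error_metrics([74, 80], [77, 80])` gives an MAE of 1.5 and a max error of 3.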
Save to evaluation/lingliang/metrics.json:
```json
{
  "subject": "lingliang",
  "total_students": 20,
  "metrics": {
    "spearman_rho": 0.85,
    "spearman_p_value": 0.001,
    "pearson_r": 0.88,
    "pearson_p_value": 0.0001,
    "mae": 4.2,
    "rmse": 5.8,
    "within_5_points_pct": 65.0,
    "within_10_points_pct": 90.0,
    "max_error": 15
  },
  "per_student": [
    {
      "student_id": 1,
      "gt_score": 77,
      "gt_percentage": 77.0,
      "predicted_score": 74,
      "predicted_percentage": 74.0,
      "score_diff": -3
    }
  ]
}
```
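Assembling and writing this file can be sketched as follows, assuming the GT/predicted totals have already been joined by student_id into (student_id, gt_score, predicted_score) tuples and the correlation/error metrics have been computed separately (the function name is illustrative):

```python
import json
from pathlib import Path

def write_metrics(pairs, metrics: dict, out_path: Path) -> dict:
    """Serialise the metrics payload to metrics.json.

    pairs: list of (student_id, gt_score, predicted_score), max score 100
    metrics: precomputed correlation and error metrics
    """
    payload = {
        "subject": "lingliang",
        "total_students": len(pairs),
        "metrics": metrics,
        "per_student": [
            {
                "student_id": sid,
                "gt_score": gt,
                "gt_percentage": float(gt),          # max_score is 100
                "predicted_score": pred,
                "predicted_percentage": float(pred),  # max_score is 100
                "score_diff": pred - gt,              # signed: negative = under-graded
            }
            for sid, gt, pred in pairs
        ],
    }
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2))
    return payload
```

Keeping score_diff signed (rather than absolute) lets later phases distinguish under- from over-grading.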
ℹ️ This phase runs after Phase 2. It acts as an independent judge to assess whether the grading output (output/lingliang/student_XX.json) contains correct marks, well-grounded reasoning, and accurate evidence — verified against the actual rubric and student answers.
You MUST NOT write Python scripts that use keyword matching, regex patterns, string length checks, or any other heuristic rules to evaluate grading quality. Such approaches cannot determine whether reasoning is actually correct or whether evidence actually matches the student's answer — they can only check if certain words appear in text, which is fundamentally insufficient and produces misleading quality scores.
All quality evaluation in this phase MUST be performed by sub-agents that read the rubric, read the student's actual answer, and judge whether the grading output is correct.
For each student, collect these required inputs:
- ../grading_workspace/rubric/lingliang/grading_guide.md — Contains model answers, marking criteria, and mark allocation per question
- ../grading_workspace/extracted/lingliang/students/student_XX.txt — The student's actual written responses (extracted text from PDF)
- ../grading_workspace/output/lingliang/student_XX.json — The AI-generated grading with awarded_marks, reasoning, and evidence per question

Also collect these optional diagnostic inputs if they exist:
- ../grading_workspace/extracted/students/page_images/student_XX/ — Useful when layout/diagram information may have been lost in text extraction
- ../grading_workspace/rubric/page_images/ — Useful when rubric/question interpretation may depend on figures or layout

Before auditing student grading outputs, launch one or more sub-agents to evaluate the quality of the agent-facing rubric representation itself — primarily ../grading_workspace/rubric/lingliang/grading_guide.md, plus related rubric/question page-image artifacts when they are needed to interpret visual marking criteria.
This benchmark is about whether the rubric is safe and reliable for downstream agents to use, not whether the official rubric document exists.
The sub-agent MUST evaluate the rubric on these 3 dimensions, scoring each on a 0–2