Evaluates Lingliang primary school math exam AI grading results against ground truth raw scores. Computes Spearman ρ, Pearson r, MAE, RMSE, and ±5-point accuracy for 20 students.
Inputs:
- ../grading_workspace/output/lingliang/ — final_scores.json, student_01.json … student_26.json (20 grading students, excluding reference students 07, 08, 11, 17, 23)
- data/groundtruth/lingliang/groundtruth_mapping.json — total_score, max_score, score_percentage per student (NOT levels)

The full evaluation workflow follows the unified skill adapted for this subject. The complete content is reproduced below for reference.
This skill evaluates the accuracy of AI-generated grades for the Lingliang primary school math exam by comparing them against ground truth raw scores (total out of 100). It computes statistical metrics including correlation coefficients and error measures, generates visualization figures, and produces comprehensive DOCX evaluation reports.
Prerequisites: This skill requires completed grading output from the lingliang grading
skill. Grading output is accessed from ../grading_workspace/output/lingliang/. Ground
truth is local at data/groundtruth/lingliang/groundtruth_mapping.json.
- ../grading_workspace/output/lingliang/student_XX.json — Per-student grading results
- ../grading_workspace/output/lingliang/final_scores.json — Aggregated scores with statistics
- data/groundtruth/lingliang/groundtruth_mapping.json — Ground truth mapping

Ground truth is raw scores (total_score out of 100), NOT ordinal levels. Use both rank-based (Spearman) and absolute-error (MAE, RMSE) metrics.
Primary metrics include both rank-order (Spearman ρ) and error metrics (Pearson r, MAE, RMSE, ±5-point accuracy). These together measure whether the AI grading produces scores that correlate with and closely approximate ground truth scores.
No level division data — evaluate directly against raw score ground truth. There are no DSE levels (1–5) in this subject.
Model restrictions:
- Do not pass a model parameter override to sub-agents — let them inherit the main agent's model.
- Use the designated VLM (kimi-k2.5) exclusively. NO OTHER VLM MODELS ARE ALLOWED.
- No LLM/VLM API calls for generating analysis or report text. See WARNING.md.
- All commentary and analysis must be generated by sub-agents.
Paths are fixed — no year variable needed. All paths use lingliang directly
without year nesting (e.g., output/lingliang/, data/groundtruth/lingliang/).
All report and commentary text must be written in Traditional Chinese (繁體中文). This applies to all narrative, feedback, analysis, and section content in generated DOCX reports. English is only permitted for: variable names, file paths, technical identifiers, chart axis labels, and JSON field names.
lingliang-grading-workspaces/evaluation_workspace/
├── .github/skills/lingliang-evaluation/SKILL.md
├── START.md
├── WARNING.md
├── env.txt
├── env.txt.example
├── pyproject.toml
├── uv.lock
├── start.sh
├── data/ # Local data (pre-copied into workspace)
│ └── groundtruth/lingliang/groundtruth_mapping.json
├── evaluation/lingliang/ # Evaluation output (this skill generates these)
│ ├── metrics.json # Machine-readable metrics
│ ├── rubric_quality_benchmark.json # Agent-facing rubric quality benchmark
│ ├── grading_quality_audit.json # Per-student grading quality audit
│ ├── grading_consistency_audit.json # Cross-student same-question consistency audit
│ ├── grading_quality_report.md # Diagnostic markdown summary
│ ├── eval_report.docx # Class-level evaluation report
│ ├── figures/ # Evaluation charts (PNG)
│ │ ├── score_scatter.png
│ │ ├── error_distribution.png
│ │ └── per_student_table.png
│ └── students/ # Individual student evaluation reports
│ └── student_XX.docx
└── scripts/ # Python helper scripts
├── config.py # Path configuration
├── validate_extraction.py
├── validate_grading_output.py
└── validate_reports.py
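The contents of scripts/config.py are not shown in this document; the sketch below is an illustrative guess at what a pathlib-based path configuration for this layout might look like. All constant and function names are assumptions, not the real API.

```python
from pathlib import Path

# Illustrative path constants mirroring the workspace tree above;
# the real scripts/config.py may be organised differently.
GRADING_OUTPUT_DIR = Path("../grading_workspace/output/lingliang")
GROUNDTRUTH_PATH = Path("data/groundtruth/lingliang/groundtruth_mapping.json")
EVAL_DIR = Path("evaluation/lingliang")
FIGURES_DIR = EVAL_DIR / "figures"
METRICS_PATH = EVAL_DIR / "metrics.json"

# Reference students excluded from grading/evaluation (per the input description)
REFERENCE_STUDENT_IDS = {7, 8, 11, 17, 23}

def student_json_path(student_id: int) -> Path:
    """Path to one per-student grading result, e.g. student_01.json."""
    return GRADING_OUTPUT_DIR / f"student_{student_id:02d}.json"
```

Centralising paths this way keeps the validation and evaluation scripts consistent with the fixed (no year nesting) layout.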
Per-student JSON (output/lingliang/student_XX.json):
```json
{
  "student_id": 1,
  "questions": [...],
  "total_raw_score": 77,
  "total_max_score": 100,
  "percentage": 77.0
}
```
Final scores (output/lingliang/final_scores.json):
```json
{
  "subject": "lingliang",
  "total_students": 20,
  "students": [
    {"student_id": 1, "total_raw_score": 77, "percentage": 77.0}
  ]
}
```
Ground truth mapping (data/groundtruth/lingliang/groundtruth_mapping.json):
```json
{
  "mappings": [
    {
      "student_id": 1,
      "student_label": "student_01",
      "student_name": "陳喜楓",
      "filename": "student_01.pdf",
      "total_score": 77,
      "max_score": 100,
      "score_percentage": 77.0
    }
  ]
}
```
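The grading output and the ground truth mapping share student_id, so the two score sources can be joined on it. A minimal sketch, assuming both files are already parsed into dicts (the function name is illustrative):

```python
def load_pairs(groundtruth: dict, final_scores: dict) -> list[tuple[int, int, int]]:
    """Pair ground-truth and predicted total scores by student_id.

    groundtruth: parsed groundtruth_mapping.json
    final_scores: parsed final_scores.json
    Returns a list of (student_id, gt_score, predicted_score).
    """
    gt_by_id = {m["student_id"]: m["total_score"] for m in groundtruth["mappings"]}
    pairs = []
    for s in final_scores["students"]:
        sid = s["student_id"]
        if sid in gt_by_id:  # skip any student without ground truth
            pairs.append((sid, gt_by_id[sid], s["total_raw_score"]))
    return pairs
```

Joining by student_id rather than list position guards against files being ordered differently.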
Setup:
```bash
source env.txt
uv sync
```
Verify all required inputs exist:
- ../grading_workspace/output/lingliang/final_scores.json
- ../grading_workspace/output/lingliang/student_01.json etc. (20 grading students)
- data/groundtruth/lingliang/groundtruth_mapping.json

Load ground truth as student_id → total_score, score_percentage pairs and grading output as student_id → total_raw_score, percentage. Compute the following metrics:
| Metric | Description | Formula |
|---|---|---|
| Spearman ρ | Rank correlation between GT score and predicted score | `scipy.stats.spearmanr` |
| Pearson r | Linear correlation between GT score and predicted score | `scipy.stats.pearsonr` |
| MAE | Mean absolute error between GT and predicted scores | `mean(abs(predicted - gt))` |
| RMSE | Root mean squared error | `sqrt(mean((predicted - gt)²))` |
| ±5-point accuracy | % of students with `abs(predicted - gt) ≤ 5` | `count(abs(pred - gt) ≤ 5) / N × 100` |
| ±10-point accuracy | % of students with `abs(predicted - gt) ≤ 10` | `count(abs(pred - gt) ≤ 10) / N × 100` |
| Max error | Largest absolute error across all students | `max(abs(predicted - gt))` |
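The absolute-error rows of the table can be computed with the standard library alone; a minimal sketch (the correlation rows are left to `scipy.stats.spearmanr` and `scipy.stats.pearsonr`, which also return the p-values):

```python
from math import sqrt

def error_metrics(pred: list[float], gt: list[float]) -> dict:
    """MAE, RMSE, ±5/±10-point accuracy, and max error from the table above."""
    diffs = [p - g for p, g in zip(pred, gt)]
    abs_diffs = [abs(d) for d in diffs]
    n = len(diffs)
    return {
        "mae": sum(abs_diffs) / n,
        "rmse": sqrt(sum(d * d for d in diffs) / n),
        "within_5_points_pct": 100.0 * sum(a <= 5 for a in abs_diffs) / n,
        "within_10_points_pct": 100.0 * sum(a <= 10 for a in abs_diffs) / n,
        "max_error": max(abs_diffs),
    }
```

For example, `error_metrics([74, 80], [77, 80])` gives an MAE of 1.5 and a max error of 3.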
Save to evaluation/lingliang/metrics.json:
```json
{
  "subject": "lingliang",
  "total_students": 20,
  "metrics": {
    "spearman_rho": 0.85,
    "spearman_p_value": 0.001,
    "pearson_r": 0.88,
    "pearson_p_value": 0.0001,
    "mae": 4.2,
    "rmse": 5.8,
    "within_5_points_pct": 65.0,
    "within_10_points_pct": 90.0,
    "max_error": 15
  },
  "per_student": [
    {
      "student_id": 1,
      "gt_score": 77,
      "gt_percentage": 77.0,
      "predicted_score": 74,
      "predicted_percentage": 74.0,
      "score_diff": -3
    }
  ]
}
```
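Assembling and writing this file can be sketched as follows, assuming the GT/predicted totals have already been joined by student_id into (student_id, gt_score, predicted_score) tuples and the correlation/error metrics have been computed separately (the function name is illustrative):

```python
import json
from pathlib import Path

def write_metrics(pairs, metrics: dict, out_path: Path) -> dict:
    """Serialise the metrics payload to metrics.json.

    pairs: list of (student_id, gt_score, predicted_score), max score 100
    metrics: precomputed correlation and error metrics
    """
    payload = {
        "subject": "lingliang",
        "total_students": len(pairs),
        "metrics": metrics,
        "per_student": [
            {
                "student_id": sid,
                "gt_score": gt,
                "gt_percentage": float(gt),          # max_score is 100
                "predicted_score": pred,
                "predicted_percentage": float(pred),  # max_score is 100
                "score_diff": pred - gt,              # signed: negative = under-graded
            }
            for sid, gt, pred in pairs
        ],
    }
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2))
    return payload
```

Keeping score_diff signed (rather than absolute) lets later phases distinguish under- from over-grading.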
ℹ️ This phase runs after Phase 2. It acts as an independent judge to assess whether the grading output (output/lingliang/student_XX.json) contains correct marks, well-grounded reasoning, and accurate evidence — verified against the actual rubric and student answers.
You MUST NOT write Python scripts that use keyword matching, regex patterns, string length checks, or any other heuristic rules to evaluate grading quality. Such approaches cannot determine whether reasoning is actually correct or whether evidence actually matches the student's answer — they can only check if certain words appear in text, which is fundamentally insufficient and produces misleading quality scores.
All quality evaluation in this phase MUST be performed by sub-agents that read the rubric, read the student's actual answer, and judge whether the grading output is correct.
For each student, collect these required inputs:
- ../grading_workspace/rubric/lingliang/grading_guide.md — Contains model answers, marking criteria, and mark allocation per question
- ../grading_workspace/extracted/lingliang/students/student_XX.txt — The student's actual written responses (extracted text from PDF)
- ../grading_workspace/output/lingliang/student_XX.json — The AI-generated grading with awarded_marks, reasoning, and evidence per question

Also collect these optional diagnostic inputs if they exist:
- ../grading_workspace/extracted/students/page_images/student_XX/ — Useful when layout/diagram information may have been lost in text extraction
- ../grading_workspace/rubric/page_images/ — Useful when rubric/question interpretation may depend on figures or layout

Before auditing student grading outputs, launch one or more sub-agents to evaluate the quality of the agent-facing rubric representation itself — primarily ../grading_workspace/rubric/lingliang/grading_guide.md, plus related rubric/question page-image artifacts when they are needed to interpret visual marking criteria.
This benchmark is about whether the rubric is safe and reliable for downstream agents to use, not whether the official rubric document exists.
The sub-agent MUST evaluate the rubric on these 3 dimensions, scoring each on a 0–2