Grades Lingliang primary school math exam papers. Handles per-question rubric extraction from 37 individual rubric files, reference calibration from 5 students with known scores, per-student grading of 20 students with per-question mark allocation, and score compilation. No level division (raw scores only).
Data is symlinked into this workspace under data/.
Masked data: data/masked_data/lingliang/
Reference data: data/reference/lingliang/
Masked data path: data/masked_data/lingliang/
- student_answers/student_01.pdf … student_26.pdf (20 grading students; excludes reference students 07, 08, 11, 17, 23)
- rubrics/rubrics_q-{N}.pdf and rubrics/rubrics_q-{N}.jpg (per-question, 37 sets)
- question/q-{N}.txt (extracted text) and question/q-{N}.jpg (question image) for each question
- question_manifest.csv

Reference data path: data/reference/lingliang/
- student_answers/student_07.pdf, student_08.pdf, student_11.pdf, student_17.pdf, student_23.pdf
- reference_mapping.json — contains total_score, max_score, score_percentage per reference student (NOT DSE levels)

Answer page mapping: Use answer_page_map.json (in source data at data/z_test_data/lingliang_prework/answer_page_map.json) to map each question number to the page in each student's 6-page PDF.
Grading notes:
The full grading workflow follows the unified skill adapted for this subject. The complete content is reproduced below for reference.
This skill grades Lingliang primary school math exam papers by extracting rubrics from per-question PDF/JPG pairs, calibrating against reference data with known scores, then grading each student's scanned handwritten answer PDF on a per-question basis. It produces per-student JSON results and compiles them into final raw scores.
Subject: Lingliang (小學數學考試) — primary school math, 37 questions, 100 total marks, 25 students (5 reference + 20 grading targets).
Key differences from HKDSE subject grading:
- question_manifest.csv drives rubric iteration
- answer_page_map.json maps questions to pages in each student's 6-page PDF
- Sub-agents are mandatory. Each student MUST be graded by a dedicated sub-agent. No batch grading of multiple students in a single sub-agent call.
Model restrictions:
- Do not pass a model parameter override to sub-agents — let them inherit the main agent's model.
- Use gemini-3.1-flash-lite-preview via the Gemini route exclusively. NO OTHER VLM MODELS OR ROUTES ARE ALLOWED.
- Gemini only (for both added chains): the extraction skill and model-service skill must both use Gemini for API calls. Read GEMINI_BASE_URL and VERTEX_API_KEY from env.txt. Model: gemini-3.1-flash-lite-preview.
- model-service skill: run a config-based route sanity check that confirms route=gemini is callable and non-gemini routes are ignored.

PDF rendering DPI: When converting PDF pages to images (e.g., for VLM extraction), default to 150 DPI to ensure text is legible. If processing speed is a concern, lowering DPI is acceptable as a trade-off, but do not go below 100 DPI.
Image preservation is mandatory when flagged by extraction. When a page's
meaning depends on layout or visuals — e.g., geometric figures, handwritten
mathematical notation with spatial arrangement (fractions, long division layouts),
graphs, tables with spatial meaning, arrows/labels, or annotated drawings —
the extraction API output must explicitly signal this (via
needs_visual_reference + visual_reference_reason). The agent MUST inspect those
returned hints and export the whole page image as a PNG artifact for every flagged
page. This applies to both rubric/question pages AND student answer pages. Do not
crop at this stage.
No LLM/VLM API calls for generating feedback or analysis text. VLM is permitted only for tasks requiring genuine visual inspection — text extraction from scanned PDFs where PyMuPDF fails, and interpreting handwritten mathematical notation (fractions, division symbols, etc.). VLM must NOT be used for any dimension assessable from extracted text.
Chained skills: the extraction skill and the model-service skill (YAML-only, no .py edits). Configure the model (gemini-3.1-flash-lite-preview) and read URL/API key values from env.txt; missing values mean that provider/API is unsupported in the current environment.

Use this grading skill as the orchestrator and explicitly chain the migrated skills:
- Extraction (extraction skill)
- Model routing (model-service skill, pinned to gemini-3.1-flash-lite-preview)

extraction and model-service are config-driven execution skills; do not modify their underlying Python scripts inside this grading workflow.

Workspace layout:

lingliang-grading-workspaces/grading_workspace/
├── .github/skills/lingliang-grading/SKILL.md
├── START.md
├── WARNING.md
├── env.txt # Local credentials (never commit)
├── env.txt.example # Template
├── pyproject.toml
├── uv.lock
├── start.sh
├── data/ # Local data (symlinked into workspace)
│ ├── reference/lingliang/
│ │ ├── student_answers/student_07.pdf, student_08.pdf, student_11.pdf, student_17.pdf, student_23.pdf
│ │ └── reference_mapping.json
│ ├── masked_data/lingliang/
│ │ ├── student_answers/student_01.pdf … student_26.pdf (excludes 07, 08, 11, 17, 23)
│ │ ├── rubrics/rubrics_q-{N}.pdf + rubrics_q-{N}.jpg (37 per-question pairs)
│ │ ├── question/q-{N}.txt + q-{N}.jpg (37 per-question materials)
│ │ └── question_manifest.csv
│ └── z_test_data/lingliang_prework/
│ └── answer_page_map.json
├── rubric/ # Grading artifacts (no year nesting)
│ ├── grading_guide.md
│ ├── reference_calibration.md
│ ├── page_images/ # Full-page PNG artifacts for rubric/question pages (MANDATORY when visual content detected)
│ │ ├── rubrics/page_{P}.png # Rubric pages with diagrams, reference answer figures
│ │ └── question/page_{P}.png # Question pages with diagrams, graphs, geometric figures
│ ├── reference_scores.json
│ └── calibration/ # Intermediate calibration artifacts
│ └── (per-student reference grading outputs, draft rubrics, score calculations)
├── rubric/reference_data_analysis/ # Insights from reference data (Phase 3)
│ ├── score_correlation_analysis.md # Observed patterns across score ranges
│ ├── rubric_gaps.md # Gaps/ambiguities in rubric
│ └── rubric_refinements.md # Supplementary rubric guidance
├── scripts/ # Python helper scripts
│ ├── generate_class_report.py
│ ├── generate_student_reports.py
│ ├── validate_extraction.py
│ ├── validate_grading_output.py
│ └── validate_reports.py
├── extracted/ # Extracted student text (no year nesting)
│ └── students/
│ ├── student_{NN}.txt
│ └── page_images/student_{NN}/page_{P}.png # Visual-page artifacts (MANDATORY when [IMAGE_DATA] present)
└── output/ # Grading output (no year nesting)
├── student_{NN}.json
└── final_scores.json
source env.txt
uv sync
No GRADING_YEAR env var is needed — paths are fixed with no year-based structure.
Confirm the following exist:
data/masked_data/lingliang/student_answers/student_01.pdf … student_26.pdf (20 grading students, excluding 07, 08, 11, 17, 23)data/masked_data/lingliang/rubrics/rubrics_q-1.pdf + rubrics_q-1.jpg … through all 37 questionsdata/masked_data/lingliang/question/q-1.txt + q-1.jpg … through all 37 questionsdata/masked_data/lingliang/question_manifest.csvdata/z_test_data/lingliang_prework/answer_page_map.jsondata/reference/lingliang/reference_mapping.jsondata/reference/lingliang/student_answers/student_07.pdf, student_08.pdf, student_11.pdf, student_17.pdf, student_23.pdfRead BATCH_SIZE from env.txt (default: 5) for parallel sub-agent control.
Read data/masked_data/lingliang/question_manifest.csv to get the full list of all 37
questions. The CSV contains columns:
- question — question identifier (e.g., "1", "3.a", "9.decimal", "15.b")
- question_page — page number in the student answer PDF
- rubric_page — page in the rubric PDF
- question_text — the question text
- question_text_path — path to the extracted question text file (e.g., question/q-1.txt)
- question_image_path — path to the question image (e.g., question/q-1.jpg)
- rubric_pdf_path — path to the rubric PDF (e.g., rubrics/rubrics_q-1.pdf)
- rubric_image_path — path to the rubric image (e.g., rubrics/rubrics_q-1.jpg)

⛔ Do NOT look for or read any gold_csv_path or *-gold.csv files during Phase 2. Those files contain ALL students' ground truth scores and must not be used for rubric extraction. Phase 2 uses ONLY rubric_pdf_path, rubric_image_path, question_text_path, and question_image_path.
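Loading the manifest and failing fast on missing files (as the pre-flight checks later require) can be sketched as below. The helper name load_manifest and the assumption that manifest paths are relative to the manifest's own directory are illustrative, not part of the workflow spec:

```python
import csv
from pathlib import Path

def load_manifest(path: str = "data/masked_data/lingliang/question_manifest.csv") -> list[dict]:
    """Read all question rows; raise with a clear listing if referenced files are missing."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    base = Path(path).parent  # assumption: paths are relative to the manifest directory
    missing = []
    for row in rows:
        for col in ("question_text_path", "question_image_path",
                    "rubric_pdf_path", "rubric_image_path"):
            rel = row.get(col, "")
            if rel and not (base / rel).exists():
                missing.append(f"{row['question']}: {rel}")
    if missing:
        raise FileNotFoundError("Manifest references missing files:\n" + "\n".join(missing))
    return rows
```

Note that gold_csv_path is deliberately excluded from the checked columns — Phase 2 must never touch gold files.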
For EACH of the 37 question rows in the manifest, extract the rubric by:
- rubric_pdf_path (e.g., rubrics/rubrics_q-1.pdf) — this is the primary source for marking criteria and acceptable answers
- rubric_image_path (e.g., rubrics/rubrics_q-1.jpg) — use this as a visual fallback or cross-reference if the PDF extraction is incomplete, or if the rubric contains handwritten annotations, diagrams, or mathematical notation that PyMuPDF cannot render
- question_text_path (e.g., question/q-1.txt) — this provides the extracted question text for context
- question_image_path (e.g., question/q-1.jpg) — use this to understand the visual layout of the question, especially for questions involving diagrams, graphs, tables, or geometric figures

For each question, extract and record:
Use PyMuPDF (fitz) for text extraction from rubric PDFs. Fall back to VLM only if
text extraction yields empty or garbled content (common for scanned rubrics with
handwritten annotations or mathematical notation).
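The empty-or-garbled fallback decision could be sketched with a simple printable-character heuristic; the 0.5 threshold and the helper name are illustrative assumptions, not values from this workflow:

```python
def needs_vlm_fallback(text: str) -> bool:
    """True when PyMuPDF output is empty or mostly non-printable (likely a scanned page)."""
    stripped = text.strip()
    if not stripped:
        return True  # no embedded text layer at all
    printable = sum(ch.isprintable() and not ch.isspace() for ch in stripped)
    # Assumption: below 50% printable characters, treat the extraction as garbled
    # and fall back to VLM on the JPG version of the same rubric.
    return printable / len(stripped) < 0.5
```

In practice the input would come from fitz (PyMuPDF) via page.get_text() on the rubric PDF.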
Chain to extraction skill: perform this phase through the extraction skill's
YAML-driven pipeline. Since there are 37 individual rubric files (not one large PDF),
configure the extraction to process each rubric file independently. Force Gemini
routing (model_routes -> gemini) with env-backed provider config only
(GEMINI_BASE_URL, VERTEX_API_KEY).
Note on rubric extraction: Each rubric file is a single page covering one question. The JPG version is a rendered image of the same content — use it when PDF text extraction fails or when mathematical notation needs visual interpretation.
Processing strategy for 37 rubrics: Process rubric files in batches (e.g., 5–10 at a time) to balance parallelism with resource constraints. For each batch:
Create rubric/grading_guide.md with:
- the question text (from question_text_path) for context

Grading guide structure: Organize the guide by question number in ascending order. For each question, use the following template:
### Question {N} ({max_marks} marks)
**Question text:** {from question_text_path}
**Marking criteria:**
- Criterion 1: {description} — {marks} mark(s)
- Criterion 2: {description} — {marks} mark(s)
- ...
**Acceptable answers:** {list of acceptable forms}
**Common errors:** {known incorrect approaches that should NOT earn marks}
**Special notes:** {any rubric-specific instructions}
For sub-questions (e.g., 3.a, 3.b, 9.decimal, 9.fraction), nest them within their parent question section with clear sub-headings.
Read data/reference/lingliang/reference_mapping.json to get the mapping of student
IDs to known scores.
Format:
{
"mappings": [
{"student_id": 7, "filename": "student_07.pdf", "total_score": 100, "max_score": 100, "score_percentage": 100.0},
{"student_id": 8, "filename": "student_08.pdf", "total_score": 74, "max_score": 100, "score_percentage": 74.0},
{"student_id": 11, "filename": "student_11.pdf", "total_score": 56, "max_score": 100, "score_percentage": 56.0},
{"student_id": 17, "filename": "student_17.pdf", "total_score": 87, "max_score": 100, "score_percentage": 87.0},
{"student_id": 23, "filename": "student_23.pdf", "total_score": 30, "max_score": 100, "score_percentage": 30.0}
]
}
Note: Reference files use student_{NN}.pdf naming (same format as grading students).
The known ground-truth scores are: student_07 (100/100), student_08 (74/100),
student_11 (56/100), student_17 (87/100), student_23 (30/100). These span a wide range
from 30% to 100%, providing good calibration anchors.
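Loading the mapping shown above might look like this (the helper name is illustrative):

```python
import json
from pathlib import Path

def load_reference_mapping(path: str = "data/reference/lingliang/reference_mapping.json") -> dict[int, dict]:
    """Index the reference students by student_id for quick score lookup."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return {m["student_id"]: m for m in data["mappings"]}
```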
Extract text from ALL reference student answer PDFs in
data/reference/lingliang/student_answers/.
Chain to extraction skill: use extraction pipeline in student-PDF mode
(pdf_dir + optional students) to produce structured extraction outputs that
feed reference calibration. Force Gemini routing for this chain.
Note: These are scanned handwritten math answers (6 pages each). VLM extraction is likely necessary for mathematical notation (fractions, division, etc.).
The rubrics for reference grading are the SAME as for grading — use the per-question
rubric files already extracted in Phase 2 (from data/masked_data/lingliang/rubrics/).
There is no separate reference-year rubric directory.
Grade ALL 5 reference students using the same rubric and grading guide (from Phase 2).
For each reference student, apply the per-question mark allocation to produce a total
raw score and percentage — exactly as you would for a masked student. Record these
scores alongside their known total scores from reference_mapping.json.
This step is critical: the resulting scores provide empirical calibration — verifying that the rubric as interpreted produces scores consistent with known ground-truth scores. Launch sub-agents for reference students following the same rules as Phase 5 (one sub-agent per student, same model restriction).
Use answer_page_map.json to determine which page(s) of each student's 6-page PDF
contain the answer for each question. This is essential for accurate per-question grading
of scanned handwritten answers.
Save reference grading results to rubric/reference_scores.json:
{
"scores": [
{"student_id": 7, "total_score": 100, "ai_total_raw_score": 98, "ai_total_max_score": 100, "ai_percentage": 98.0, "score_percentage": 100.0},
{"student_id": 8, "total_score": 74, "ai_total_raw_score": 71, "ai_total_max_score": 100, "ai_percentage": 71.0, "score_percentage": 74.0},
{"student_id": 11, "total_score": 56, "ai_total_raw_score": 54, "ai_total_max_score": 100, "ai_percentage": 54.0, "score_percentage": 56.0},
{"student_id": 17, "total_score": 87, "ai_total_raw_score": 85, "ai_total_max_score": 100, "ai_percentage": 85.0, "score_percentage": 87.0},
{"student_id": 23, "total_score": 30, "ai_total_raw_score": 32, "ai_total_max_score": 100, "ai_percentage": 32.0, "score_percentage": 30.0}
]
}
The rubric must not be finalised from the official PDF alone. The agent must learn from the reference student data to derive an effective grading rubric — one that actually produces scores consistent with known ground-truth scores. This is an iterative loop that continues until the rubric reliably reproduces expected scores.
This is especially important for primary school math where handwritten answers may use non-standard notation, alternative solution methods, or ambiguous layouts. The rubric must account for these realities — reference data analysis reveals what students actually write and how it maps to marks.
All analytical outputs from this step are saved to rubric/reference_data_analysis/.
Iterative loop:
1. Write rubric/reference_data_analysis/score_correlation_analysis.md documenting
the correlation between AI scores and known scores
2. Where AI scores diverge from known scores:
a. Identify gaps in rubric/reference_data_analysis/rubric_gaps.md — gaps or
ambiguities found in the rubric when applied to reference students
(e.g., criteria that fail to distinguish partial credit correctly, missing guidance
for common student errors, mark schemes that reward or penalise inconsistently,
unclear treatment of alternative solution methods)
b. Derive refinements in rubric/reference_data_analysis/rubric_refinements.md
— supplementary rubric interpretations and practical guidance that resolve the
gaps above (e.g., clarifying what constitutes a "correct method" for partial credit,
specifying how to handle equivalent mathematical expressions, defining acceptable
rounding behavior, handling crossed-out work vs. final answers)
c. Revise rubric/grading_guide.md to incorporate these
refinements — the grading guide is not simply extracted from the official rubric
PDFs; it must incorporate insights from the reference data analysis
d. Re-grade affected reference students with the improved rubric
e. Repeat from step 2 until the discrimination check passes

rubric/grading_guide.md must incorporate insights from
rubric/reference_data_analysis/ — it is not simply a transcription of the
official rubric PDFs.

Discrimination check criteria — verify ALL of the following:
Only proceed to Phase 4 (actual student grading) once the discrimination check
passes. Document the outcome in rubric/reference_calibration.md.
Temp file organization: All intermediate artifacts from calibration — including
reference student grading outputs, draft rubrics, and intermediate score calculations
— must be saved under rubric/calibration/ (a dedicated subdirectory).
Final reference artifacts (reference_scores.json) are saved under rubric/.
Final grading artifacts (grading_guide.md, reference_calibration.md) are saved
under rubric/.
Using the reference scores from Step 3.4, verify that the AI grading produces scores that correlate with known ground-truth scores. Specifically:
These correlations validate that the rubric interpretation is reasonable. If the correlation is poor, return to Step 3.4a for further rubric refinement.
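One way to quantify this correlation is a hand-rolled Pearson coefficient over rubric/reference_scores.json; the 0.9 pass threshold below is an illustrative placeholder, not a value from this workflow:

```python
import json
from math import sqrt
from pathlib import Path

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between AI scores and known ground-truth scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def calibration_check(path: str = "rubric/reference_scores.json",
                      threshold: float = 0.9) -> bool:
    """True when AI totals track the known totals closely enough to proceed."""
    scores = json.loads(Path(path).read_text(encoding="utf-8"))["scores"]
    ai = [s["ai_total_raw_score"] for s in scores]
    gt = [s["total_score"] for s in scores]
    return pearson(ai, gt) >= threshold
```

With only 5 reference points, correlation is a coarse signal — it complements, not replaces, the per-question discrepancy analysis.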
Create rubric/reference_calibration.md with TWO sections:
Section 1 — Qualitative Calibration: Summarise observations from grading the reference students across the score spectrum:
Focus on the main question types for this exam (arithmetic, fractions, word problems, geometry). Include specific examples of how reference students at different score levels approached key questions.
Section 2 — Quantitative Score Correlation: Include the empirical AI vs. ground-truth comparison from Step 3.4:
This calibration data provides confidence that the rubric interpretation is sound before grading the 20 target students.
CRITICAL: Student answer extraction MUST be done 1 page at a time to ensure extraction quality. Do NOT batch multiple pages into a single extraction call.
For each of the 20 grading students in data/masked_data/lingliang/student_answers/:
- extract each page and save the combined text to extracted/students/student_{NN}.txt

Note: Student PDFs are scanned handwritten math work (6 pages each). VLM extraction
is very likely needed for most or all pages, since PyMuPDF typically cannot extract
handwritten text. Use answer_page_map.json during grading (Phase 5) to locate which
page contains the answer for each question.
Handwritten math extraction challenges:
- … / symbol — VLM must interpret these correctly

Chain to extraction skill: this phase must be implemented via the extraction skill
with request_mode=page_by_page (or page_by_page_with_prev only when boundary context
is necessary) to enforce the one-page extraction quality requirement, and route via
Gemini only.
Note: This 1-page-at-a-time rule applies to student answer PDFs. For rubric/criteria PDFs (Phase 2), multi-page extraction is acceptable since those are typeset documents with cleaner formatting.
After extraction, inspect the extraction results for visual-content hints. If any
student's extraction output includes needs_visual_reference: true or [IMAGE_DATA]
placeholders, export the corresponding full-page PNG images from the student's
PDF to extracted/students/page_images/student_{NN}/page_{P}.png.
This is especially important for handwritten math answers containing:
Use PyMuPDF at 150 DPI (or the DPI set in Rule 4) to render each flagged page. These page images allow Phase 5 grading sub-agents to view the original handwritten content when the extracted text is insufficient.
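A minimal sketch of the flagged-page export, assuming PyMuPDF (fitz) is installed; in PyMuPDF, zoom 1.0 corresponds to 72 DPI, so the conversion is dpi / 72. The helper names are illustrative:

```python
def dpi_to_zoom(dpi: int = 150) -> float:
    """PyMuPDF renders at 72 DPI for zoom 1.0, so zoom = dpi / 72."""
    if dpi < 100:
        raise ValueError("Rule 4: do not render below 100 DPI")
    return dpi / 72.0

def export_flagged_page(pdf_path: str, page_no: int, out_png: str, dpi: int = 150) -> None:
    """Render one 1-based page of a student PDF to a PNG artifact."""
    import fitz  # PyMuPDF
    zoom = dpi_to_zoom(dpi)
    with fitz.open(pdf_path) as doc:
        pix = doc[page_no - 1].get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        pix.save(out_png)
```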
Confirm all 20 student text files exist and have non-trivial content. For each student:
Write a script scripts/validate_extraction.py that:
- checks that all 20 extracted/students/student_{NN}.txt files exist and have non-trivial content
- if extracted/students/page_images/student_{NN}/ exists, verifies it contains at least one .png file and reports the page-image count
- if a .txt file contains [IMAGE_DATA], verifies that extracted/students/page_images/student_{NN}/ exists AND contains at least one .png file; fails validation if [IMAGE_DATA] is present but no page images exist — this means Step 4.1a was skipped
- if [VLM_DESCRIPTION: tags exist, reports their count (supplementary, not blocking)

Run the script and confirm all students pass before proceeding to Phase 5. If any fail, re-run extraction for those students.
For each of the 20 grading students, launch a dedicated sub-agent that:
Reads rubric/grading_guide.md
Reads rubric/reference_calibration.md (both qualitative and
quantitative sections — including score correlation data)
Calibration note: The reference_calibration.md data validates that the rubric interpretation is consistent. Use it as a sanity check — if a student's score seems unusually high or low, verify the per-question grading is correct against the rubric.
Reads the student's extracted text (extracted/students/student_{NN}.txt)
Reads answer_page_map.json to know which page(s) contain the answer for each question
Grades EVERY question (all 37) against the rubric — focus on accurate per-question mark allocation, applying the rubric criteria precisely
Computes total score and percentage from per-question marks
Outputs a JSON file to output/student_{NN}.json
Handling visual content with [IMAGE_DATA] (MANDATORY — direct image viewing):
When the student's extracted text contains [IMAGE_DATA] for any question, the
sub-agent MUST:
- Use the view tool on the PNG file at extracted/students/page_images/student_{NN}/page_{P}.png
- View the question image (data/masked_data/lingliang/question/q-{N}.jpg) to understand the question's visual context (diagrams, geometric figures, graphs)
- View the rubric image (data/masked_data/lingliang/rubrics/rubrics_q-{N}.jpg) to understand what the correct visual answer looks like
- Treat [VLM_DESCRIPTION] text as supplementary context, but do NOT rely on it as the sole basis for grading visual content
- Record what was observed in the evidence field (e.g., "student drew correct geometric construction with compass marks visible")

IMPORTANT: Sub-agents must NOT use hard score restrictions to constrain their grading. The correct approach is: grade each question purely on rubric merit → sum the scores → report the total. Do NOT adjust per-question marks to force a particular total outcome.
Math-specific grading guidance for sub-agents:
- When handwriting is ambiguous, note the ambiguity in the evidence field. Grade based on
the most reasonable interpretation, but flag it for review.

Control parallelism with BATCH_SIZE — launch up to BATCH_SIZE sub-agents at a time.
Each output/student_{NN}.json MUST follow this schema:
{
"student_id": "NN",
"questions": [
{
"question_id": "1",
"max_marks": 2,
"awarded_marks": 2.0,
"reasoning": "Step-by-step analysis citing rubric criteria: (1) [Rubric criterion] — student answer satisfies this because [specific part of answer]. Therefore 2/2 marks awarded.",
"evidence": "Direct quote from student answer: '[verbatim student text]'."
},
{
"question_id": "3.a",
"max_marks": 3,
"awarded_marks": 2.0,
"reasoning": "Step-by-step analysis: (1) Correct method used — 2 marks; (2) Final answer has arithmetic error — 0 marks for answer mark. Therefore 2/3 marks awarded.",
"evidence": "Student work: '[verbatim calculation steps]'. Final answer: '[student's incorrect answer]' vs. expected '[correct answer]'."
}
],
"total_raw_score": 78,
"total_max_score": 100,
"percentage": 78.0
}
Fields:
- student_id: Student number string (e.g., "01", "14", "26")
- questions: Array of per-question grading objects (37 entries)
  - question_id: String identifier matching the manifest (e.g., "1", "3.a", "9.decimal", "15.b")
  - max_marks: Maximum marks for this question
  - awarded_marks: Marks awarded (float, can be 0.5 increments if rubric allows)
  - reasoning: Step-by-step analysis of mark allocation, always grounded in the rubric first. For each marking point in the grading guide: (a) state the rubric criterion being assessed, (b) explain whether and how the student's answer satisfies it, (c) for partial marks, specify exactly what was present and what was absent. For math questions, distinguish between method marks and answer marks where applicable. General subject-knowledge conventions may supplement the rubric only where the rubric is genuinely silent — and this must be stated explicitly (e.g., "Rubric does not specify, but standard practice requires…").
  - evidence: Direct verbatim quote(s) from the student's answer that demonstrate satisfaction (or non-satisfaction) of each rubric criterion. Do not paraphrase — quote the student answer exactly. For math questions, include the student's working steps and final answer.
- total_raw_score: Sum of all awarded_marks (should be out of 100)
- total_max_score: Sum of all max_marks (should equal 100)
- percentage: (total_raw_score / total_max_score) × 100

Grading methodology — sub-agents MUST follow this sequence:
1. Consult answer_page_map.json to locate the correct page, then find the portion of the extracted text that responds to this question.
2. The reasoning field must reference the specific rubric criterion and the specific student text (verbatim in evidence) that supports or fails each mark.

After each sub-agent completes:
Write a script scripts/validate_grading_output.py that:
- checks all 20 output/student_{NN}.json files
- validates required fields (student_id, questions, total_raw_score, total_max_score, percentage), value ranges (percentage ∈ [0,100], total_max_score == 100), and that total_raw_score equals the sum of all awarded_marks

Run the script. Re-run the sub-agent for any failing students. Re-run the script until all pass.
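The core per-file check could be sketched as below; field names follow the schema above, while the float tolerance is an assumption:

```python
import json
from pathlib import Path

def validate_grading_json(path: Path) -> list[str]:
    """Schema and consistency checks for one output/student_{NN}.json (sketch)."""
    errors = []
    data = json.loads(path.read_text(encoding="utf-8"))
    for field in ("student_id", "questions", "total_raw_score",
                  "total_max_score", "percentage"):
        if field not in data:
            errors.append(f"{path.name}: missing field '{field}'")
    if errors:
        return errors
    if data["total_max_score"] != 100:
        errors.append(f"{path.name}: total_max_score != 100")
    if not 0 <= data["percentage"] <= 100:
        errors.append(f"{path.name}: percentage out of range")
    awarded = sum(q["awarded_marks"] for q in data["questions"])
    if abs(awarded - data["total_raw_score"]) > 1e-6:  # tolerance is an assumption
        errors.append(f"{path.name}: total_raw_score != sum of awarded_marks")
    return errors
```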
Read all 20 output/student_{NN}.json files and compile into
output/final_scores.json:
{
"subject": "lingliang",
"total_students": 20,
"max_possible_score": 100,
"students": [
{
"student_id": "01",
"total_raw_score": 78,
"total_max_score": 100,
"percentage": 78.0
}
],
"statistics": {
"mean_score": 65.3,
"median_score": 67.0,
"std_dev": 18.4,
"min_score": 22,
"max_score": 95,
"mean_percentage": 65.3
}
}
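Compilation could be sketched as follows, assuming Python's statistics module and sample standard deviation (the spec does not say sample vs. population):

```python
import json
import statistics
from pathlib import Path

def compile_final_scores(out_dir: str = "output") -> dict:
    """Aggregate all output/student_{NN}.json files into the final_scores structure."""
    students = []
    for p in sorted(Path(out_dir).glob("student_*.json")):
        d = json.loads(p.read_text(encoding="utf-8"))
        students.append({"student_id": d["student_id"],
                         "total_raw_score": d["total_raw_score"],
                         "total_max_score": d["total_max_score"],
                         "percentage": d["percentage"]})
    scores = [s["total_raw_score"] for s in students]
    return {
        "subject": "lingliang",
        "total_students": len(students),
        "max_possible_score": 100,
        "students": students,
        "statistics": {
            "mean_score": round(statistics.mean(scores), 1),
            "median_score": float(statistics.median(scores)),
            "std_dev": round(statistics.stdev(scores), 1) if len(scores) > 1 else 0.0,
            "min_score": min(scores),
            "max_score": max(scores),
            "mean_percentage": round(statistics.mean(s["percentage"] for s in students), 1),
        },
    }
```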
Optionally, compute per-question statistics to identify questions that were particularly easy or difficult across the cohort:
This per-question analysis is informational — it does not affect individual student scores but provides useful diagnostic data.
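A possible per-question pass over the same JSON files; "facility" (mean awarded / max marks) is an illustrative metric name, not one the workflow mandates:

```python
import json
import statistics
from collections import defaultdict
from pathlib import Path

def per_question_stats(out_dir: str = "output") -> dict[str, dict]:
    """Mean awarded marks per question across the cohort (diagnostic only)."""
    marks = defaultdict(list)
    maxes = {}
    for p in sorted(Path(out_dir).glob("student_*.json")):
        for q in json.loads(p.read_text(encoding="utf-8"))["questions"]:
            marks[q["question_id"]].append(q["awarded_marks"])
            maxes[q["question_id"]] = q["max_marks"]
    return {qid: {"mean_awarded": round(statistics.mean(v), 2),
                  "max_marks": maxes[qid],
                  "facility": round(statistics.mean(v) / maxes[qid], 2)}
            for qid, v in marks.items()}
```

Questions with facility near 1.0 were easy for the cohort; values near 0.0 flag questions worth a manual rubric recheck.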
⚠️ DO NOT EXECUTE — Phase 7 is disabled for this subject. Lingliang is a primary school math exam with no level classification system. The grading output is raw scores only. Skip Phase 7 entirely.
⚠️ DO NOT EXECUTE — This phase is disabled for now. Skip Phase 8 entirely. Do not run any steps in this phase.
After score compilation, review the overall score distribution:
Randomly select 2–3 students and manually verify their grading:
Compare students with similar total scores:
Confirm all output files are complete and valid:
Output completeness:
- student_{NN}.json files exist with valid JSON (20 students)
- final_scores.json contains all 20 students with correct statistics

Input data (data/masked_data/lingliang/):
- rubrics/rubrics_q-{N}.pdf + rubrics/rubrics_q-{N}.jpg (37 per-question pairs)
- question/q-{N}.txt + question/q-{N}.jpg
- question_manifest.csv

Input data (data/reference/lingliang/):
- reference_mapping.json with known total scores

Checklist:
- VLM extraction routed via Gemini (gemini-3.1-flash-lite-preview, 1 page per call)
- If question_manifest.csv references files that don't exist on disk, fail with a clear listing of missing files before proceeding
- Dependencies installed (uv sync)
- model-service skill run at least once to confirm Gemini (gemini-3.1-flash-lite-preview) route availability from env.txt
- question_manifest.csv verified (37 questions, all paths valid)
- answer_page_map.json verified and accessible
- Grading guide created (rubric/grading_guide.md)
- Reference scores recorded (rubric/reference_scores.json)
- Calibration documented (rubric/reference_calibration.md)
- extraction skill used in Phase 2/3/4 extraction tasks with Gemini (gemini-3.1-flash-lite-preview) routing
- Grading output validated (scripts/validate_grading_output.py passes for all 20 students)
- Final scores compiled (output/final_scores.json)
- WARNING.md

Read split-part PDFs in numerical order. Files named _part1.pdf, _part2.pdf,
…, _partN.pdf MUST be read in sequence. Concatenate the extracted content in order.
Ignore _original.pdf files (these are the unsplit source).
Reference data informs score calibration. Grade reference students to compute
empirical score ranges. Compare AI-graded scores against known total scores from
reference_mapping.json. The reference data provides calibration anchors — use
them to verify the rubric is being applied consistently and that the grading
produces scores that correlate with known ground-truth scores.
No GRADING_YEAR needed — paths are fixed (no year-based structure). All data
paths use data/masked_data/lingliang/ and data/reference/lingliang/ directly.
There is no year nesting.
All report and commentary text must be written in Traditional Chinese (繁體中文). This applies to all narrative, feedback, analysis, and section content in generated DOCX reports. English is only permitted for: variable names, file paths, technical identifiers, chart axis labels, and JSON field names.
Skill-chain requirement (new):
reasoning