Build evaluation pipelines for LLMs in legal tasks using a three-dimensional framework: outcome correctness, reasoning reliability, and trustworthiness. Use when asked to 'evaluate LLM legal performance', 'build a legal benchmark', 'test legal reasoning quality', 'audit LLM fairness in judicial tasks', 'create a legal eval suite', or 'assess LLM trustworthiness for law'.
This skill enables Claude to design, implement, and run structured evaluation pipelines for large language models performing legal tasks. It applies a three-dimensional evaluation framework from Hu et al. (2026) that goes beyond surface accuracy to assess outcome correctness, legal reasoning reliability, and trustworthiness (fairness, robustness, safety). The approach decomposes evaluation into result-focused, process-focused, and constraint-focused layers, drawing on established legal methodology like the IRAC framework (Issue, Rule, Application, Conclusion) and counterfactual fairness testing.
The paper's core insight is that legal LLM evaluation must operate on three independent dimensions, not just accuracy. A model can produce a correct verdict through flawed reasoning (e.g., citing a nonexistent statute), or produce consistently accurate results that are systematically biased against a demographic group. Single-metric evaluation misses both failure modes.
Dimension 1 -- Outcome Correctness uses standardized tasks (legal MCQ, judgment prediction, entity extraction) scored with traditional metrics: accuracy, F1, ROUGE-L, BERTScore, NDCG. These establish a performance baseline but say nothing about how the model arrived at its answer.
Dimension 2 -- Reasoning Reliability applies the IRAC decomposition. Each model response is broken into four stages: identifying the legal Issue, recalling the relevant Rule (statute, precedent), Applying the rule to the facts, and stating a Conclusion. Each stage is scored independently using expert-designed rubrics or structured LLM-as-judge prompts that check citation validity, logical coherence, and rule-fact alignment. This catches hallucinated citations and non-sequitur reasoning even when the final answer is correct.
Dimension 3 -- Trustworthiness uses counterfactual testing: swap legally irrelevant attributes (defendant name, gender, ethnicity) and measure output inconsistency. Metrics include inconsistency rate, bias regression coefficients, and disparity ratios across protected groups. This dimension also covers hallucination detection (fabricated case citations, invented statutes) and robustness to adversarial prompt injection.
Define the legal task scope. Identify the specific legal tasks under evaluation (e.g., case classification, contract clause extraction, judgment prediction, legal QA). Map each task to one of three categories: generation task, decision task, or retrieval task, since metric selection depends on this.
Select or construct evaluation datasets. For each task, choose from established benchmarks (LegalBench for multi-dimensional reasoning, CAIL2018 for judgment prediction, LeCaRDv2 for case retrieval, JudiFair for fairness) or construct a custom dataset. Ensure the dataset includes jurisdiction-specific material matching the deployment context. For custom datasets, include realistic noise: redundant facts, ambiguous clauses, and legally irrelevant details.
Implement outcome correctness metrics. For decision tasks, compute accuracy, precision, recall, and F1. For generation tasks, compute ROUGE-L and BERTScore against reference outputs. For retrieval tasks, compute NDCG@k and MRR. Store all results in a structured eval report (JSON or YAML).
Build IRAC reasoning rubrics. For each generation or reasoning task, create a scoring rubric with four components: (a) Issue identification -- did the model correctly identify the legal question? (b) Rule recall -- did it cite real, relevant statutes or precedents? (c) Application -- did it logically apply the rule to the given facts? (d) Conclusion -- is the conclusion consistent with the application? Score each component on a 0-3 scale with explicit criteria per level.
Implement reasoning evaluation. Use structured prompts to decompose model outputs into IRAC segments. Validate citations against a legal database or known-good reference set. Flag hallucinated citations (case names that don't exist, statutes with wrong section numbers). Score each segment using the rubric from step 4, either via expert review or a calibrated LLM-as-judge with few-shot legal examples.
Build counterfactual fairness tests. For each evaluation instance, create variants by swapping legally irrelevant attributes (names suggesting different demographics, gender pronouns, geographic indicators). Run all variants through the model. Compute inconsistency rate (fraction of instances where the output changes) and bias regression coefficients (correlation between attribute changes and output changes).
Test adversarial robustness. Inject adversarial perturbations: irrelevant but persuasive facts, contradictory precedents, prompt injection attempts asking the model to ignore instructions. Measure accuracy degradation and reasoning score changes under perturbation.
Aggregate into a three-dimensional eval report. Produce a structured report with separate scores for each dimension. Flag any dimension where scores fall below defined thresholds. Do not allow a high accuracy score to mask poor reasoning or fairness scores -- each dimension gates independently.
Example 1: Evaluating a Legal QA System
User: "I have a legal QA model that answers questions about US contract law. How do I evaluate it properly?"
Approach:
{
"task": "US Contract Law QA",
"outcome_correctness": {
"mcq_accuracy": 0.83,
"open_ended_rouge_l": 0.47,
"open_ended_bertscore": 0.71
},
"reasoning_reliability": {
"issue_identification": 2.6,
"rule_recall": 1.9,
"rule_application": 2.1,
"conclusion_consistency": 2.4,
"citation_hallucination_rate": 0.18
},
"trustworthiness": {
"counterfactual_inconsistency_rate": 0.07,
"bias_regression_coefficient": 0.02,
"adversarial_accuracy_drop": 0.12
},
"pass": false,
"blocking_issues": ["citation_hallucination_rate exceeds 0.10 threshold"]
}
Example 2: Building a Fairness Audit for Judicial Decision Support
User: "We're deploying an LLM to help judges with bail decisions. How do I test it for bias?"
Approach:
# Fairness evaluation pseudocode
for attribute in ["ethnicity", "gender", "age", "socioeconomic"]:
variants = group_by_attribute(results, attribute)
inconsistency = count_changed_decisions(variants) / total_scenarios
bias_coeff = regression_coefficient(attribute_values, decision_scores)
disparity = max_group_rate(variants) - min_group_rate(variants)
report[attribute] = {
"inconsistency_rate": inconsistency, # target < 0.05
"bias_coefficient": bias_coeff, # target < 0.01
"disparity_ratio": disparity # target < 0.03
}
Example 3: CI Pipeline for Legal Document Summarization
User: "Add legal eval to our CI pipeline for a statute summarization model."
Approach:
eval_config.yaml:Automate as a pipeline. Package the evaluation as a runnable script or CI job. Define pass/fail criteria for each dimension. Output machine-readable results (JSON) alongside a human-readable summary with per-task breakdowns and flagged failures.
Iterate with jurisdiction-specific adaptation. Adjust rubrics and datasets for the target legal system. IRAC maps directly to common law; for civil law jurisdictions, adapt to the analogous structure (fact finding, statutory interpretation, subsumption, decision). Update citation validation databases accordingly.