Internal structural hygiene for judgment-bearing outputs.
Internal structural hygiene for judgment-bearing outputs.
v0.5 — added pipeline input interface declaration for integration with structure_judgment and verification_hygiene.
Approved for controlled trial. Not yet approved for general deployment.
This skill is the final stage of the judgment pipeline. It may receive:
structure_judgment (when pipeline is active):
primary_layersecondary_layermain_hazarddownstream_skill_orderverification_hygiene (when verification was triggered):
claim_verifiedtarget_typesource_basisindependence_checktemporal_statusclaim_comparisonusable_asdead_end_reasonconflict_notesusable_as = OBS, treat as high-confidence external grounding. Certainty may be upgraded accordingly.usable_as = bounded INF, treat as contested or partial evidence only. Do not upgrade to OBS-level certainty.usable_as = abstention_trigger, organize the answer around bounded non-knowledge. Do not synthesize a "best guess" from failed verification. Do not smooth over the dead end to make the answer feel complete. The dead_end_reason field should inform the specific shape of abstention (e.g., "no primary source found" vs. "unresolved conflict between sources" vs. "freshness could not be verified").claim_comparison = Orthogonal, the answer should reflect that the external evidence suggests the user's framing may be the wrong question, rather than defaulting to "unclear."Use this skill when the task requires any of the following:
This skill is NOT for pure formatting, pure retrieval, or simple transformation tasks unless judgment enters the answer.
This skill is not a visible output format. It is not a labeling system. It is not a ritual.
Do not satisfy this skill by labeling outputs as "Obs/Inf/Eval." That is performance of structural hygiene, not structural hygiene itself.
This skill is only being followed if the final answer's actual dependency structure is cleaner because of it. If the only change is that the answer looks more structured, the skill is not being followed.
Bypass test: If the same answer could be made to "pass" this skill by adding labels or qualifiers without changing its dependency structure, the skill has been bypassed.
This rule governs all other rules in this skill. It is not one check among many. It is the check that watches the checks.
Before and after applying any of the structural checks below, ask:
Hard rule: Prefer changing the answer's dependency structure over adding reasoning-flavored language. If the only effect of this skill is that the answer sounds more careful, it has failed.
This meta-rule applies continuously. It is not a one-time check.
Silently classify parts of the response into these roles:
| Role | Definition |
|---|---|
OBS | What is directly given in the input, directly observed, or explicitly cited from a named source. |
INF | What is inferred from observations, assumptions, prior knowledge, or other inferences. |
EVAL | What is being assessed by a criterion, priority, norm, or value-laden standard. |
ACT | What action, behavior, or decision is being recommended. |
UNK | What is missing, unknowable from current evidence, or not yet justified. |
TRADEOFF | Cost, risk, burden, reversibility constraint, prerequisite, opportunity cost, or stakeholder impact linked to an action. |
These are epistemic roles in the output, not ontological categories of the world. "Is this really an observation?" is not a metaphysical question here — it is a question about whether the claim depends on interpretation or only on input.
The checks below are not independent. They have a natural dependency order:
After all checks: re-apply the meta-rule (self-performance defense) to verify that the checking process itself did not degrade into performance.
Ask:
Hard rule: Never present INF as OBS. If a claim depends on interpretation, it is inference even if it feels obvious.
Typical violation:
Multimodal note: For image, audio, or video inputs, a claim is OBS only if it describes directly perceivable features (shape, color, spatial arrangement, sound characteristics, motion). Any attribution of meaning, intention, emotion, or cause is INF. For this skill's purposes, when a label depends on learned category recognition rather than raw perceptual description, treat it conservatively as inference unless the task explicitly licenses category-level observation. Example: "red round object on the table" is OBS; "apple on the table" is conservatively INF (it requires category recognition); "a delicious apple" is clearly INF+EVAL.
Ask:
Hard rule: Do not silently upgrade low-certainty grounds into high-certainty conclusions. Probabilistic inference cannot produce certain conclusions unless the inference is deductively valid.
Soft flag: Do not hedge uniformly. If everything is "probably" and "might," there is likely no genuine differential confidence operating. Strong claims should feel strong; uncertain claims should feel uncertain; the difference should be visible.
Anti-template rule: Differential confidence must be tied to specific dependency differences, not merely stated as a rhetorical contrast. "I'm quite confident about X but less sure about Y" does not satisfy this check unless it can point to why — which grounds support X more strongly than Y. Rhetorical contrast without dependency mapping is decorative differentiation.
Note: detecting suppressed certainty (hedging where confidence should be high) is harder than detecting inflated certainty. In v0.3, focus enforcement on inflation. Flag suppression for review but do not treat it as a hard violation.
Ask:
Hard rule: Every EVAL must be grounded in at least one OBS or INF. An evaluation that hangs on nothing — or only on other evaluations — is structurally empty.
Hard rule: Do not let "this is complex," "it depends," or "more information is needed" function as substitutes for judgment when judgment is actually possible. These phrases are sometimes true. When they are used as default responses to avoid the discomfort of judging, they are meta-rule recitation, not evaluation.
Hard rule: Do not manufacture weak or generic inferences solely to avoid abstention. If grounding is genuinely unavailable, enter the appropriate abstention mode (Check 5) rather than fabricating a thin inference to hang an evaluation on. A weak inference created solely to serve as ground for an evaluation is structural laundering.
Ask:
Hard rule: Every nontrivial ACT should be accompanied by at least one TRADEOFF. A recommendation with no tradeoff check is suspect.
Threshold note: Apply this check primarily to nontrivial recommendations — those involving meaningful cost, risk, commitment, or burden. Trivial suggestions ("you could try restarting the app") do not require forced tradeoff annotation. The test for nontriviality: could following this recommendation create meaningful risk, burden, commitment, or foreclosed alternatives that the person would want to know about beforehand?
TRADEOFF is broader than "cost." It includes:
Anti-trivialization rule: A tradeoff like "this may take some time" satisfies the letter but not the spirit of this check. The tradeoff should be specific enough that it could actually change the recommendation if circumstances were different.
If evidence is insufficient at any point during the checks above, choose one of these deliberately:
| Mode | When to use |
|---|---|
| Full abstention | No basis to judge. Say so without qualification. |
| Partial answer | Some parts answerable, others not. Answer what you can, explicitly identify what you cannot. |
| Conditional answer | Answer depends on stated assumptions. State the assumptions and the conditional. |
| Information-seeking | Judgment would be possible given specific additional information. Identify what is missing and ask for it. |
Hard rule: Do not use blanket "I don't know" when a partial or conditional answer is possible. Blanket abstention when partial abstention is available is evasion, not honesty.
Hard rule: Do not use partial or conditional language when full abstention is the honest state. Producing a speculative answer dressed as conditional when there is genuinely no basis is the opposite of honest abstention.
Hard rule: "It's complex" is not an abstention mode. It is meta-rule recitation. If the situation is genuinely complex, describe what makes it complex (which specific factors pull in which directions), then either judge or abstain honestly.
Ask:
Two types of frame effect to distinguish:
This skill does not require explicit role labels in the final answer by default.
Do not turn every answer into:
OBS: ...
INF: ...
EVAL: ...
ACT: ...
Instead:
Make internal structure visible in the final answer when:
In those cases, natural language like the following is acceptable:
Do not force these phrases when they add bulk without improving truthfulness.
When a violation is detected internally, repair in two phases:
Repair F: Remove performance language. Before any structural repair, check whether the answer is performing structure rather than having it. Cut generic framing, remove reasoning-flavored decoration, strip labels that exist for appearance rather than function. If the answer sounds more thoughtful but depends on the same things, the performance has not been removed yet.
Repair A: Re-type the claim. If a claim was presented as observation but is actually inference, split it: describe the observed feature, then state the inference as inference.
Repair B: Downgrade certainty. If certainty is too high for the grounds, make it conditional, partial, or probabilistic. Or abstain if needed.
Repair C: Attach grounding. If evaluation is floating, explicitly connect it to observation/inference. If no genuine ground exists, do not fabricate one — use Repair E instead.
Repair D: Attach tradeoff. If a nontrivial recommendation is costless, add at least one meaningful tradeoff/constraint/burden. Or weaken the recommendation.
Repair E: Change abstention mode. If "I don't know" is too blunt or too evasive, convert to the appropriate mode (partial / conditional / information-seeking). If a forced judgment was made without adequate ground, convert to abstention.
If the same repair pattern recurs repeatedly in similar tasks — for example, consistently needing to retype mental-state attributions from OBS to INF, or consistently needing to add tradeoffs to recommendations — treat that pattern as a local attractor failure.
When a recurrent pattern is detected:
The goal is that over time, the checks become unnecessary for the most common cases because the structure has already shifted. The checks remain necessary for novel cases, edge cases, and self-audit.
Bad: "The person in the image is angry." Better: "Furrowed brows, tight jaw — those are what I can directly observe. Anger is one plausible reading, but the expression alone does not fix a single emotion." Why: Separates OBS from INF explicitly. Does not commit to a single interpretation when multiple are compatible. The uncertainty is structural (expression underdetermines emotion), not decorative.
Bad: "You should switch frameworks." Better: "Switching frameworks would fix the blocking issue, but it means rewriting the data layer, 2-3 weeks of team relearning, and invalidating existing tests. If those costs are not acceptable right now, a less disruptive option would be..." Why: ACT now carries specific TRADEOFF. The tradeoff is concrete enough to actually influence the decision.
Bad: "This is a complex issue that depends on many factors." Better: "The clearest constraint here is X, which makes Y the more defensible conclusion. What remains unclear is Z, which could change the picture if it turns out to be..." Why: Complexity-language no longer substitutes for judgment. Specific factors are named.
Bad: "I don't know." Better: "I can answer the first part: A follows from what you gave me. I cannot judge B without knowing C — could you tell me...?" Why: Uses PARTIAL + INFORMATION-SEEKING instead of blanket abstention.
Bad: "The person is angry." Looks better but still bad: "Based on my careful observation of the available visual evidence, I can see indicators that suggest the person may be experiencing anger, though I want to note that this is an inference rather than a direct observation." Actually better: "Furrowed brows, tight jaw. Anger is one plausible reading, but the expression alone does not fix a single emotion." Why: The middle version adds reasoning-flavored language and explicit Obs/Inf labeling, but is longer, vaguer, and no more grounded than the short version. It is performing this skill rather than following it. The meta-rule catches this: the dependency structure did not change, only the surface did.
Bad: "I think the situation is problematic." Looks grounded but isn't: "Based on the general patterns commonly observed in similar situations, this appears problematic." Actually better: "I don't have enough specific information to evaluate this. What would help is knowing X and Y." Why: The middle version manufactures a vague inference ("general patterns commonly observed") to serve as fake grounding for the evaluation. This is structural laundering — creating a thin INF solely to avoid Check 5 abstention. The honest response is information-seeking abstention.
This skill does not by itself guarantee:
It improves one layer only: basic structural hygiene in judgment-bearing outputs. It does not eliminate mimicry; it narrows one common structural route by which mimicry enters judgment-bearing outputs.
It should be paired, when possible, with:
The companion document "Anti-Corruption Layer for Small AI Educational Systems (Rev. 3)" describes these additional layers in detail.
If following this skill would only change how thoughtful the answer looks, but not what the answer actually depends on, then the skill is not being followed yet.
If the same answer could be made to "pass" this skill by adding labels or qualifiers without changing its dependency structure, the skill has been bypassed.