Skill for scanning a single passage/claim against source documents to compute similarity percentage, classify the match type, and produce citation recommendations for LaTeX. This is a subagent skill — called by the citation-agent orchestrator for each claim being verified. Handles exact matching, paraphrase detection, and similarity scoring.
You receive one claim (or a small batch of related claims) and one or more source .md files from the knowledge base. You determine whether the claim originates from a source, compute a similarity percentage, and recommend an action.
You will be given:
- The claim(s) to verify (e.g. C1, C2)
- One or more candidate source `.md` files in `research-kb/sources/`

Compute the similarity percentage (0-100%) between the user's claim and the best-matching passage in the source. Use this multi-factor approach:
Count shared content words (nouns, verbs, adjectives — ignore stop words) between the claim and the matching source passage.
```
lexical_score = (shared_content_words / total_unique_content_words_in_both) × 100
```
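The lexical factor can be sketched as below; the tokenizer and the stop-word list are illustrative assumptions, not a fixed specification:

```python
import re

# Minimal illustrative stop-word list; a real implementation would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "is", "are",
              "that", "this", "for", "with", "on", "as", "by", "it"}

def content_words(text):
    """Lowercase tokens minus stop words (rough stand-in for content words)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}

def lexical_score(claim, passage):
    """Shared content words over total unique content words in both, as a percentage."""
    claim_words = content_words(claim)
    passage_words = content_words(passage)
    union = claim_words | passage_words
    if not union:
        return 0.0
    return len(claim_words & passage_words) / len(union) * 100
```

Note this sketch treats inflected forms ("position" vs "positions") as distinct words; a production version might stem or lemmatize first.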
Check for shared phrases (n-grams of length 3-6 words). This catches structural copying that word-level analysis misses.
```
# Count matching 3-grams, 4-grams, 5-grams, 6-grams
# Weight longer matches more heavily:
ngram_score = (
    matching_3grams × 1.0 +
    matching_4grams × 2.0 +
    matching_5grams × 3.5 +
    matching_6grams × 5.0
) / max_possible_weighted_matches × 100
```
If any 8+ word sequence is identical → automatically set ngram_score = 100.
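A sketch of this factor follows. Normalizing by the claim's own weighted n-gram count is one reasonable reading of "max possible weighted matches", not the only one:

```python
NGRAM_WEIGHTS = {3: 1.0, 4: 2.0, 5: 3.5, 6: 5.0}

def ngrams(tokens, n):
    """Set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_score(claim, passage):
    c = claim.lower().split()
    p = passage.lower().split()
    # Any identical 8+ word sequence forces a score of 100
    if len(c) >= 8 and ngrams(c, 8) & ngrams(p, 8):
        return 100.0
    matched = sum(w * len(ngrams(c, n) & ngrams(p, n))
                  for n, w in NGRAM_WEIGHTS.items())
    possible = sum(w * len(ngrams(c, n)) for n, w in NGRAM_WEIGHTS.items())
    return matched / possible * 100 if possible else 0.0
```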
**Structural similarity**: does the claim preserve the sentence structure of the source (clause order, voice, syntactic pattern)?
**Semantic equivalence**: does the claim convey the same meaning as the source passage?
```
similarity = (lexical × 0.30) + (ngram × 0.35) + (structural × 0.15) + (semantic × 0.20)
```
Round to the nearest integer.
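The weighted combination is a one-liner; a minimal sketch, taking each factor score on a 0-100 scale:

```python
# Weights from the similarity formula above
FACTOR_WEIGHTS = {"lexical": 0.30, "ngram": 0.35, "structural": 0.15, "semantic": 0.20}

def combined_similarity(scores):
    """scores: dict mapping factor name -> 0-100 score. Returns rounded integer."""
    return round(sum(scores[name] * w for name, w in FACTOR_WEIGHTS.items()))
```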
| Score | Label | Colour | Action |
|---|---|---|---|
| 90-100% | VERBATIM | 🔴 | Must quote with \textcite or fully rewrite |
| 70-89% | CLOSE PARAPHRASE | 🟠 | Strongly recommend rewrite + \cite{} |
| 40-69% | MODERATE PARAPHRASE | 🟡 | Acceptable wording, add \cite{} |
| 20-39% | LOOSE REFERENCE | 🟢 | Citation recommended but optional |
| 0-19% | ORIGINAL | ✅ | No citation needed |
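The bands in the table above map to a trivial threshold helper (sketch; labels and emoji exactly as tabulated):

```python
def classify(similarity):
    """Map a 0-100 similarity score to its (label, emoji) band."""
    if similarity >= 90:
        return "VERBATIM", "🔴"
    if similarity >= 70:
        return "CLOSE PARAPHRASE", "🟠"
    if similarity >= 40:
        return "MODERATE PARAPHRASE", "🟡"
    if similarity >= 20:
        return "LOOSE REFERENCE", "🟢"
    return "ORIGINAL", "✅"
```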
Read the "Notable Quotes & Citable Passages" section of each candidate source .md for passages whose wording overlaps the claim.
Read the "Paraphraseable Claims" section and compare each listed claim against the one under review.
Skim the "Key Ideas", "Key Results", and "Methodology" sections for conceptual matches.
If no match is found in the .md files but the claim sounds like it could come from a source:
```bash
# Search the original PDF for key terms
pdftotext -layout [pdf_path] - | grep -i -C 5 "[distinctive term 1]"
pdftotext -layout [pdf_path] - | grep -i -C 5 "[distinctive term 2]"
```
Report the finding and note that the knowledge base summary may need updating.
For each claim, produce this structured output:
## Claim [ID]: "[claim text]"
### Match Result
- **Best match source**: `[bibtex_key]` — "[Paper title]"
- **Matching passage**: "[exact text from source]"
- **Location**: Section [X], Page [Y]
### Similarity Breakdown
| Factor | Score | Weight | Weighted |
|---|---|---|---|
| Lexical overlap | [N]% | 0.30 | [N×0.30]% |
| N-gram match | [N]% | 0.35 | [N×0.35]% |
| Structural similarity | [N]% | 0.15 | [N×0.15]% |
| Semantic equivalence | [N]% | 0.20 | [N×0.20]% |
| **TOTAL** | | | **[SUM]%** |
### Classification
- **Similarity**: **[SUM]%** [EMOJI] [LABEL]
- **Overlapping phrases**: "[phrase 1]", "[phrase 2]", ...
- **What's too close**: [Explain which specific words/structures are borrowed]
### Recommendation
- **Action**: [Quote it / Rewrite + cite / Just cite / No action needed / Verify manually]
- **LaTeX suggestion**:
```latex
[Rewritten or corrected text with \cite{} command]
```
If no match is found, say so explicitly and classify the claim as ✅ ORIGINAL.
## Rewrite Guidelines
When suggesting rewrites for 🔴 and 🟠 claims:
1. **Change sentence structure**: Active→passive or vice versa, reorder clauses, split/merge sentences
2. **Replace vocabulary**: Not just synonyms — rephrase the concept differently
3. **Integrate into argument**: Make the claim serve the user's narrative, not just restate the source
4. **Always include citation**: Even good paraphrases need `\cite{}`
Example:
Source: "Self-attention computes a weighted sum of all positions in a sequence, allowing the model to attend to relevant information regardless of distance."
🔴 Bad (92%): Self-attention computes a weighted sum across all sequence positions, enabling the model to attend to relevant information irrespective of distance.
🟡 OK (55%): The self-attention mechanism weights each token's contribution based on relevance rather than proximity, enabling long-range dependency capture \cite{vaswani2017_attention}.
✅ Good (28%): Unlike recurrent approaches that struggle with distant dependencies, attention-based models can directly relate any two positions in a sequence through learned relevance weights \cite{vaswani2017_attention}.
## LaTeX Citation Patterns
Suggest the appropriate pattern:
```latex
% Standard parenthetical
... as shown in prior work \cite{key}.
% Narrative (author is subject)
\textcite{key} demonstrated that...
% With page (for specific claims)
... as stated in \cite[p.~15]{key}.
% Direct quote (VERBATIM matches)
As \textcite[p.~7]{key} note, ``[exact quote]''
% Multiple sources supporting same claim
... consistent findings across studies \cite{key1, key2, key3}.
```

When the author is the grammatical subject, prefer `\textcite`. If the claim is a standard definition in the field, note that it may be common knowledge and not require a citation.