Detect and interpret figurative language (sarcasm, humor, offense, metaphor) in multimodal image-text content using a structured five-step reasoning chain. Triggers: 'detect sarcasm in this meme', 'is this image sarcastic', 'analyze figurative meaning', 'interpret this humor/metaphor', 'classify meme sentiment', 'explain why this meme is funny/offensive'.
This skill enables Claude to interpret figurative language in multimodal (image + text) content by applying a structured five-step reasoning chain derived from the CrossStyle-MMR framework. Instead of making snap classification judgments, Claude decomposes figurative language analysis into explicit steps: describing visual content, interpreting the caption, detecting mismatches between modalities, inferring communicative intent, and arriving at a justified label. This approach transfers across figurative styles -- reasoning trained on sarcasm improves humor detection and vice versa -- enabling robust cross-style generalization from a single unified pipeline.
The core insight is that figurative language understanding improves when a model produces a structured reasoning trace before it classifies. The framework uses a five-step reasoning schema: (1) describe what the image shows, (2) interpret what the caption says at face value, (3) detect mismatches between visual and textual meaning, (4) infer the speaker's actual communicative intent, and (5) predict the figurative label. This schema forces explicit incongruity detection -- the mechanism by which sarcasm inverts meaning, humor exploits absurdity, offense weaponizes context, and metaphor maps abstract concepts onto concrete imagery.
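The five-step schema can be turned into a zero-shot prompt along the following lines. This is a minimal sketch: the instruction wording, the `STEPS` table, and the `build_prompt` helper are illustrative assumptions, not the prompt used by the CrossStyle-MMR framework.

```python
from typing import Sequence

# Step names follow the schema described above; the instruction wording
# for each step is an assumption made for illustration.
STEPS = [
    ("Image Description", "Describe the visual content objectively; do not interpret intent yet."),
    ("Caption Interpretation", "State what the caption says at face value."),
    ("Mismatch Detection", "Identify incongruities between the image and the caption."),
    ("Intent Inference", "Infer the speaker's actual communicative intent."),
    ("Label Prediction", "Choose one label and justify it."),
]


def build_prompt(caption: str, labels: Sequence[str]) -> str:
    """Build a text prompt that asks for the five steps in order.

    The image itself is passed to the model separately; this helper only
    assembles the textual part of the request.
    """
    lines = [
        "Analyze the attached image together with its caption.",
        f"Caption: {caption}",
        "",
        "Reason in five steps:",
    ]
    for number, (name, instruction) in enumerate(STEPS, start=1):
        lines.append(f"Step {number} ({name}): {instruction}")
    lines.append("")
    lines.append(
        "Put the reasoning inside <think>...</think> and the final label "
        "inside <answer>...</answer>."
    )
    lines.append("Allowed labels: " + ", ".join(labels))
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("What a great start to my Monday!", ["sarcastic", "not-sarcastic"]))
```

Keeping the steps in a single table means the same prompt can be reused for every figurative style by changing only the label set.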
The pipeline has three stages.
Stage 1 (Distillation): A large teacher VLM (e.g., LLaMA3.2-90B-Vision-Instruct) generates chain-of-thought reasoning traces under zero-shot prompting using the five-step schema. Traces are automatically filtered to retain only those with valid reasoning paths and correct final labels.
Stage 2 (Student Training): A smaller student VLM (e.g., Qwen2.5-VL-3B) is trained with supervised fine-tuning (SFT) on the distilled traces, then further refined with Reinforcement Learning with Verifiable Rewards (RLVR), optimized with Group Relative Policy Optimization (GRPO), where the reward signal combines accuracy and format adherence.
Stage 3 (Cross-Style Unification): Joint SFT followed by GRPO across all four figurative styles produces a single generalized model that outperforms much larger models.
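The Stage 1 filtering step might look like the sketch below. The source says only that traces are kept when the reasoning path is valid and the final label is correct; the `Trace` record, its field names, and the concrete validity check (all five "Step N" markers present, in order) are assumptions made here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    """Hypothetical record for one teacher output."""
    reasoning: str  # the teacher's five-step reasoning text
    predicted: str  # the teacher's final label
    gold: str       # the dataset's reference label


def has_ordered_steps(reasoning: str) -> bool:
    """Assumed validity check: 'Step 1' .. 'Step 5' all appear, in order."""
    positions = [reasoning.find(f"Step {i}") for i in range(1, 6)]
    return all(p >= 0 for p in positions) and positions == sorted(positions)


def filter_traces(traces: List[Trace]) -> List[Trace]:
    """Keep traces whose reasoning is well-formed and whose label is correct."""
    return [
        t
        for t in traces
        if has_ordered_steps(t.reasoning)
        and t.predicted.strip().lower() == t.gold.strip().lower()
    ]
```

A stricter filter could also check the content of each step, but a structural check plus label agreement is the cheapest automatic criterion.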
The critical finding for practitioners: cross-style transfer is real and measurable. SFT on sarcasm chain-of-thought (CoT) traces followed by RLVR on humor yields ~10% accuracy improvement over training on humor alone. Sarcasm and humor transfer most readily to each other; metaphor and offense share structural similarities. This suggests you can bootstrap a classifier for a new figurative style from an existing one with relatively little target-style data.
Receive the multimodal input. Identify the image and accompanying text (caption, tweet, meme text overlay). If the user provides only text, note that full cross-modal reasoning requires an image -- fall back to text-only figurative analysis with a caveat.
Step 1 -- Image Description. Describe the visual content objectively. Note objects, people, expressions, settings, visual tone, and any text rendered in the image. Do not interpret intent yet. Wrap this in the reasoning trace.
Step 2 -- Caption Interpretation. Parse the textual caption at face value. Identify its literal meaning, tone markers (punctuation, capitalization, emoji), and any named entities or cultural references.
Step 3 -- Mismatch Detection. Compare the visual content against the textual meaning. Identify specific incongruities: does the image contradict the caption? Does the visual tone clash with the textual sentiment? Is there exaggeration, understatement, or absurdity in the pairing? This is the critical step -- figurative language lives in the gap between expressed and intended meaning.
Step 4 -- Intent Inference. Based on detected mismatches, infer the speaker's actual communicative intent. For sarcasm: the intended meaning is opposite to the literal text. For humor: the mismatch is designed to amuse. For offense: the pairing targets a group or individual. For metaphor: the image maps an abstract concept onto a concrete visual.
Step 5 -- Label Prediction with Justification. Output the figurative style classification (sarcastic/not-sarcastic, humorous/not-humorous, offensive/not-offensive, metaphorical/literal) along with a confidence assessment grounded in the reasoning chain.
Format the reasoning trace. Enclose the full five-step reasoning in <think>...</think> tags and the final classification in <answer>...</answer> tags for downstream parseability.
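A downstream consumer could split such an output with a small parser like the one below. This is a sketch under the assumption that a well-formed output contains exactly one <think> block followed by exactly one <answer> block and nothing else; `parse_trace` is a hypothetical helper name.

```python
import re
from typing import Optional, Tuple

# One <think> block followed by one <answer> block, with optional
# surrounding whitespace. DOTALL lets the reasoning span multiple lines.
TRACE_RE = re.compile(
    r"\s*<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>\s*",
    re.DOTALL,
)


def parse_trace(output: str) -> Optional[Tuple[str, str]]:
    """Return (reasoning, label), or None if the output is malformed."""
    # Reject outputs where any tag is missing or repeated.
    for tag in ("<think>", "</think>", "<answer>", "</answer>"):
        if output.count(tag) != 1:
            return None
    match = TRACE_RE.fullmatch(output)
    if match is None:
        return None
    return match.group("think").strip(), match.group("answer").strip()
```

Returning None for malformed output, rather than guessing at a label, keeps formatting failures visible to whatever code consumes the result.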
Apply cross-style awareness. If the content exhibits multiple figurative styles (e.g., sarcastic and humorous), classify each independently using the same reasoning chain, noting shared incongruity signals.
Handle ambiguous cases explicitly. If the mismatch detection step yields weak or contested signals, flag the content as ambiguous rather than forcing a binary label. Report which aspects push toward figurative vs. literal interpretation.
When building a training pipeline: Generate distillation traces from a large teacher model using the five-step schema prompt, filter for correctness, fine-tune a smaller student model via SFT (5 epochs, lr=2e-4, cosine schedule), then apply GRPO (2 epochs, lr=1e-5, 8 rollouts) with a reward combining accuracy + format adherence.
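The reward used in the GRPO stage could be sketched as follows. The source states only that the reward combines accuracy and format adherence; the weighted-sum form, the weights `w_acc` and `w_fmt`, and the function names are assumptions chosen for illustration.

```python
import re

# A completion is well-formed if it is exactly one non-empty <think> block
# followed by one non-empty <answer> block.
FORMAT_RE = re.compile(
    r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*", re.DOTALL
)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the <think>/<answer> layout, else 0.0."""
    return 1.0 if FORMAT_RE.fullmatch(completion) else 0.0


def accuracy_reward(completion: str, gold_label: str) -> float:
    """1.0 if the first <answer> block matches the gold label, else 0.0."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == gold_label.strip().lower() else 0.0


def combined_reward(
    completion: str,
    gold_label: str,
    w_acc: float = 1.0,  # assumed weight, not taken from the source
    w_fmt: float = 0.5,  # assumed weight, not taken from the source
) -> float:
    """Weighted sum of the accuracy and format rewards."""
    return w_acc * accuracy_reward(completion, gold_label) + w_fmt * format_reward(
        completion
    )
```

Both components are verifiable by rule, with no learned reward model involved, which is what makes the signal suitable for RLVR. With the assumed weights, a correct and well-formed completion scores 1.5 and a well-formed but incorrect one scores 0.5.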
Example 1: Sarcasm Detection in a Meme
User: "Is this meme sarcastic? The image shows a person smiling while their car is on fire. The caption says 'What a great start to my Monday!'"
Approach:
<think>
Step 1 (Image Description): The image shows a person standing next to a car
engulfed in flames. The person is smiling or grinning. The setting appears to
be a parking lot during daytime.
Step 2 (Caption Interpretation): The caption "What a great start to my Monday!"
literally expresses enthusiasm and positivity about the beginning of the week.
The exclamation mark reinforces an upbeat tone.
Step 3 (Mismatch Detection): Strong incongruity detected. The visual content
depicts a clearly negative event (car fire) paired with positive/enthusiastic
text. The person's smile despite the disaster amplifies the contradiction. This
is a textbook sentiment inversion pattern.
Step 4 (Intent Inference): The speaker's actual intent is the opposite of the
literal text. They are expressing frustration, exasperation, or resignation
about a terrible start to their week. The smile is performative -- a coping
mechanism that reinforces the sarcastic framing.
Step 5 (Label): Sarcastic. High confidence. The image-text incongruity follows
a classic sarcasm pattern: positive language applied to an unambiguously
negative situation, with visual cues (forced smile) supporting ironic intent.
</think>
<answer>Sarcastic</answer>
Example 2: Humor vs. Literal in an Image-Text Pair
User: "Classify this: Image shows a dog sitting at a desk wearing glasses with a laptop open. Caption: 'When the code compiles on the first try.'"
Approach:
<think>
Step 1 (Image Description): A dog is seated at an office desk, wearing
reading glasses, positioned in front of an open laptop. The dog's posture
mimics a human office worker. The setting is a typical home office.
Step 2 (Caption Interpretation): The caption references a software development
scenario -- code compiling successfully on the first attempt. This is widely
understood among programmers as a rare and surprising event.
Step 3 (Mismatch Detection): The incongruity operates on two levels. First,