Run the AI Voice Analyzer on blog content to detect AI-sounding patterns and get actionable rewrite suggestions. Use when reviewing or improving blog articles before publishing.
Analyze blog articles for AI-sounding patterns using NLP, then use the diagnostics to improve the writing before publishing. The goal is content that reads as naturally as strong human writing and avoids being flagged by Google's helpful content signals.
Invoked by the user with /blog-voice-analyzer or when asked to "check" or "analyze" a blog post for AI patterns.
```shell
pipenv run python3 scripts/blog/ai_voice_analyzer.py <path-to-markdown-file>
```
The script accepts any text file (markdown, plain text). It strips frontmatter, HTML, and markdown formatting before analysis.
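The cleanup step can be pictured with a minimal stdlib sketch. This is illustrative only, not the script's actual implementation; the function name and regexes here are assumptions about the kind of stripping described above.

```python
import re

def strip_to_prose(text: str) -> str:
    """Illustrative cleanup: YAML frontmatter, HTML tags, common markdown."""
    # Drop YAML frontmatter delimited by --- fences at the top of the file
    text = re.sub(r"\A---\n.*?\n---\n", "", text, flags=re.DOTALL)
    # Drop HTML tags
    text = re.sub(r"<[^>]+>", "", text)
    # Drop fenced code blocks, then inline code
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)
    text = re.sub(r"`[^`]*`", "", text)
    # Unwrap links: [label](url) -> label
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Strip heading markers and emphasis
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)
    text = re.sub(r"[*_]{1,3}", "", text)
    return text
```

Whatever survives this pass is what the 18 checks below actually score.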
The analyzer runs 18 independent checks organized into four categories. Each scores 0-100 (higher = more human), and the overall score is a weighted average.
| Check | What It Measures | AI Signature | Human Signature |
|---|---|---|---|
| Sentence length variance | Std dev of word counts per sentence | Clusters around 15-20 words (std dev < 4) | Wild variation: 3-word fragments mixed with 30-word sentences (std dev 8+) |
| Sentence opener diversity | POS patterns of first 2 tokens in each sentence | 40%+ start with "The" or "This" | Fragments, questions, conjunctions, inversions, prepositional phrases |
| Clause depth variety | Max dependency tree depth per sentence | Uniform depth across sentences | Mix of flat simple sentences and deeply nested complex ones |
| Paragraph size variety | Coefficient of variation of paragraph word counts | Every paragraph roughly the same length | One-liners mixed with long blocks |
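The sentence-length variance check in the table above can be approximated with the stdlib alone. A hedged sketch: the real analyzer uses spaCy's sentence segmenter, while the regex split here is a naive stand-in.

```python
import re
import statistics

def sentence_length_stdev(text: str) -> float:
    """Std dev of words-per-sentence; a naive regex splitter stands in
    for spaCy's sentence segmentation."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return statistics.pstdev(lengths) if len(lengths) > 1 else 0.0

# Illustrative inputs: uniform sentences score ~0, varied ones score high.
monotone = "This is a sentence of seven words here. " * 5
varied = ("Short. This one is quite a bit longer, stretching well past "
          "twenty words before it finally comes to a stop. Why? "
          "Because variety reads human.")
```

A std dev below 4 is the AI signature; 8 or more reads as human variation.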
| Check | What It Measures | AI Signature | Human Signature |
|---|---|---|---|
| Vocabulary diversity (TTR) | Type-token ratio of content words | Low TTR (< 45%) — recycles same words | Higher TTR, though some writers deliberately use simple vocabulary |
| Hedge/filler phrases | Exact match against ~50 AI-marker phrases + ~35 signal words | "It's important to note", "multifaceted", "leverage", "delve", "cornerstone" | Zero matches |
| Weak adverbs | Density of "really", "very", "literally", "significantly", etc. | > 1% density | Replaced with stronger verbs or cut entirely |
| Nominalization density | Nouns ending in -tion, -ment, -ness, -ity, -ence, -ance | > 5% — "reduction", "transition", "consumption" instead of active verbs | < 3% — prefers "reduce", "shift to", "consume" |
| Vague verb phrases | "contributes to", "remains a", "poses a", "provides a", "aims to", etc. | 4-6+ per article | Zero — uses direct assertions |
| Word repetition | Content words exceeding expected frequency (topic words get higher threshold) | "Substantial" 4x in 300 words | Topic words may repeat naturally; non-topic words stay varied |
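The nominalization check above is easy to approximate by suffix alone. A sketch under stated assumptions: the real analyzer restricts the count to spaCy-tagged nouns, so this suffix-only version over-counts words like "experience" used as a verb.

```python
def nominalization_density(words):
    """Share of words ending in common nominalization suffixes.
    Suffix-only heuristic; the real check also requires a noun POS tag."""
    suffixes = ("tion", "ment", "ness", "ity", "ence", "ance")
    hits = [w for w in words if w.lower().endswith(suffixes)]
    return len(hits) / max(len(words), 1)

# Illustrative inputs mirroring the table's examples.
heavy = "The reduction in consumption requires the implementation of a transition".split()
light = "We cut how much we use and shift to a new plan".split()
```

The heavy example lands well above the 5% AI threshold; the rewritten version lands under 3%.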
| Check | What It Measures | AI Signature | Human Signature |
|---|---|---|---|
| Personal voice | First person ("I", "we"), second person ("you"), contractions ("don't", "it's") | Zero of all three | First person for opinions, second person for engagement, contractions for warmth |
| Questions asked | Sentences ending with ? | Zero questions — pure declaration | 5-10% of sentences are questions (rhetorical or direct) |
| Concrete specifics (NER) | Named entities: people, places, dates, numbers, orgs | Zero — everything abstract and generic | Names, dates, numbers, real examples |
| Readability register | Flesch-Kincaid grade + avg syllables per word | Grade 14+ (academic), avg syllables > 1.8 | Grade 6-10 (conversational), avg syllables < 1.6 |
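The personal-voice check above reduces to counting three marker classes. A minimal sketch, with assumptions flagged: the pronoun sets and the apostrophe-suffix contraction heuristic are illustrative, not the script's exact lists.

```python
import re

def personal_voice_counts(text: str) -> dict:
    """Counts of the three personal-voice markers: first person,
    second person, contractions. Heuristic word lists, not the
    analyzer's exact ones."""
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "first_person": sum(w in {"i", "we", "me", "us", "my", "our"} for w in words),
        "second_person": sum(w in {"you", "your", "yours"} for w in words),
        "contractions": sum(w.endswith(("'t", "'s", "'re", "'ve", "'ll", "'d")) for w in words),
    }
```

A zero on all three counts is the AI signature this dimension penalizes.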
| Check | What It Measures | AI Signature | Human Signature |
|---|---|---|---|
| Passive voice | Dependency labels nsubjpass / auxpass | > 15% of sentences | < 10% |
| Transition word openers | "However", "Furthermore", "Additionally" at sentence start | > 0.5 per paragraph | Let ideas flow without signposting |
| Triple-item lists | "X, Y, and Z" coordinated patterns | > 2 per 1000 words | Not everything comes in threes |
| Paired adjective cliches | "ADJ and ADJ" via dependency parse ("smooth and swift", "widespread and uniform") | > 3 per 1000 words | Picks the stronger word |
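The triple-item list check above can be roughed out with a regex. Hedged sketch: the real check uses a dependency parse and also catches coordinated forms this Oxford-comma-only pattern misses.

```python
import re

TRIPLE = re.compile(r"\b\w+, \w+, and \w+\b")

def triple_lists_per_1000_words(text: str) -> float:
    """Density of 'X, Y, and Z' patterns per 1000 words.
    Regex approximation of the analyzer's dependency-based check."""
    n_words = len(text.split())
    return len(TRIPLE.findall(text)) / max(n_words, 1) * 1000
```

More than 2 per 1000 words is the AI signature for this dimension.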
| Range | Interpretation |
|---|---|
| 75-100 | Reads naturally. Minor tweaks on flagged items. |
| 55-74 | Some AI patterns visible. Targeted rewrites recommended. |
| 35-54 | Clear AI voice. Significant rewriting needed. |
| 0-34 | Strongly AI-generated. Full rewrite recommended. |
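The bands above map directly to a lookup. A trivial sketch of that mapping (function name illustrative):

```python
def interpret_score(score: float) -> str:
    """Map an overall 0-100 score to the interpretation bands."""
    if score >= 75:
        return "Reads naturally"
    if score >= 55:
        return "Some AI patterns visible"
    if score >= 35:
        return "Clear AI voice"
    return "Strongly AI-generated"
```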
Only sentences with 2+ issues are flagged (reduces noise). Each is labeled HIGH/MED severity with specific diagnostics like:
```text
[HIGH] #13:
"The variability in charging station availability, especially in rural areas, poses a challenge..."
→ Vague verb: "poses a"
→ Nominalization-heavy (4): variability, station, availability, distance
→ Length (20w) ≈ average (21w)
```
These are the sentences to rewrite first.
The summary lists the 5 worst-scoring dimensions with specific actions. Address these in order.
```shell
pipenv run python3 scripts/blog/ai_voice_analyzer.py path/to/article.md
```
Work through the priority fixes list. The most impactful improvements by category:
If Personality scores low (< 70): add first-person opinions, address the reader as "you", and use contractions.
If Questions score low (< 50): convert a few flat declarations into rhetorical or direct questions.
If Readability scores low (< 70): shorten sentences and swap multi-syllable words for plain ones; aim for grade 6-10.
If Nominalization scores low (< 70): replace -tion/-ment/-ness nouns with their verb forms ("reduce", not "reduction").
If Hedge Phrases are detected: cut them and state the point directly.
If Vague Verbs are detected: replace "poses a", "contributes to", and similar with concrete assertions.
If Sentence Variance is low (< 70): break up same-length runs; mix short fragments with longer sentences.
```shell
pipenv run python3 scripts/blog/ai_voice_analyzer.py path/to/article-v2.md
```
Target: overall score > 80, zero HIGH-severity flagged sentences.
To compare how different AI models perform on similar prompts:
```shell
pipenv run python3 scripts/blog/benchmark_models.py
```
This generates 3 articles per model (Claude Sonnet, Haiku; GPT-4o, 4o-mini, o3-mini) on psychology/productivity topics and scores them all. Edit TOPICS and MODELS in the script to customize.
Things the analyzer catches well: surface-level statistical tells such as monotone structure, stock AI phrases, missing personal voice, and abstract, entity-free prose.
Things the analyzer does NOT catch: factual errors, weak arguments, or prose that is statistically varied but still hollow. A high score is necessary, not sufficient.
From testing across known human and AI text:
| Text | Score | Notes |
|---|---|---|
| Paul Graham essays | 85-86 | Gold standard for clear, human prose |
| Clarido blog post (AI-written, edited) | 84 | Well-crafted AI output with personality |
| Claude Haiku (default prompt) | 82 avg | Best out-of-the-box AI model |
| Claude Sonnet (default prompt) | 77 avg | Higher grade level, more nominalizations |
| GPT-4o-mini (default prompt) | 70 avg | Low personality, no contractions |
| o3-mini (default prompt) | 68 avg | Zero contractions, impersonal |
| GPT-4o (default prompt) | 66 avg | Worst personality, fewest questions |
| Raw ChatGPT (no prompting) | 40 | Clear AI voice across all dimensions |
Target for published content: 80+ with zero HIGH-severity flagged sentences.
Dependencies: `spacy` and `textstat` (Python, installed via pipenv). Requires the `en_core_web_sm` spaCy model.
How scoring works: Each of the 18 dimensions scores 0-100. The overall score is a weighted average with personality (0.12), hedge phrases (0.10), entity density (0.08), sentence variance (0.08), and vague verbs (0.07) weighted highest. Full weight table is in scripts/blog/ai_voice_analyzer.py in the analyze() function.
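A sketch of that weighted average, assuming only the five weights stated above; the other thirteen values are in the script, not reproduced here.

```python
def overall_score(scores: dict, weights: dict) -> float:
    """Weighted average over the supplied dimensions; weights are
    renormalized so a partial set need not sum to 1."""
    total_w = sum(weights[k] for k in scores)
    return sum(scores[k] * weights[k] for k in scores) / total_w

# Only these five weights are documented; the remaining thirteen live
# in analyze() in scripts/blog/ai_voice_analyzer.py.
TOP_WEIGHTS = {
    "personality": 0.12,
    "hedge_phrases": 0.10,
    "entity_density": 0.08,
    "sentence_variance": 0.08,
    "vague_verbs": 0.07,
}
```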
Sentence flagging threshold: Only sentences with 2+ co-occurring issues are flagged. Single-issue sentences are not surfaced to reduce noise. The "length close to average" flag only triggers when the document's overall sentence length std dev is below 6 (indicating genuine monotony).
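The flagging rule above can be sketched as follows. The 3+-issues-means-HIGH cutoff is an assumption for illustration; the script's exact severity rule may differ.

```python
def flag_sentences(diagnostics, doc_stdev):
    """diagnostics: list of (sentence, [issue, ...]) pairs.
    Flags only sentences with 2+ co-occurring issues; the
    'length near average' issue only counts when the document's
    sentence-length std dev is below 6 (genuine monotony).
    Severity cutoff (HIGH at 3+ issues) is illustrative."""
    flagged = []
    for sentence, issues in diagnostics:
        if doc_stdev >= 6:
            issues = [i for i in issues if i != "length_near_average"]
        if len(issues) >= 2:
            severity = "HIGH" if len(issues) >= 3 else "MED"
            flagged.append((severity, sentence, issues))
    return flagged

# Illustrative diagnostics: only B always qualifies; C depends on doc_stdev.
EXAMPLE = [
    ("Sentence A.", ["vague_verb"]),
    ("Sentence B.", ["vague_verb", "nominalization_heavy"]),
    ("Sentence C.", ["vague_verb", "length_near_average"]),
]
```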
| File | Purpose |
|---|---|
| `scripts/blog/ai_voice_analyzer.py` | Main analyzer; run on any text file |
| `scripts/blog/benchmark_models.py` | Generate and score articles across Claude and GPT models |
| `tmp/benchmark/` | Cached generated articles and `results.json` from benchmark runs |