# Corpus Discovery Dialogue

Guide Socratic discovery of research questions and analytical approaches for text corpus analysis.
Transform vague interest in a text corpus into concrete, answerable research questions with clear analytical roadmaps. Prevents aimless exploration and "fishing expedition" research.
Core innovation: Uses Socratic questioning to help researchers articulate what they don't yet consciously know they want to discover.
Core principle: Discovery before execution. Research questions emerge through dialogue, not pre-formed declarations. The researcher makes all interpretive decisions; this skill guides the process.
Invocation Triggers:
Use This When:
Skip This When:
Required:
Optional:
- corpus_file_path - If corpus is in accessible format (JSON, CSV, TXT)
- domain_context - Background information about the corpus domain
- research_constraints - Time, computational resources, publication goals

If corpus not yet loaded, ask: "Can you share 3-5 example texts from your corpus so I can understand the content?"
I'm using the Corpus Discovery Dialogue skill to help you formulate research questions
and design an analytical approach for your corpus.
This process takes roughly 25-50 minutes and involves five phases:
1. Understanding your corpus (5-10 min)
2. Exploring your interests (5-10 min)
3. Formulating questions (5-10 min)
4. Mapping methodologies (5-10 min)
5. Creating research roadmap (5-10 min)
We'll move at your pace, with confirmation checkpoints along the way.
Let's begin by understanding what you're working with.
## Phase 0: Grounding in Sample Texts

Goal: Establish a factual baseline about the corpus and ground the dialogue in actual texts
Actions:
Request corpus samples (if not already provided):
To ground our conversation in your actual data, please share 3-5 example texts
from your corpus.
This helps me:
- Understand the domain and content
- Suggest appropriate analytical methods
- Generate realistic research questions
- Ground our dialogue in evidence
You can paste the texts directly, provide file paths, or describe the corpus
with representative excerpts.
Review samples and extract initial observations:
Present initial observations for validation:
## Initial Observations from Sample Texts
Based on the examples you shared, I notice:
**Content Type:** [Theater reviews / Business reports / etc.]
**Writing Style:** [Evaluative, analytical, narrative-driven]
**Length Range:** [~300-700 words per document]
**Domain Terminology:** [Performance, staging, direction, etc.]
**Time References:** [Dates present, suggest temporal corpus]
Does this match your understanding of the full corpus?
- Yes → These samples are representative
- No → What's different about the full corpus?
Wait for user confirmation before proceeding to Phase 1.
## Phase 1: Understanding Your Corpus

Goal: Establish a factual baseline about the corpus through structured inquiry
Method: Ask questions sequentially, one at a time, with structured choices
What kind of texts make up your corpus?
A) Literary texts (novels, poetry, plays, reviews)
B) Professional writing (business documents, technical reports)
C) Social media / conversational data (tweets, chat logs, forum posts)
D) Academic writing (papers, dissertations, grant proposals)
E) Historical documents (letters, diaries, archival materials)
F) Other: [let me describe]
This helps me understand what analytical approaches are appropriate.
Wait for response. Based on the answer, follow up with domain-specific clarification if needed.
Approximately how large is your corpus?
A) Small (1-50 documents, <50K words)
- Suitable for: Close reading + computational validation
- Methods: Manual review with computational support
B) Medium (50-500 documents, 50K-500K words)
- Suitable for: Mixed methods (computational + selective manual review)
- Methods: Topic modeling, sentiment analysis, entity extraction
C) Large (500-5000 documents, 500K-5M words)
- Suitable for: Computational methods with manual validation
- Methods: Full NLP pipeline, network analysis, temporal tracking
D) Very Large (5000+ documents, 5M+ words)
- Suitable for: Fully automated computational approaches
- Methods: Large-scale topic modeling, trend analysis, statistical patterns
E) I'm not sure yet
Why this matters: Corpus size determines which analytical methods are appropriate
and how much manual validation is feasible.
Wait for response.
Do your texts span a time period?
A) Yes, specific date range: [YYYY-MM-DD to YYYY-MM-DD]
B) Yes, approximate span: [e.g., "1990s to 2020s"]
C) No, all from same time period
D) Unknown / mixed temporal markers
If yes: Are you interested in how patterns change over time?
- Very interested (temporal evolution is central question)
- Somewhat interested (might explore if patterns emerge)
- Not interested (time is just a descriptor, not analytic focus)
Wait for response. If temporal interest indicated, note for Phase 3 question generation.
Who created these texts?
A) Single author (one person's writing over time)
- Enables: Voice evolution, style development, thematic shifts
B) Small group (2-10 identified authors)
- Enables: Comparative analysis, author attribution, collaborative patterns
C) Large group (institutional/collective authorship)
- Enables: Consensus analysis, organizational voice, policy evolution
D) Anonymous / crowd-sourced (many unknown authors)
- Enables: Cultural patterns, collective discourse, emergence
E) Mixed / varies
Why this matters: Authorship affects whether you're studying individual voice evolution,
group consensus, or broader cultural patterns.
Wait for response.
What format is your corpus in?
A) Structured data (JSON, CSV, database) with metadata
- Metadata available: [dates, categories, author info, etc.]
B) Plain text files (.txt, .md)
- Organization: [by date / by category / by author / other]
C) Rich documents (PDF, DOCX) needing extraction
- Extraction needed: Yes, will require preprocessing
D) Web pages / HTML needing scraping
- Scraping needed: Yes, will require web extraction
E) Mixed formats
F) Not yet digitized (physical documents)
Do you have metadata? (dates, categories, author info, etc.)
- Yes, extensive metadata (multiple fields)
- Yes, basic metadata (dates only)
- No metadata (just text)
Wait for response.
Do you currently have access to the full corpus, or are you still gathering it?
A) Full corpus ready (can begin analysis anytime)
B) Partial corpus ready (~50-75% collected)
C) Still gathering (in early collection phase)
D) Need help acquiring/formatting the data
If gathering or formatting is needed, we can adjust timeline expectations.
Wait for response.
Output: Corpus Profile
After all Phase 1 questions answered, synthesize into structured profile:
## Corpus Profile
**Domain:** [Theater criticism / Business reports / etc.]
**Type:** [Literary reviews / Professional documents / etc.]
**Size:** [Medium corpus: ~464 documents, ~250K words]
**Temporal Span:** [2010-2025 (15 years)]
**Authorship:** [Single author]
**Format:** [JSON with date metadata]
**Metadata Available:** [Dates, titles, some categories]
**Data Status:** [Ready for analysis / Partially ready / Still gathering]
**Analytical Implications:**
- Size supports: [Topic modeling, sentiment analysis, entity extraction, network analysis]
- Temporal span enables: [Evolution tracking, trend analysis, period comparison]
- Single authorship enables: [Voice evolution, style development, personal pattern tracking]
- Metadata supports: [Temporal aggregation, categorical comparison]
Does this accurately describe your corpus?
- Yes → Proceed to Phase 2
- No → What should I correct?
Wait for user confirmation before proceeding.
If user indicates corrections: update the Corpus Profile and re-present it for confirmation before moving on.
## Phase 2: Exploring Your Interests

Goal: Uncover the researcher's tacit interests, assumptions, and motivations
Method: Ask open-ended and structured questions to reveal what researcher hopes to discover
Important: This phase reveals assumptions worth testing. "Surprise" and "disappointment" questions make tacit expectations explicit.
What drew you to this corpus?
A) Personal connection (my own writing, my field, my community)
B) Historical significance (important cultural/social moment)
C) Theoretical interest (test a hypothesis, apply a framework)
D) Practical need (dissertation, publication requirement)
E) Curiosity (something seemed interesting but not sure what)
F) Other: [describe]
Tell me more about what sparked your interest.
Wait for response. Allow open-ended elaboration.
What puzzles or intrigues you about these texts?
Think about:
- Something that seems inconsistent or contradictory
- A pattern you notice but can't quite articulate
- A change over time you sense but haven't confirmed
- A question others haven't asked but should
- Something that doesn't match your expectations
Share 1-3 puzzles or curiosities. These don't have to be fully formed —
vague hunches are exactly what we want to clarify.
Wait for response. Allow open-ended exploration. If the response is too vague, probe for specifics:
That's interesting. Can you give me an example from the texts where you noticed this?
What would surprise you to discover in this corpus?
A) That patterns you expected are absent
B) That unexpected patterns dominate
C) That the corpus contradicts prevailing wisdom in your field
D) That temporal changes are dramatic (or non-existent)
E) That relationships you assumed don't exist (or that unexpected ones do)
F) Something else: [describe]
What would genuinely surprise you?
This reveals assumptions worth testing — surprises often indicate tacit hypotheses.
Wait for response.
Follow-up:
Why would that surprise you? What do you currently assume is true?
This makes assumptions explicit and testable.
What would disappoint you NOT to find?
A) Evidence for a pattern you suspect exists
B) Temporal evolution (if corpus seems static)
C) Distinctive voice or style markers
D) Connections between elements (entities, themes, language)
E) Something you could publish or present
F) Confirmation of your field's prevailing theories
What discovery would feel like the project "didn't pay off"?
This reveals the researcher's hopes and helps ensure questions address genuine interests.
Wait for response.
What are the stakes of this research?
A) Academic (dissertation chapter, publication, tenure case)
- Who's your audience? [Field specialists / interdisciplinary / general public]
B) Professional (improve practice, inform policy, guide decisions)
- Who will use these findings?
C) Personal (understand my own work, document my evolution)
- What will this help you understand?
D) Cultural (preserve/interpret cultural moment, community history)
- What's at risk if this story isn't told?
E) Methodological (test new analytical approaches, advance methods)
- What methodological contribution does this make?
How will you use these findings? Who cares about what you discover?
Wait for response.
Output: Interest Map
After Phase 2 questions, synthesize researcher's interests into structured map:
## Interest Map
**Primary Draw:** [Personal connection - your own theater criticism over 15 years]
**Puzzles & Intrigues:**
1. [Sense that your critical voice has evolved but not sure how]
2. [Notice some plays/venues appear repeatedly - what patterns exist?]
3. [Feel like your language has changed - more/less positive? Different vocabulary?]
**Surprising Discoveries (would reveal assumptions):**
- Surprise if: [No temporal evolution - assumes change occurred]
- Surprise if: [Sentiment trends opposite to memory - assumes accurate self-awareness]
- Current assumption: [You believe you've become more/less positive over time]
**Disappointing Non-Discoveries:**
- Disappointed if: [No distinctive voice markers - wants evidence of unique style]
- Disappointed if: [No entity relationship patterns - suspects network exists]
**Stakes:** [Personal + Professional - understand critical evolution, possibly publish]
**Audience:** [Theater practitioners, other critics, potentially academic audience]
**Implicit Hypotheses Worth Testing:**
1. [Voice has evolved over 15 years - testable with stylistic/sentiment analysis]
2. [Certain entities are central to network - testable with entity extraction + network analysis]
3. [Language/sentiment has shifted - testable with sentiment analysis + lexicon comparison]
Does this capture your interests and assumptions?
- Yes → Proceed to Phase 3
- No → What should I adjust?
Wait for user confirmation.
If adjustments needed: revise the Interest Map and reconfirm before proceeding.
## Phase 3: Formulating Questions

Goal: Transform interests into concrete, answerable research questions
Method: Generate 8-12 candidate questions across categories, then guide selection
Key Principle: Questions must be concrete, answerable with the available corpus and methods, and grounded in the researcher's genuine interests.
Based on Corpus Profile (Phase 1) and Interest Map (Phase 2), generate 8-12 questions across these categories:
Template:
## Candidate Research Questions
Based on your corpus and interests, here are potential research questions organized by type.
I've tailored these to your [corpus size], [temporal span], [authorship type], and your
interest in [key interests from Phase 2].
---
### Category 1: Temporal Questions (Evolution Over Time)
[**Include ONLY if corpus has temporal span AND user indicated temporal interest**]
These questions examine how patterns change across [time span].
**Q1a: How has [specific aspect from Phase 2] evolved from [start year] to [end year]?**
**Analytical Approach:** [Method name]
- Primary tool: [Tool name and what it does]
- Procedure: [3-4 step summary]
**What This Reveals:** [What you'd learn]
**Evidence Type:** [What outputs would look like]
- [Output 1]: [Description]
- [Output 2]: [Description]
**Validation Strategy:** [How to verify findings]
- [Validation method 1]
- [Validation method 2]
**Example for your corpus:**
> **Q1a: How has the emotional tenor of your theater criticism evolved 2010-2025?**
>
> **Analytical Approach:** Sentiment analysis with temporal tracking
> - Primary tool: VADER (lexicon-based) + Transformers (distilbert-based)
> - Procedure:
> 1. Score each review with both tools
> 2. Aggregate sentiment by year
> 3. Visualize trends with confidence intervals
> 4. Identify turning points or steady trends
>
> **What This Reveals:** Whether you've become more/less positive, whether enthusiasm
> has changed, if there are distinct periods in your critical stance
>
> **Evidence Type:**
> - Sentiment scores aggregated by year (line graph with trend)
> - Statistical tests for significant changes (before/after comparison)
> - Representative reviews from high/low sentiment periods
>
> **Validation Strategy:**
> - Manual review of 20 highest-sentiment reviews (check for sarcasm, irony)
> - Manual review of 20 lowest-sentiment reviews (check for genuine negativity)
> - Compare with external events (theater trends, personal life events if known)
> - Cross-tool validation (VADER vs Transformer correlation)
>
> **Confidence:** High - sentiment analysis is well-established for this corpus size
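To make this concrete, here is a minimal scoring sketch in Python, assuming the corpus is a JSON list of objects with `text` and `date` fields (the file path, field names, and distilbert checkpoint are illustrative assumptions, not requirements):

```python
# Minimal sketch; path, field names, and model checkpoint are assumptions.
import json

import pandas as pd
from transformers import pipeline
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

with open("reviews.json") as f:   # hypothetical path
    reviews = json.load(f)        # assumed: list of {"text": ..., "date": "YYYY-MM-DD"}

vader = SentimentIntensityAnalyzer()
clf = pipeline("sentiment-analysis",
               model="distilbert-base-uncased-finetuned-sst-2-english")

rows = []
for r in reviews:
    compound = vader.polarity_scores(r["text"])["compound"]   # -1 (neg) .. +1 (pos)
    pred = clf(r["text"], truncation=True)[0]                 # respect the 512-token limit
    positive = pred["score"] if pred["label"] == "POSITIVE" else 1.0 - pred["score"]
    rows.append({"year": int(r["date"][:4]), "text": r["text"],
                 "vader": compound, "transformer": positive})

df = pd.DataFrame(rows)
print(df.groupby("year")[["vader", "transformer"]].mean())    # mean sentiment per year
```

Scoring with two independent tools makes the cross-tool correlation check possible later; documents where the two scores diverge are good candidates for manual review.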
**Q1b: How have [thematic element from Phase 2] changed over time?**
[Similar structure as Q1a]
[Present 2-3 temporal questions if applicable]
---
### Category 2: Thematic Questions (Pattern Discovery)
These questions identify dominant themes, concepts, and patterns across the corpus.
**Q2a: What are the dominant [domain-specific themes] in your [corpus type]?**
**Analytical Approach:** Topic modeling
**Example for your corpus:**
> **Q2a: What are the dominant critical frameworks in your theater reviews?**
>
> **Analytical Approach:** Topic modeling with BERTopic
> - Primary tool: BERTopic (semantic topic modeling)
> - Procedure:
> 1. Generate document embeddings (sentence transformers)
> 2. Cluster documents by semantic similarity
> 3. Extract representative terms per topic
> 4. Label topics based on term analysis + manual review
>
> **What This Reveals:** Whether you emphasize performance, direction, writing,
> social context, production values, audience experience, etc. What critical
> lenses dominate your work.
>
> **Evidence Type:**
> - Topic model with 8-15 coherent topics
> - Representative reviews per topic (example documents)
> - Topic prevalence across corpus (distribution)
> - Topic keywords and phrases (c-TF-IDF scores)
>
> **Validation Strategy:**
> - Manual review of 5 reviews per topic (coherence check)
> - Topic distinctiveness assessment (are topics meaningfully different?)
> - Compare with conscious critical approach (does this match your self-understanding?)
> - Check for junk topics (noise clusters, incoherent groupings)
>
> **Confidence:** High - BERTopic performs well on this corpus size
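A minimal sketch of this pipeline, assuming `texts` is a Python list of review strings; `min_topic_size` is an illustrative starting value, not a tuned parameter:

```python
# Minimal sketch; min_topic_size is an illustrative starting point.
from bertopic import BERTopic

topic_model = BERTopic(min_topic_size=10)
topics, probs = topic_model.fit_transform(texts)   # texts: list of review strings

print(topic_model.get_topic_info().head(15))       # topic sizes and auto-generated labels
print(topic_model.get_topic(0))                    # top c-TF-IDF terms for topic 0

# Representative reviews per topic, for the manual coherence check
for doc in topic_model.get_representative_docs(0)[:3]:
    print(doc[:200], "...")
```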
**Q2b: What [specific content type] appears most frequently?**
[Similar structure]
[Present 2-3 thematic questions]
---
### Category 3: Relational Questions (Networks & Connections)
These questions explore relationships between entities (people, places, works, concepts).
**Q3a: What relationships exist between [entity types]?**
**Analytical Approach:** Entity extraction + network analysis
**Example for your corpus:**
> **Q3a: What plays, venues, and artists are central to your critical network?**
>
> **Analytical Approach:** Named entity recognition + co-occurrence network analysis
> - Primary tool: spaCy NER (entity extraction) + NetworkX (network construction)
> - Procedure:
> 1. Extract named entities (people, organizations, works)
> 2. Build co-occurrence network (entities mentioned together)
> 3. Calculate centrality metrics (which entities are most central)
> 4. Identify clusters (which entities appear together frequently)
>
> **What This Reveals:** Which plays/venues/artists you review most frequently,
> which entities cluster together (e.g., small theater ecosystem vs. major venues),
> what the structure of your critical attention looks like
>
> **Evidence Type:**
> - Network visualization (nodes = entities, edges = co-occurrence)
> - Centrality rankings (top 20-50 entities by importance)
> - Cluster analysis (community detection results)
> - Temporal network evolution (if time-based)
>
> **Validation Strategy:**
> - Manual review of top 50 entities (correctly identified?)
> - Check for entity normalization issues ("CTG" vs "Center Theatre Group")
> - Validate co-occurrence patterns (do relationships make sense?)
> - Cross-reference with LA theater ecosystem (external validation)
>
> **Confidence:** Medium-High - NER works well but may need manual cleanup
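A minimal sketch of the extraction and network-construction steps, assuming `texts` is a list of review strings; the entity labels and the small English model are starting assumptions to adjust for your domain:

```python
# Minimal sketch; entity normalization ("CTG" vs "Center Theatre Group")
# still requires manual cleanup afterwards.
from itertools import combinations

import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")
G = nx.Graph()

for doc in nlp.pipe(texts):   # texts: list of review strings
    ents = {e.text for e in doc.ents
            if e.label_ in {"PERSON", "ORG", "WORK_OF_ART", "FAC"}}
    for a, b in combinations(sorted(ents), 2):   # co-mentioned in one review
        weight = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=weight + 1)

# Which entities anchor the critical network?
for name, score in sorted(nx.degree_centrality(G).items(),
                          key=lambda kv: -kv[1])[:20]:
    print(f"{score:.3f}  {name}")
```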
[Present 1-2 relational questions]
---
### Category 4: Stylistic Questions (Language & Voice)
These questions examine how language is used distinctively across the corpus.
**Q4a: What language distinguishes [period A] from [period B]?**
**Analytical Approach:** Comparative keyness analysis + collocation extraction
**Example for your corpus:**
> **Q4a: What language distinguishes your early (2010-2015) from recent (2020-2025) criticism?**
>
> **Analytical Approach:** Scattertext visualization + keyness testing
> - Primary tool: Scattertext (contrastive lexicon analysis)
> - Procedure:
> 1. Split corpus into early vs recent periods
> 2. Calculate keyness (log-likelihood) for all terms
> 3. Identify distinctive vocabulary for each period
> 4. Extract collocations (multi-word phrases) per period
>
> **What This Reveals:** Vocabulary shifts, new terminology adopted, phrases
> abandoned, evolving critical language
>
> **Evidence Type:**
> - Scattertext visualization (interactive chart showing distinctive terms)
> - Keyness-ranked term lists (statistically distinctive vocabulary)
> - Collocation frequency tables (signature multi-word phrases)
> - Example sentences showing terms in context
>
> **Validation Strategy:**
> - Manual review of distinctive terms (do they feel authentic?)
> - Check for stopword contamination (are distinctive terms meaningful?)
> - Verify collocation context (are phrases used consistently?)
> - Compare with self-reported language evolution
>
> **Confidence:** High - keyness testing is statistically robust
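A minimal sketch of the contrast step, assuming a dataframe `df_periods` with `text` and `period` columns, where `period` is "early" or "recent" (the dataframe name, column names, and frequency threshold are all assumptions):

```python
# Minimal sketch: contrast early vs recent vocabulary and write an
# interactive HTML explorer.
import scattertext as st
import spacy

nlp = spacy.load("en_core_web_sm")
corpus = st.CorpusFromPandas(df_periods, category_col="period",
                             text_col="text", nlp=nlp).build()

html = st.produce_scattertext_explorer(
    corpus,
    category="early",
    category_name="2010-2015",
    not_category_name="2020-2025",
    minimum_term_frequency=5,
)
with open("early_vs_recent.html", "w") as f:
    f.write(html)
```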
**Q4b: What are the signature phrases or linguistic markers of [corpus type]?**
[Similar structure]
[Present 2-3 stylistic questions]
---
### Category 5: Comparative Questions (Contrast & Difference)
[**Include ONLY if corpus has clear subcategories or comparison groups**]
These questions compare different subsets of the corpus.
**Q5a: How does [subset A] differ from [subset B]?**
[Tailor based on corpus structure - e.g., early vs late, category A vs B, positive vs negative]
[Present 1-2 comparative questions if applicable]
---
## Question Selection
Now that you've seen 8-12 possible research questions, let's select the ones
that align best with your interests and corpus capabilities.
**Selection Criteria:**
1. **Interest Alignment:** Which questions address your puzzles from Phase 2?
2. **Corpus Fit:** Which are answerable with your corpus size/format?
3. **Method Clarity:** Which have clear analytical approaches?
4. **Meaningful Findings:** Which would produce results you care about?
5. **Feasibility:** Which can you realistically complete given constraints?
**Select your top 3-5 research questions** by considering these criteria.
You can:
- Select questions as written
- Modify questions to better fit your interests
- Combine elements from multiple questions
- Propose entirely new questions inspired by these examples
Which 3-5 questions would you like to pursue?
Please indicate by question number (Q1a, Q2b, etc.) or describe modifications.
Wait for user to select/modify questions.
If user proposes modifications: restate the modified question and confirm the analytical approach still fits.
If user wants to combine questions: draft the combined question and verify it is still answerable with a single coherent method.
Output: Research Question Framework (Preliminary)
After user selects questions, create structured framework:
## Selected Research Questions
### Primary Questions (User-Selected)
**RQ1:** [Question text]
- **Priority:** High
- **Rationale:** [Why this question matters to you - drawn from Phase 2]
- **Addresses Interest:** [Links to specific puzzle/intrigue from Interest Map]
- **Expected Contribution:** [What this adds to knowledge/understanding]
**RQ2:** [Question text]
- **Priority:** High
- **Rationale:** [Why this question matters]
- **Addresses Interest:** [Links to Interest Map]
- **Expected Contribution:** [What this adds]
**RQ3:** [Question text]
- **Priority:** Medium/High
- **Rationale:** [Why this question matters]
- **Addresses Interest:** [Links to Interest Map]
- **Expected Contribution:** [What this adds]
[Continue for all selected questions]
### Deferred Questions (Interesting but Not Priority)
[Questions user found interesting but not immediate priority - keep for potential future exploration]
**Dependency Analysis:**
- **Independent questions** (can pursue in any order): [List]
- **Sequential questions** (one informs another): [List with order]
- **Complementary questions** (strengthen each other if pursued together): [List]
Does this prioritization feel right?
- Yes → Proceed to Phase 4
- No → Let's reorder or reconsider
Wait for confirmation.
## Phase 4: Mapping Methodologies

Goal: Map each selected research question to specific analytical approaches with clear implementation paths
Method: For each question, specify tools, procedures, evidence, and validation
Important: Methods must match corpus size, user expertise, and available resources.
Template (repeat for each RQ):
## Methodological Roadmap
For each research question, I'll propose a specific analytical approach with tools,
procedures, evidence types, and validation strategies.
This roadmap is designed for your corpus: [size], [format], [temporal properties].
---
### RQ1: [Question text]
**Analytical Approach:** [Method name - e.g., Sentiment Analysis with Temporal Tracking]
#### Tools Required
**Primary Tool:**
- **Name:** [e.g., VADER]
- **What it does:** [Lexicon-based sentiment scoring from -1 to +1]
- **Why this tool:** [Fast, interpretable, works well on evaluative text]
- **Installation:** `pip install vaderSentiment`
- **Estimated cost:** Free (open source)
**Secondary Tool:**
- **Name:** [e.g., Transformers (distilbert)]
- **What it does:** [Neural sentiment classification]
- **Why this tool:** [Captures nuance, handles sarcasm better]
- **Installation:** `pip install transformers` (plus a deep-learning backend such as `torch`)
- **Estimated cost:** Free (open source, runs locally)
**Validation Tool:**
- **Name:** [Manual review spreadsheet]
- **What it does:** [Human verification of computational results]
- **Why this tool:** [Ensures accuracy, catches model errors]
#### Procedure (Step-by-Step)
1. **Data Preparation** [Estimated time: 1-2 hours]
- Load corpus into analysis environment (Python/R)
- Clean text (minimal - preserve punctuation for sentiment)
- Extract temporal metadata (dates)
- Validate data quality (no missing values, encoding issues)
2. **Sentiment Scoring** [Estimated time: 2-3 hours]
- Score each document with VADER (compound score)
- Score each document with Transformer (positive probability)
- Store scores with document IDs and dates
3. **Temporal Aggregation** [Estimated time: 1-2 hours] (steps 3-4 are sketched in code after this procedure)
- Group scores by year (or month if fine-grained analysis needed)
- Calculate mean sentiment per period
- Calculate confidence intervals (standard deviation, standard error)
- Identify outliers (documents with extreme scores)
4. **Trend Analysis** [Estimated time: 1-2 hours]
- Visualize sentiment over time (line graph with error bars)
- Test for significant trends (linear regression, Mann-Kendall test)
- Identify change points (periods where trend shifts)
- Compare VADER vs Transformer results (correlation check)
5. **Validation** [Estimated time: 2-3 hours]
- Manual review of 20 highest-sentiment documents
- Manual review of 20 lowest-sentiment documents
- Check for sarcasm, irony, or misclassification
- Document validation findings
6. **Interpretation** [Estimated time: 1-2 hours]
- Write interpretation of findings
- Connect to research question
- Note limitations and caveats
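A minimal sketch of steps 3-4, assuming the `df` of per-document scores produced in step 2 (column names follow the scoring sketch under Q1a; the correlation heuristic is a rule of thumb, not a fixed requirement):

```python
# Minimal sketch of steps 3-4; assumes df has columns year, vader, transformer.
from scipy import stats

yearly = df.groupby("year").agg(
    mean_vader=("vader", "mean"),
    std_dev=("vader", "std"),
    n_documents=("vader", "size"),
)
print(yearly)

# Linear trend test: is mean sentiment drifting over time?
trend = stats.linregress(yearly.index, yearly["mean_vader"])
print(f"slope={trend.slope:.4f}  p={trend.pvalue:.4f}  R^2={trend.rvalue ** 2:.3f}")

# Cross-tool check: correlation above ~0.7 suggests the two scorers agree.
print("VADER vs transformer r =", round(df["vader"].corr(df["transformer"]), 2))
```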
#### Evidence This Produces
**Quantitative Evidence:**
- Sentiment score dataset (CSV: document_id, date, vader_score, transformer_score)
- Temporal aggregation table (year, mean_sentiment, std_dev, n_documents)
- Trend statistics (slope, p-value, R-squared)
- Correlation metrics (VADER vs Transformer agreement)
**Visual Evidence:**
- Sentiment-over-time line graph with confidence intervals
- Distribution plots (sentiment score histograms per period)
- Scatter plot (VADER vs Transformer comparison)
- Change point visualization (if applicable)
**Qualitative Evidence:**
- Example documents from high/low sentiment periods (with scores)
- Validation notes from manual review
- Interpretation connecting scores to content
**What This Shows:**
- Whether sentiment has increased/decreased over time
- How stable or volatile sentiment is
- Periods of particularly high/low sentiment
- Reliability of computational sentiment analysis for this corpus
#### Validation Strategy
**Computational Validation:**
- Cross-tool agreement (VADER vs Transformer correlation > 0.7 indicates reliability)
- Outlier analysis (investigate documents with divergent scores)
- Baseline comparison (is sentiment range plausible for this domain?)
**Manual Validation:**
- Review 40 documents (20 high, 20 low sentiment)
- Code each as: Correctly classified / Misclassified / Ambiguous
- Acceptable error rate: <20% misclassified
- If error rate high, adjust method or caveat findings
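One way to set up the manual review, sketched under the same column-name assumptions as the earlier scoring sketch:

```python
# Minimal sketch: export the 20 highest- and 20 lowest-scoring documents
# for manual coding (Correct / Misclassified / Ambiguous).
import pandas as pd

sample = pd.concat([df.nlargest(20, "vader"), df.nsmallest(20, "vader")])
sample = sample.assign(manual_code="")   # fill in during review
sample[["year", "vader", "transformer", "text", "manual_code"]].to_csv(
    "sentiment_validation_sample.csv", index=False)
```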
**External Validation:**
- Compare sentiment trends with known events (e.g., theater closures, cultural moments)
- Cross-reference with qualitative self-reflection (does trend match memory?)
- Check against field knowledge (do findings align with domain expertise?)
**Alternative Explanation Check:**
- Could sentiment change reflect selection bias (reviewing different types of shows)?
- Could it reflect linguistic drift (changing language norms, not genuine sentiment)?
- Could it reflect method artifacts (tool limitations, not real pattern)?
#### Example Output
**Finding:** Mean VADER sentiment rose from 0.34 to 0.41 (~21% relative increase) between 2010 and 2025 (p<0.01)
**Example High-Sentiment Review (2024):**
> "This production is a triumph — intimate, visceral, and deeply moving. The
> performances are revelatory."
> [VADER: 0.89, Transformer: 0.95]
**Example Low-Sentiment Review (2012):**
> "Despite strong performances, the production feels uneven and lacks a clear
> directorial vision."
> [VADER: 0.12, Transformer: 0.18]
**Interpretation:** Critical language became more enthusiastic over 15 years, though
whether this reflects genuine appreciation shifts or evolving linguistic norms
requires further investigation.
#### Confidence in Answerability
**Confidence Level:** High
**Why High Confidence:**
- Sentiment analysis is well-established for this corpus size
- Tools are validated and reliable
- Evidence type is clear and measurable
- Validation strategy is robust
- Corpus format supports this analysis (text + dates)
**Potential Challenges:**
- Sarcasm/irony may confuse sentiment tools (mitigated by manual review)
- Domain-specific language may not align with general sentiment lexicons (mitigated by dual tools)
- Selection bias (what gets reviewed) may confound trends (can be investigated but not fully resolved)
#### Estimated Effort
**Time:** 8-12 hours total
- Setup & data prep: 1-2 hours
- Analysis execution: 4-6 hours
- Validation: 2-3 hours
- Interpretation & write-up: 1-2 hours
**Expertise Required:**
- Basic Python or R (can run with provided code snippets)
- Understanding of sentiment analysis (can be learned quickly)
- Domain knowledge for validation (you have this)
**Computational Resources:**
- Standard laptop sufficient
- No GPU required
- Local execution (no API costs)
**Dependencies:**
- Completion of RQ1 doesn't block other questions (can run in parallel)
- Results inform interpretation of other analyses (sentiment context for themes)
---
**After completing RQ1 mapping, ask:**
Does this approach make sense for RQ1?
**Wait for confirmation before proceeding to next RQ.**
[Repeat template for RQ2, RQ3, etc.]
---
### After Mapping All Questions
**Output: Methodological Summary**
## Methodological Summary
| Research Question | Method | Primary Tools | Evidence Type | Confidence | Time Estimate |
|-------------------|--------|---------------|---------------|------------|---------------|
| RQ1: [short form] | Sentiment analysis | VADER, Transformers | Sentiment scores by year | High | 8-12 hours |
| RQ2: [short form] | Topic modeling | BERTopic | Topic evolution heatmaps | High | 10-15 hours |
| RQ3: [short form] | Entity extraction | spaCy, NetworkX | Co-occurrence network | Medium-High | 12-18 hours |
| RQ4: [short form] | Keyness analysis | Scattertext | Distinctive term lists | High | 6-10 hours |
[Continue for all RQs]
**Total Estimated Effort:** [Sum of estimates] hours = [X-Y weeks at Z hours/week]
**Estimated Timeline:**
- At 5 hours/week: [X weeks]
- At 10 hours/week: [Y weeks]
- At 20 hours/week: [Z weeks]
**Dependency Chain (Recommended Execution Order):**
1. **Phase A: Data Preparation** (Required first)
- Load corpus, validate format, create metadata structures (see the sketch after this list)
- Estimated time: 2-4 hours
- Blocks: All analyses
2. **Phase B: Independent Analyses** (Can run in parallel)
- RQ1: [Method] - [Time estimate]
- RQ2: [Method] - [Time estimate]
- [Continue for independent questions]
3. **Phase C: Dependent Analyses** (Require Phase B completion)
- RQ[n]: [Method] - Requires RQ[m] results
- [Continue for dependent questions]
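A minimal Phase A sketch, assuming a JSON corpus with `text` and `date` fields (the path and field names are assumptions to adjust to your actual format):

```python
# Minimal Phase A sketch: load the corpus and run basic quality checks
# before any analysis.
import json

import pandas as pd

with open("corpus.json") as f:   # hypothetical path
    df = pd.DataFrame(json.load(f))

assert {"text", "date"} <= set(df.columns), "missing required fields"
df["date"] = pd.to_datetime(df["date"], errors="coerce")

print(len(df), "documents")
print("missing dates:", df["date"].isna().sum())
print("empty texts:  ", (df["text"].str.strip() == "").sum())
print("date range:   ", df["date"].min(), "to", df["date"].max())
```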
**Critical Decision Points:**
After Data Preparation (Phase A):
- Validate data quality before proceeding
- If data issues found, resolve before continuing
- Checkpoint: "Is the corpus ready for analysis?"
After RQ1 Analysis:
- Validate results with manual review
- If validation reveals issues, tune method before proceeding
- Checkpoint: "Are computational results reliable?"
After RQ2 Analysis:
- Check if topics align with intuitions
- If topics seem incoherent, adjust parameters (min_cluster_size, diversity)
- Checkpoint: "Do topics make sense for the domain?"
[Continue for other critical checkpoints]
**Resource Requirements:**
**Software:**
- Python 3.8+ or R 4.0+
- Required packages: [List all unique packages across RQs]
- Total installation size: ~2-3 GB
**Hardware:**
- RAM: 8 GB minimum, 16 GB recommended
- Storage: [Corpus size + 2x for outputs] = [X GB]
- CPU: Standard laptop sufficient (no GPU required for this corpus size)
**External Resources:**
- None required (all tools are open source and free)
- Optional: External validation sources (theater databases, event timelines)
**Expertise:**
- Programming: Basic to intermediate (code examples provided)
- Statistics: Basic understanding of p-values, correlations
- Domain knowledge: You have this (critical for validation)
Does this roadmap feel achievable and valuable?
- Yes → Proceed to Phase 5 (Output Generation)
- No → What should we revise? (scope, methods, effort estimates, timeline)
Wait for confirmation.
If user has concerns: revisit scope, methods, effort estimates, or timeline, and revise the roadmap before proceeding.
Only proceed to Phase 5 when user confirms roadmap is achievable.
## Phase 5: Creating the Research Roadmap

Goal: Generate a comprehensive research framework document that serves as an execution guide
Method: Compile all Phase 1-4 outputs into structured markdown document with execution timeline
I'll now create your Research Questions Framework document. This will include:
- Corpus profile (from Phase 1)
- Interest map and assumptions (from Phase 2)
- Selected research questions with rationales (from Phase 3)
- Complete methodological roadmap (from Phase 4)
- Execution timeline with validation checklists
- Success criteria for the research
This document will serve as your guide throughout the analysis process.
Where should I save this?
Recommended path: ~/[project-name]/00-research-questions-framework.md
Or provide custom path: _____________
Wait for path confirmation.
If user provides directory without filename: use `00-research-questions-framework.md` as the default filename.
If path doesn't exist: [confirm with user before creating it]
Template:
---