Conduct a structured literature review on a given topic by defining a search strategy, applying inclusion and exclusion criteria, extracting key findings, and synthesizing results into a coherent academic review.
This skill enables an AI agent to conduct a rigorous, structured literature review following established academic methodology. The agent defines a search strategy with targeted keywords, applies explicit inclusion and exclusion criteria to filter results, extracts key data from selected papers, and synthesizes the findings into a thematic narrative with a summary table and reference list. The workflow is inspired by systematic review practices (including PRISMA-style reporting) and is suitable for academic research, technology landscape analysis, and evidence-based decision making.
Define the Research Question and Scope: Work with the user to formulate a precise research question using a framework such as PICO (Population, Intervention, Comparison, Outcome) or a domain-appropriate equivalent. Establish the review's scope: time range, languages, source types (journal articles, conference papers, preprints), and any domain constraints.
Develop the Search Strategy: Generate a set of search queries using combinations of primary keywords, synonyms, and Boolean operators. Identify the databases and sources to search (e.g., Google Scholar, Semantic Scholar, arXiv, PubMed, ACM Digital Library, IEEE Xplore). Document the complete search strategy for reproducibility.
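As a concrete illustration, this step can be scripted so every query is logged and reproducible. The sketch below assumes the public Semantic Scholar Graph API (`api.semanticscholar.org/graph/v1/paper/search`); the query strings, field selection, and the `run_query` helper are illustrative choices, not part of the skill itself.

```python
import requests

SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

def run_query(query: str, year_range: str = "2022-2025", limit: int = 100) -> list[dict]:
    """Run one keyword query and return raw paper records for later screening."""
    params = {
        "query": query,
        "year": year_range,  # restrict results to the review's date range
        "fields": "title,year,abstract,venue,citationCount,externalIds",
        "limit": limit,
    }
    resp = requests.get(SEARCH_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json().get("data", [])

# Document every query verbatim so the search strategy stays reproducible.
queries = [
    '"large language model evaluation benchmark" reasoning',
    '"large language model evaluation benchmark" safety',
]
records = {q: run_query(q) for q in queries}
for q, papers in records.items():
    print(f"{len(papers):4d} records  |  {q}")
```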
Screen and Filter Results: Apply predefined inclusion and exclusion criteria to the search results. Inclusion criteria typically cover topic relevance, publication date range, study type, and language. Exclusion criteria filter out duplicates, non-peer-reviewed opinion pieces, retracted papers, and off-topic results. Record the number of papers at each stage for a PRISMA-style flow.
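A minimal sketch of how the screening counts might be tracked in code; the record fields ("title", "year", "abstract", "venue", "citationCount") and the thresholds below are assumptions standing in for whatever criteria a given review defines.

```python
def dedupe(records: list[dict]) -> list[dict]:
    """Drop duplicates using a normalized title as the simplest possible key."""
    seen, unique = set(), []
    for r in records:
        key = (r.get("title") or "").strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def passes_screening(r: dict) -> bool:
    """Apply the predefined inclusion/exclusion rules to a single record."""
    in_date_range = 2022 <= (r.get("year") or 0) <= 2025
    on_topic = "benchmark" in (r.get("abstract") or "").lower()
    credible = bool(r.get("venue")) or (r.get("citationCount") or 0) >= 10
    return in_date_range and on_topic and credible

def prisma_flow(records: list[dict]) -> dict[str, int]:
    """Count records at each stage so the PRISMA-style flow can be reported."""
    flow = {"identified": len(records)}
    deduped = dedupe(records)
    flow["after_duplicate_removal"] = len(deduped)
    screened = [r for r in deduped if passes_screening(r)]
    flow["after_title_abstract_screening"] = len(screened)
    return flow
```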
Extract Key Data: For each included paper, extract structured information: title, authors, year, venue, research question, methodology, key findings, limitations, and relevance to the review question. Store this data in a consistent format (table or structured notes) for cross-paper comparison.
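One way to keep the extraction consistent across papers is a fixed record type. The sketch below is a minimal Python version whose fields simply mirror the list in this step; the example values are placeholders.

```python
from dataclasses import dataclass, asdict

@dataclass
class Extraction:
    """One row of the cross-paper comparison; fields mirror the list above."""
    paper_id: str
    title: str
    authors: str
    year: int
    venue: str
    research_question: str
    methodology: str
    key_findings: str
    limitations: str
    relevance: str

# Filled-in rows export cleanly to a table (CSV, spreadsheet, or markdown grid).
example = Extraction(
    paper_id="P01", title="An Example Benchmark Paper", authors="Doe et al.",
    year=2024, venue="Example Conference", research_question="What does it measure?",
    methodology="Static test set", key_findings="High variance across tasks",
    limitations="Narrow domain", relevance="Covers the reasoning dimension",
)
rows = [asdict(example)]
```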
Synthesize Themes and Findings: Organize extracted data into themes or categories that emerge across papers. Identify areas of consensus, debate, and gaps in the literature. Write a narrative synthesis that connects individual findings into a coherent story, supported by a summary comparison table.
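The grouping behind the narrative synthesis can be as simple as collecting papers under theme tags assigned during extraction. The tags and rows below are purely illustrative.

```python
from collections import defaultdict

# Hypothetical theme tags assigned during data extraction.
rows = [
    {"paper": "MMLU", "theme": "static knowledge tests"},
    {"paper": "GPQA", "theme": "static knowledge tests"},
    {"paper": "AgentBench", "theme": "agentic task completion"},
    {"paper": "SWE-bench", "theme": "agentic task completion"},
    {"paper": "TrustLLM", "theme": "safety and trustworthiness"},
]

by_theme = defaultdict(list)
for row in rows:
    by_theme[row["theme"]].append(row["paper"])

# Sparsely populated themes often signal gaps; dense themes show where the
# literature has converged.
for theme, papers in sorted(by_theme.items()):
    print(f"{theme}: {', '.join(papers)}")
```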
Write the Review Document: Produce the final literature review with these sections: introduction and research question, methodology (search strategy, criteria, PRISMA flow), thematic synthesis, discussion of gaps and future directions, and a complete reference list in a standard citation format.
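A minimal sketch of stitching the final document together from those sections; the bodies are placeholders for prose produced in the earlier steps, and the output filename is arbitrary.

```python
sections = {
    "Introduction and Research Question": "...",
    "Methodology (Search Strategy, Criteria, PRISMA Flow)": "...",
    "Thematic Synthesis": "...",
    "Discussion of Gaps and Future Directions": "...",
    "References": "...",
}
review_md = "\n\n".join(f"## {title}\n\n{body}" for title, body in sections.items())
with open("literature_review.md", "w", encoding="utf-8") as fh:
    fh.write(review_md)
```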
Provide the agent with a research topic or question. Optionally specify the desired scope (time range, source types), the number of papers to include, or a particular synthesis format.
Conduct a literature review on LLM evaluation benchmarks published between 2022 and 2025.
Focus: What benchmarks exist, what do they measure, and what gaps remain in evaluating reasoning, safety, and real-world task completion?
User Request:
Review the literature on LLM evaluation benchmarks from 2022 to 2025, focusing on reasoning, safety, and task completion.
Search Strategy:
| Database | Query |
|---|---|
| Semantic Scholar | "large language model evaluation benchmark" AND (reasoning OR safety OR "task completion") |
| arXiv | "LLM benchmark" AND ("2023" OR "2024" OR "2025") |
| ACM DL | "language model assessment" AND "benchmark suite" |
| Google Scholar | "LLM evaluation" survey OR "systematic review" 2023..2025 |
Inclusion/Exclusion Criteria:
| Criterion | Type | Rule |
|---|---|---|
| Published 2022-2025 | Inclusion | Must be within date range |
| Peer-reviewed or major preprint | Inclusion | Accepted at top venues or arXiv with 10+ citations |
| Proposes or surveys benchmarks | Inclusion | Must discuss specific evaluation frameworks |
| Blog posts / opinion pieces | Exclusion | No non-academic sources |
| Non-English | Exclusion | English-language only |
| Duplicates / superseded versions | Exclusion | Keep most recent version only |
PRISMA-Style Flow:
Records identified through search: 847
After duplicate removal: 612
After title/abstract screening: 148
After full-text assessment: 42
Final papers included: 42
Synthesis Table (excerpt):
| Benchmark | Year | Focus Area | Key Metric | Limitations |
|---|---|---|---|---|
| MMLU | 2023 | Knowledge & reasoning | Accuracy across 57 tasks | Static; no multi-step reasoning |
| HumanEval+ | 2023 | Code generation | pass@k | Narrow scope (Python functions) |
| AgentBench | 2023 | Real-world task completion | Success rate across 8 environments | High cost to run; environment-specific |
| TrustLLM | 2024 | Safety & trustworthiness | 6 dimensions including fairness | Self-reported; needs human validation |
| GPQA | 2024 | Graduate-level reasoning | Accuracy on expert-written questions | Small dataset; domain-specific |
| SWE-bench | 2024 | Software engineering tasks | Resolved rate on real GitHub issues | Requires execution infrastructure |
Synthesized Finding (excerpt):
The literature reveals a clear trajectory from static knowledge tests (MMLU) toward dynamic, agentic evaluations (AgentBench, SWE-bench) that measure an LLM's ability to act in realistic environments. However, a significant gap persists: no single benchmark suite comprehensively evaluates reasoning, safety, and task completion together. Most benchmarks optimize for one dimension, creating a fragmented evaluation landscape where models can appear strong on reasoning benchmarks while performing poorly on safety metrics.
User Request:
Create a systematic review protocol for studying the effectiveness of retrieval-augmented generation (RAG) in reducing LLM hallucinations.
Research Question (PICO format):
Population: Large language models performing knowledge-intensive generation tasks
Intervention: Retrieval-augmented generation (RAG)
Comparison: The same models without retrieval (non-RAG baseline)
Outcome: Hallucination rate / factual accuracy
Search Strategy:
("retrieval-augmented generation" OR "RAG") AND
("hallucination" OR "factual accuracy" OR "faithfulness") AND
("large language model" OR "LLM" OR "GPT" OR "Claude")
Databases: Semantic Scholar, arXiv, ACM Digital Library, Google Scholar
Date range: January 2023 to December 2025
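As an illustration, the Boolean query and date window could be translated into an arXiv API request roughly as follows. The `submittedDate` range filter follows arXiv's documented query syntax, but the exact query string and parameter values here are assumptions to adapt, not part of the protocol itself.

```python
import requests

ARXIV_API = "http://export.arxiv.org/api/query"

search_query = (
    '(all:"retrieval-augmented generation" OR all:RAG) AND '
    '(all:hallucination OR all:"factual accuracy" OR all:faithfulness) AND '
    'submittedDate:[202301010000 TO 202512312359]'
)
params = {
    "search_query": search_query,
    "start": 0,
    "max_results": 100,
    "sortBy": "submittedDate",
    "sortOrder": "descending",
}
resp = requests.get(ARXIV_API, params=params, timeout=30)
resp.raise_for_status()
atom_xml = resp.text  # Atom feed; parse titles and IDs with feedparser or xml.etree
```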
Screening Protocol:
Phase 1 — Title/Abstract Screening:
Include if: Empirically measures hallucination with and without RAG
Exclude if: Theoretical only, no quantitative results, not LLM-focused
Phase 2 — Full-Text Review:
Include if: Reports specific hallucination metrics (FActScore, ROUGE-L against ground truth, human evaluation scores)
Exclude if: RAG used for non-factual tasks (creative writing, code gen)
Data Extraction Template:
| Field | Description |
|---|---|
| Paper ID | Unique identifier |
| Model(s) tested | Which LLMs were evaluated |
| RAG architecture | Retrieval method, chunk size, top-k |
| Baseline | What non-RAG setup was compared |
| Hallucination metric | FActScore, human eval, accuracy, etc. |
| Result | Percentage change in hallucination rate |
| Domain | General knowledge, medical, legal, etc. |
Expected PRISMA Diagram:
Identification: ~1,200 records from 4 databases
Screening: ~400 after title/abstract review
Eligibility: ~80 after full-text assessment
Included: ~35 meeting all criteria