A skill for ethically collecting bibliometric data from Google Scholar, including search results, citation counts, author profiles, and related articles. Covers rate limiting, CAPTCHA avoidance, alternative APIs, legal considerations, data parsing, and practical workflows that balance data needs with responsible access.

Legal and Ethical Considerations

Before You Scrape

Google Scholar does not offer an official API, and its Terms of Service restrict automated access. Researchers must weigh their data needs against legal and ethical constraints.

Legal landscape:

Terms of Service:
  - Google's ToS prohibit automated queries
  - Violation can result in IP blocking (temporary or permanent)
  - Institutional IPs can be blocked, affecting all campus users
  - Whether ToS are legally enforceable against non-commercial
    academic research varies by jurisdiction and remains debated

Ethical guidelines:
  - Minimize load: respect the server, use delays between requests
  - Cache aggressively: never request the same page twice
  - Use official alternatives first (see below)
  - Do not redistribute raw scraped data
  - Cite Google Scholar as your data source in publications
  - Consider whether your research question truly requires
    Google Scholar data, or if Web of Science, Scopus, or
    OpenAlex could answer it instead
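
The "minimize load" and "cache aggressively" guidelines can be combined in one small wrapper. The following is a minimal sketch, not part of any library: `PoliteFetcher`, its parameters, and the 10-second default delay are hypothetical names and values chosen for illustration. It takes any fetch function you supply, stores each response on disk, and waits between uncached requests.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable, Optional


class PoliteFetcher:
    """Wrap a fetch function with an on-disk cache and a minimum delay.

    Hypothetical helper for illustration; `fetch` is any callable that
    takes a URL and returns the page body as text.
    """

    def __init__(self, fetch: Callable[[str], str], cache_dir: str,
                 min_delay: float = 10.0,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic):
        self.fetch = fetch
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.min_delay = min_delay
        self.sleep = sleep
        self.clock = clock
        self._last_request: Optional[float] = None

    def _cache_path(self, url: str) -> Path:
        # Hash the URL so any query string maps to a safe file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.html"

    def get(self, url: str) -> str:
        path = self._cache_path(url)
        if path.exists():
            # Cache hit: no network request and no delay.
            return path.read_text(encoding="utf-8")
        if self._last_request is not None:
            # Wait out the remainder of the minimum delay, if any.
            wait = self.min_delay - (self.clock() - self._last_request)
            if wait > 0:
                self.sleep(wait)
        body = self.fetch(url)
        self._last_request = self.clock()
        path.write_text(body, encoding="utf-8")
        return body
```

Because the cache lives on disk, re-running an interrupted job does not repeat requests that already succeeded. The `sleep` and `clock` parameters exist only so the delay logic can be tested without real waiting.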

Official and semi-official alternatives:
  - OpenAlex API: free, no key required, excellent coverage
  - Crossref API: free, DOI-based metadata and citation counts
  - CORE API: free, full-text open access content
  - Google Scholar Alerts: manual but ToS-compliant monitoring
  - Publish or Perish (software): uses Google Scholar with built-in
    rate limiting, commonly used in bibliometric research
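
As an illustration of trying an official alternative first, here is a minimal sketch of a keyword search against the OpenAlex works endpoint using only the standard library. The helper names are hypothetical. The query parameters (`search`, `per-page`, `mailto`) and response fields (`results`, `display_name`, `publication_year`, `cited_by_count`, `doi`) reflect the OpenAlex API as commonly documented and should be checked against the current documentation, as should the assumption that unauthenticated requests are still accepted.

```python
import json
import urllib.parse
import urllib.request

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_openalex_url(query, per_page=25, mailto=None):
    """Build a works-search URL; `mailto` optionally identifies the caller."""
    params = {"search": query, "per-page": per_page}
    if mailto:
        params["mailto"] = mailto
    return f"{OPENALEX_WORKS_URL}?{urllib.parse.urlencode(params)}"


def parse_openalex_works(payload):
    """Reduce an OpenAlex works response to a list of flat dicts."""
    rows = []
    for work in payload.get("results", []):
        rows.append({
            "title": work.get("display_name") or "",
            "year": work.get("publication_year"),
            "citations": work.get("cited_by_count", 0),
            "doi": work.get("doi") or "",
        })
    return rows


def search_openalex(query, per_page=25, mailto=None, timeout=30):
    """Run one search request and return the parsed rows."""
    url = build_openalex_url(query, per_page, mailto)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_openalex_works(json.load(response))
```

Keeping URL construction and parsing separate from the network call makes both easy to check offline, for example against a saved response.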

Data Collection Approaches

Using Scholarly (Python Library)

```python
import time

from scholarly import scholarly, ProxyGenerator


def setup_scholarly_with_proxy():
    """
    Configure scholarly with a free proxy to reduce blocking risk.
    For heavy usage, consider ScraperAPI or similar paid services.
    """
    pg = ProxyGenerator()
    # Free proxy (less reliable, suitable for small jobs)
    pg.FreeProxies()
    scholarly.use_proxy(pg)


def search_scholar(query, max_results=20):
    """
    Search Google Scholar and collect structured results.

    IMPORTANT: Add delays between queries to avoid blocking.
    Recommended: 10-30 seconds between searches.
    """
    results = []
    search_query = scholarly.search_pubs(query)

    for _ in range(max_results):
        try:
            result = next(search_query)
        except StopIteration:
            break  # fewer results available than max_results

        bib = result.get("bib", {})
        results.append({
            "title": bib.get("title", ""),
            "author": bib.get("author", []),
            "year": bib.get("pub_year", ""),
            "venue": bib.get("venue", ""),
            "abstract": bib.get("abstract", ""),
            "citations": result.get("num_citations", 0),
            "url": result.get("pub_url", ""),
        })

        # Rate limiting: wait between result fetches
        time.sleep(2)

    return results


def get_author_profile(author_name):
    """
    Retrieve an author's Google Scholar profile.
    Includes h-index, i10-index, and publication count.
    Returns None if no matching profile is found.
    """
    search_query = scholarly.search_author(author_name)
    author = next(search_query, None)
    if author is None:
        return None

    # Fetch the full profile (indices, interests, publications)
    author = scholarly.fill(author)

    return {
        "name": author.get("name", ""),
        "affiliation": author.get("affiliation", ""),
        "h_index": author.get("hindex", 0),
        "i10_index": author.get("i10index", 0),
        "cited_by": author.get("citedby", 0),
        "interests": author.get("interests", []),
        "publications": len(author.get("publications", [])),
    }
```
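
The 10-30 second gap between searches recommended in the `search_scholar` docstring is not enforced by the function itself. One way to apply it across several queries is sketched below. `run_queries` is a hypothetical helper, and randomizing the delay within the recommended range is an assumption made here rather than something the source prescribes.

```python
import random
import time


def run_queries(queries, search_fn, min_delay=10.0, max_delay=30.0,
                sleep=time.sleep, rng=random.uniform):
    """Run search_fn once per query, pausing between consecutive queries.

    search_fn is any callable taking a query string, such as a function
    like search_scholar. sleep and rng are injectable for testing.
    """
    results = {}
    for i, query in enumerate(queries):
        if i > 0:
            # Pause before every query except the first.
            sleep(rng(min_delay, max_delay))
        results[query] = search_fn(query)
    return results
```

A call such as `run_queries(["bibliometrics", "altmetrics"], search_scholar)` would then issue the two searches with a pause of 10 to 30 seconds between them.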

Google Scholar Scraper

Legal and Ethical Considerations

Before You Scrape

Data Collection Approaches

Using Scholarly (Python Library)

Rate Limiting and Anti-Blocking

Best Practices

Handling CAPTCHAs and Blocks

Data Processing and Storage

Structuring Collected Data

Recommended Alternatives to Scraping

When Not to Scrape Google Scholar
