arXiv LaTeX Source Access Guide

URL: GET https://export.arxiv.org/api/query?id_list={arxiv_id}
Response: Atom XML with <title>, <author>, <summary>, <category>, <published>
Example: curl -s "https://export.arxiv.org/api/query?id_list=2301.00001"
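
The same query can be issued from Python. The following is a minimal sketch using only the standard library, not a definitive client: the function names parse_entry and fetch_metadata are hypothetical, and the code assumes the response uses the standard Atom namespace, with author names in a nested <name> element and categories in a term attribute.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Standard Atom namespace, assumed to be the one used by the response feed.
ATOM = "{http://www.w3.org/2005/Atom}"


def parse_entry(atom_xml: bytes) -> dict:
    """Pull the fields listed above out of the first <entry> of an Atom feed."""
    root = ET.fromstring(atom_xml)
    entry = root.find(f"{ATOM}entry")
    if entry is None:
        raise ValueError("no <entry> element in response")
    return {
        # Titles may be wrapped across lines, so collapse runs of whitespace.
        "title": " ".join(entry.findtext(f"{ATOM}title", "").split()),
        "authors": [a.findtext(f"{ATOM}name", "") for a in entry.findall(f"{ATOM}author")],
        "summary": entry.findtext(f"{ATOM}summary", "").strip(),
        "categories": [c.get("term") for c in entry.findall(f"{ATOM}category")],
        "published": entry.findtext(f"{ATOM}published", ""),
    }


def fetch_metadata(arxiv_id: str) -> dict:
    """Query the metadata endpoint shown above for a single arXiv ID."""
    url = f"https://export.arxiv.org/api/query?id_list={arxiv_id}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_entry(resp.read())
```

Calling fetch_metadata("2301.00001") would return a plain dict. Keeping the parsing in its own function lets it be exercised on a saved response without touching the network.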

Overview

arXiv stores the original LaTeX source files for the vast majority of its 2.4 million+ preprints. Accessing LaTeX source provides major advantages over PDF parsing: exact mathematical notation as written by the author, structured sections and labels, machine-readable bibliography entries, and intact figure captions, table data, and cross-references.

For formula extraction, citation graph construction, section-level text analysis, or training data curation for scientific language models, LaTeX source is the gold standard. PDF parsing, by contrast, introduces text-extraction and OCR errors in equations, loses the structural hierarchy, and mangles complex tables.

The e-print endpoint serves source bundles as gzip-compressed tarballs (.tar.gz) containing .tex files, figures, .bib/.bbl bibliography files, style files, and supplementary materials. No authentication is required.
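
A bundle of this kind can be unpacked in memory with Python's standard library. This is a minimal sketch under stated assumptions: the URL pattern in EPRINT_URL is not given in this guide and is assumed here, the function names are hypothetical, and the fallback for a bare gzipped .tex file (stored under the placeholder name "main.tex") covers a case this guide does not describe.

```python
import gzip
import io
import tarfile
import urllib.request

# Assumption: this URL pattern is not stated in the guide; adjust as needed.
EPRINT_URL = "https://arxiv.org/e-print/{arxiv_id}"


def extract_tex(bundle: bytes) -> dict[str, str]:
    """Return {member name: text} for every .tex file in a .tar.gz bundle."""
    sources = {}
    try:
        with tarfile.open(fileobj=io.BytesIO(bundle), mode="r:gz") as tar:
            for member in tar.getmembers():
                if member.isfile() and member.name.endswith(".tex"):
                    data = tar.extractfile(member).read()
                    # Older submissions are not always UTF-8; replace bad bytes.
                    sources[member.name] = data.decode("utf-8", errors="replace")
    except tarfile.ReadError:
        # Assumption: the payload is a single gzipped .tex file, not a tarball.
        # "main.tex" is a placeholder name chosen for this sketch.
        sources["main.tex"] = gzip.decompress(bundle).decode("utf-8", errors="replace")
    return sources


def fetch_source(arxiv_id: str) -> dict[str, str]:
    """Download one source bundle and return its .tex files as text."""
    url = EPRINT_URL.format(arxiv_id=arxiv_id)
    with urllib.request.urlopen(url, timeout=60) as resp:
        return extract_tex(resp.read())
```

Reading members through extractfile avoids writing untrusted archive paths to disk, which is a reasonable default when processing many bundles.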

Authentication

No authentication or API key is required; the e-print endpoint is publicly accessible. However, arXiv asks that automated tools identify themselves with a descriptive User-Agent header and comply with its rate limits.
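
One way an automated tool might identify itself and pace its requests is sketched below. Everything specific here is an assumption rather than arXiv's stated policy: the User-Agent string and contact address are placeholders, the 3-second interval is an illustrative value that should be checked against arXiv's current guidance, and build_request and Throttle are hypothetical helper names.

```python
import time
import urllib.request

# Placeholder identity; replace with your tool's name and a real contact address.
USER_AGENT = "example-latex-fetcher/0.1 (mailto:you@example.org)"
# Assumed minimum gap between requests, in seconds; not taken from this guide.
MIN_INTERVAL_S = 3.0


def build_request(url: str) -> urllib.request.Request:
    """Attach the descriptive User-Agent header to an outgoing request."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


class Throttle:
    """Block so that successive wait() calls are at least min_interval apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

In a download loop, calling throttle.wait() before each urllib.request.urlopen(build_request(url)) keeps the tool under the chosen rate. The clock and sleep parameters exist only so the pacing logic can be checked without real delays.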