Overview

The PRIDE Archive (PRoteomics IDEntifications database) at EBI is the world's largest public repository of mass spectrometry-based proteomics data, containing 30,000+ datasets from peer-reviewed publications. The REST API v2 at https://www.ebi.ac.uk/pride/ws/archive/v2/ provides project discovery, file listing, peptide/PSM identification retrieval, and protein-level evidence, all without authentication. Data types include RAW files, peak lists (mzML, MGF), PRIDE XML result files, and processed identification tables.

When to Use

Finding published proteomics datasets by organism, tissue, disease keyword, or instrument type for meta-analysis or benchmarking
Downloading raw mass spectrometry data (RAW, mzML) or peak files (MGF) from a specific PRIDE project accession
Retrieving peptide identification tables with sequence, modification, and confidence score for a project
Querying protein-level evidence (PSMs, unique peptides) for a protein of interest across PRIDE projects
Checking whether a protein has experimental proteomics evidence in a specific tissue or disease context
Building training datasets of confident peptide-spectrum matches (PSMs) for proteomics ML applications

import requests
import pandas as pd
from pathlib import Path
from typing import List, Optional

PRIDE_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v2"
MANIFEST_COLUMNS = ["file_name", "file_type", "size_mb", "url", "local_path"]


def build_download_manifest(accession: str, file_types: Optional[List[str]] = None,
                            output_dir: str = ".") -> pd.DataFrame:
    """Build a download manifest for PRIDE project files.

    Parameters
    ----------
    accession : str
        PRIDE accession.
    file_types : list
        List of file types to include (e.g., ['RAW', 'PEAK', 'RESULT']).
        None = include all.
    output_dir : str
        Local directory for downloaded files.
    """
    # pageSize stays within the 1-100 range listed under Key Parameters
    params = {"pageSize": 100, "page": 0}
    records, page = [], 0
    while True:
        params["page"] = page
        r = requests.get(
            f"{PRIDE_BASE}/projects/{accession}/files",
            params=params,
            headers={"Accept": "application/json"},
            timeout=30,
        )
        r.raise_for_status()
        data = r.json()
        batch = data.get("_embedded", {}).get("files", [])
        for f in batch:
            ftype = (f.get("fileCategory") or {}).get("value", "OTHER")
            if file_types and ftype not in file_types:
                continue
            # Prefer the FTP location; fall back to the first listed location.
            # An empty or missing list yields an empty URL instead of an IndexError.
            locations = f.get("publicFileLocations") or [{}]
            url = next(
                (loc["value"] for loc in locations if loc.get("name") == "FTP Protocol"),
                locations[0].get("value", ""),
            )
            records.append({
                "file_name": f.get("fileName"),
                "file_type": ftype,
                "size_mb": round(f.get("fileSize", 0) / 1e6, 1),
                "url": url,
                "local_path": str(Path(output_dir) / f.get("fileName", "unknown")),
            })
        total_pages = data.get("page", {}).get("totalPages", 1)
        page += 1
        if page >= total_pages or not batch:
            break
    # Explicit columns keep an empty manifest usable (e.g., manifest["size_mb"].sum())
    return pd.DataFrame(records, columns=MANIFEST_COLUMNS)


accession = "PXD004131"
output_dir = f"/data/pride/{accession}"
manifest = build_download_manifest(accession, file_types=["RAW", "RESULT"], output_dir=output_dir)
print(f"Files to download: {len(manifest)}")
print(f"Total size: {manifest['size_mb'].sum():.0f} MB")
print(manifest.groupby("file_type")[["file_name", "size_mb"]].head(3).to_string())

# Export wget batch file, skipping files without a download URL
wget_lines = [f"wget -P {output_dir} '{row.url}'" for _, row in manifest.iterrows() if row.url]
with open(f"download_{accession}.sh", "w") as fh:
    fh.write("\n".join(wget_lines))
print(f"\nExported wget script with {len(wget_lines)} download commands")

File type	Description	Format examples
`RAW`	Unprocessed instrument output	.raw (Thermo), .d (Bruker/Agilent)
`PEAK`	Centroided or deconvoluted spectra	mzML, mzXML, MGF
`RESULT`	Identification results from search engine	mzIdentML, PRIDE XML, MaxQuant txt
`FASTA`	Protein sequence databases used in search	.fasta
`OTHER`	Supplementary files (scripts, tables)	.txt, .xlsx, .csv
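
The categories above lend themselves to a quick per-category tally before any download is started. The following is a minimal sketch, not an official client: the helper name `summarize_by_category` is hypothetical, and the record fields (`fileCategory.value`, `fileSize` in bytes) are assumed to match those read by the manifest code in this document rather than verified against the live API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping

# Categories as listed in the file type table
KNOWN_CATEGORIES = ("RAW", "PEAK", "RESULT", "FASTA", "OTHER")


def summarize_by_category(files: Iterable[Mapping]) -> Dict[str, Dict[str, float]]:
    """Count files and total size (MB) per PRIDE file category.

    Assumes each record carries `fileCategory.value` and `fileSize` (bytes);
    records with a missing or unrecognised category are counted as OTHER.
    """
    summary = defaultdict(lambda: {"count": 0, "size_mb": 0.0})
    for f in files:
        category = (f.get("fileCategory") or {}).get("value", "OTHER")
        if category not in KNOWN_CATEGORIES:
            category = "OTHER"
        summary[category]["count"] += 1
        summary[category]["size_mb"] += f.get("fileSize", 0) / 1e6
    return {
        cat: {"count": v["count"], "size_mb": round(v["size_mb"], 1)}
        for cat, v in summary.items()
    }


# Made-up records for illustration only; not taken from a real project
example_files = [
    {"fileName": "sample_01.raw", "fileCategory": {"value": "RAW"}, "fileSize": 1_500_000_000},
    {"fileName": "sample_02.raw", "fileCategory": {"value": "RAW"}, "fileSize": 500_000_000},
    {"fileName": "results.mzid", "fileCategory": {"value": "RESULT"}, "fileSize": 25_000_000},
    {"fileName": "notes.txt", "fileSize": 1_000},
]
for category, stats in summarize_by_category(example_files).items():
    print(f"{category}: {stats['count']} file(s), {stats['size_mb']} MB")
```

The function works on plain dictionaries, so it can be fed the `files` list from one response page or the concatenation of several pages.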

Parameter	Endpoint	Default	Range / Options	Effect
`keyword`	`GET /projects`	—	free-text string	Full-text search across title and description
`organisms`	`GET /projects`	—	organism name string (e.g., `"Homo sapiens"`)	Filter projects by organism
`tissues`	`GET /projects`	—	tissue name string (e.g., `"liver"`)	Filter projects by tissue
`diseases`	`GET /projects`	—	disease keyword	Filter projects by disease annotation
`instruments`	`GET /projects`	—	instrument name (e.g., `"Orbitrap"`)	Filter projects by MS instrument
`fileType`	`GET /projects/{acc}/files`	all	`RAW`, `PEAK`, `RESULT`, `FASTA`, `OTHER`	Filter files by category
`pageSize`	all list endpoints	`20`	`1`–`100`	Results per page
`page`	all list endpoints	`0`	non-negative integer	0-indexed page for pagination
`projectAccessions`	`GET /peptides`, `/psms`, `/proteins`	—	`PXD######` string	Restrict identifications to a specific project
`proteinAccession`	`GET /peptides`, `/psms`, `/proteins`	—	UniProt accession	Filter by protein
`peptideSequence`	`GET /peptides`, `/psms`	—	amino acid sequence string	Filter identifications by peptide sequence
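
The parameters above can be assembled and validated before any request is sent. The sketch below is illustrative only: `build_project_query`, `project_search_url`, and `iter_pages` are hypothetical helper names, the `/projects` path and the 1 to 100 `pageSize` range are taken from the table rather than checked against the live service, and the `page.totalPages` field is assumed to match the response shape used by the manifest code in this document.

```python
from typing import Callable, Dict, Iterator, Optional
from urllib.parse import urlencode

PRIDE_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v2"
MAX_PAGE_SIZE = 100  # upper bound as stated in the parameter table


def build_project_query(keyword: Optional[str] = None,
                        organisms: Optional[str] = None,
                        tissues: Optional[str] = None,
                        diseases: Optional[str] = None,
                        instruments: Optional[str] = None,
                        page_size: int = 20,
                        page: int = 0) -> Dict[str, object]:
    """Return query parameters for a project search, rejecting out-of-range paging."""
    if not 1 <= page_size <= MAX_PAGE_SIZE:
        raise ValueError(f"pageSize must be between 1 and {MAX_PAGE_SIZE}")
    if page < 0:
        raise ValueError("page is 0-indexed and must be non-negative")
    params: Dict[str, object] = {"pageSize": page_size, "page": page}
    filters = {
        "keyword": keyword,
        "organisms": organisms,
        "tissues": tissues,
        "diseases": diseases,
        "instruments": instruments,
    }
    # Only send the filters that were actually set
    params.update({name: value for name, value in filters.items() if value})
    return params


def project_search_url(**kwargs) -> str:
    """Full URL for the project search (path assumed from the table)."""
    return f"{PRIDE_BASE}/projects?{urlencode(build_project_query(**kwargs))}"


def iter_pages(fetch_page: Callable[[int], dict], max_pages: int = 1000) -> Iterator[dict]:
    """Yield response pages from a 0-indexed paginated endpoint.

    `fetch_page(n)` is any callable returning the decoded JSON of page n;
    iteration stops once `page.totalPages` is reached or max_pages is hit.
    """
    page = 0
    while page < max_pages:
        data = fetch_page(page)
        yield data
        total_pages = data.get("page", {}).get("totalPages", 1)
        page += 1
        if page >= total_pages:
            break


print(project_search_url(keyword="breast cancer", organisms="Homo sapiens"))
```

Passing the fetch step in as a callable keeps the pagination logic independent of the HTTP library, so the same loop can be exercised against canned responses.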

Problem	Cause	Solution
`HTTP 404` on project lookup	Accession not found or not public	Verify accession format (`PXD######`); some datasets are under embargo until publication
`HTTP 429 Too Many Requests`	Exceeded ~50 req/min rate limit	Add `time.sleep(1.2)` between requests; implement exponential backoff for bursts
Empty `_embedded` object	No results match the query	Broaden search terms; check organism spelling (exact match required, e.g., `"Homo sapiens"`)
Empty peptide/PSM results	Project has no identification data loaded	Newer projects may not yet have identifications indexed; use `RESULT` file download instead
Download URL is empty string	File not yet available on FTP	Check `publicFileLocations` list for alternative URLs; some files are HTTPS-only
Very large file manifest	Project has hundreds of files	Use `fileType` filter to restrict to relevant types; build a manifest before downloading
`ConnectionError` or `ReadTimeout`	Transient EBI infrastructure issue	Retry after 60 seconds; EBI services occasionally have brief maintenance windows
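
The retry advice in the table can be wrapped in a small helper. This is a sketch under stated assumptions rather than a prescribed implementation: `is_transient` and `call_with_backoff` are hypothetical names, the 1.2 second base delay comes from the table, the doubling schedule is one common way to do exponential backoff, and the error classes are the standard-library ones (a `requests`-based client would check its own exception types and status codes instead).

```python
import time
import urllib.error
from typing import Callable, TypeVar

T = TypeVar("T")


def is_transient(exc: Exception) -> bool:
    """True for errors worth retrying: HTTP 429 and network-level failures."""
    if isinstance(exc, urllib.error.HTTPError):
        # HTTPError subclasses URLError, so it must be checked first
        return exc.code == 429
    return isinstance(exc, (urllib.error.URLError, ConnectionError, TimeoutError))


def call_with_backoff(call: Callable[[], T],
                      max_attempts: int = 5,
                      base_delay: float = 1.2,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying transient failures with exponentially growing delays.

    Delays are base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    Non-transient errors and the final failed attempt are re-raised unchanged.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            if not is_transient(exc) or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")  # loop always returns or raises
```

A request function can then be passed as a zero-argument callable, for example `call_with_backoff(lambda: fetch_json(url))`, where `fetch_json` stands for whatever function performs the actual HTTP call.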
