Use after research-design is complete and a biological question, target taxa, and marker strategy are defined. Surveys GenBank and SRA for available data, builds a taxon-by-marker coverage matrix, proposes sampling plans with explicit trade-offs, and downloads approved data. Use when the researcher needs to find, evaluate, and obtain molecular sequence data for phylogenetic analysis.
Survey available data before downloading anything. Build a coverage matrix, propose sampling plans that maximize both taxa and markers, get researcher approval, then download. Never download without an approved plan.
Load reports/research-design_YYYY-MM-DD.md. Extract the biological question, target taxa, and marker strategy.
Search GenBank and SRA before touching any download command.
GenBank search (assembled sequences, gene records, plastomes, whole genomes):
# Entrez Direct — adapt terms per group and markers
esearch -db nuccore -query '"<group>" AND ("<marker1>" OR "<marker2>")' \
| efetch -format docsum | xtract -pattern DocumentSummary \
-element AccessionVersion Organism Title Length
SRA search (raw reads — genome skimming, WGS, HybSeq, transcriptomes):
esearch -db sra -query '"<group>" AND ("genome skimming" OR "WGS" \
OR "target enrichment" OR "transcriptome")' \
| efetch -format docsum | xtract -pattern DocumentSummary \
-element Run@acc ScientificName LibraryStrategy LibrarySource
Keyword templates — adapt per group:
"<group name>" AND (phylogeny OR phylogenetic OR phylogenom* OR plastome
OR "genome skimming" OR "target enrichment" OR transcriptome)
"<group name>" AND (rbcL OR matK OR trnL OR ITS OR psbA OR nrITS)
Supplement with database-specific searches as appropriate.
Compile a taxa × data-type availability table from the survey results:
| Taxon | rbcL | matK | ITS | Plastome (SRA) | HybSeq (SRA) |
|---|---|---|---|---|---|
| Sp. A | ✓ | ✓ | ✓ | — | ✓ |
| Sp. B | ✓ | — | ✓ | ✓ | — |
| ... |
Compute summary statistics: total taxa found, per-marker taxon counts, data type breakdown, estimated missing data % at different cut-offs.
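The summary statistics can be sketched as a single awk pass over the coverage matrix. This is a minimal sketch, assuming a tab-separated file `coverage.tsv` (first column taxon, then one 1/0 column per marker); the file name, layout, and `summarize_coverage` helper are illustrative assumptions, not module outputs.

```shell
# Per-marker taxon counts and overall missing-data % from an assumed
# tab-separated 1/0 coverage matrix (header row: taxon, then markers).
summarize_coverage() {
    awk -F'\t' '
        NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }
        {
            taxa++
            for (i = 2; i <= NF; i++) { cells++; if ($i == 1) { hit[i]++; filled++ } }
        }
        END {
            printf "taxa %d\n", taxa
            for (i = 2; i <= NF; i++) printf "%s %d\n", name[i], hit[i]
            printf "missing_pct %.1f\n", 100 * (cells - filled) / cells
        }' "$1"
}
```

Usage: `summarize_coverage coverage.tsv` prints one line per marker plus the overall missing-data percentage, which feeds directly into the per-plan estimates.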
Core principle: maximize both taxon count and marker count. Plans represent natural trade-off thresholds in the data — not arbitrary points.
Identify thresholds where the trade-off changes sharply (e.g., dropping from 8 to 3 markers adds 40 more taxa — that is a natural plan boundary). Generate as many plans as meaningfully distinct thresholds exist.
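Threshold scanning can be sketched over the same assumed tab-separated 1/0 coverage matrix: for each cutoff c, count taxa carrying at least c markers. Sharp jumps between consecutive cutoffs mark natural plan boundaries. `taxa_per_cutoff` is an illustrative helper, not a module script.

```shell
# Taxa retained at each "minimum markers present" cutoff, from an assumed
# tab-separated 1/0 coverage matrix (header row: taxon, then markers).
taxa_per_cutoff() {
    awk -F'\t' '
        NR == 1 { m = NF - 1; next }
        { k = 0; for (i = 2; i <= NF; i++) k += ($i == 1); count[k]++ }
        END {
            # cumulative from the strictest cutoff down: taxa with >= c markers
            for (c = m; c >= 1; c--) { cum += count[c]; printf "min_markers=%d taxa=%d\n", c, cum }
        }' "$1"
}
```

Reading the output from strict to loose, a large jump in taxa between two adjacent cutoffs (e.g., 8 markers to 3 markers) is exactly the kind of boundary a plan should sit on.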
For each plan present:
- taxon count and marker count
- estimated missing data %
- downstream route: assembly (raw SRA reads) or alignment (assembled sequences)
- the trade-off it represents
Label plans descriptively from the data (e.g., breadth-optimized, depth-optimized, balanced, all-inclusive); do not pre-assign labels.
Present all plans to the researcher. They may select one or multiple.
If multiple plans selected: each runs as an independent parallel analysis. Create plan subdirectories:
reports/planA/
reports/planB/
All downstream module reports for each plan go into its subdirectory.
Do not download anything until at least one plan is approved.
Before downloading any SRA data, compare estimated dataset size against available server storage.
Check available storage:
df -h . # available space in current working directory
df -h data/ # or wherever data will be stored
Estimate SRA dataset sizes from metadata (before downloading):
# Get file size for each SRA run from runinfo metadata. Column positions
# can vary across Entrez Direct versions, so select fields by header name:
esearch -db sra -query "<SRR_ID>" | efetch -format runinfo \
| awk -F',' 'NR==1 {for (i=1; i<=NF; i++) col[$i]=i; next}
             {print $col["Run"], $col["size_MB"], $col["LibraryStrategy"], $col["ScientificName"]}'
# Or use SRA toolkit
vdb-dump --info <SRR_ID> | grep -i "size"
Decision — apply per plan:
| Condition | Mode | Description |
|---|---|---|
| Available storage > estimated raw data × 1.5 | Bulk mode | Download all → assemble all → keep assemblies |
| Available storage ≤ estimated raw data × 1.5 | Streaming mode | Download one sample → assemble → keep assembly → delete raw → next |
The 1.5× buffer accounts for intermediate files created during assembly (GetOrganelle, HybPiper, etc. produce large temporary directories).
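The decision rule above can be expressed as a small helper. `choose_mode` and its MB inputs are illustrative names, and the commented `df` line shows one assumed way to obtain available space for the data directory.

```shell
# Pick bulk vs. streaming mode given estimated raw size and available space,
# both in MB, applying the 1.5x buffer for assembly intermediates.
choose_mode() {  # choose_mode <estimated_raw_MB> <available_MB>
    est=$1; avail=$2
    need=$(( est * 3 / 2 ))   # 1.5x buffer, kept in integer arithmetic
    if [ "$avail" -gt "$need" ]; then echo bulk; else echo streaming; fi
}
# Example wiring (assumed layout):
# avail_mb=$(df -Pm data/ | awk 'NR==2 {print $4}')
```

Note the comparison is strict: a dataset that exactly fills the 1.5x budget still routes to streaming mode, which is the conservative choice.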
Bulk mode (storage is sufficient):
# Download all SRA runs first
for SRR_ID in $(cat sra_list.txt); do
prefetch $SRR_ID -O data/raw/
fasterq-dump data/raw/$SRR_ID/ -O data/raw/ --split-files
done
# Then hand off to assembly
Streaming mode (storage is limited — download → assemble → delete raw, one sample at a time):
for SRR_ID in $(cat sra_list.txt); do
echo "Processing $SRR_ID..."
# 1. Download
prefetch $SRR_ID -O data/raw/
fasterq-dump data/raw/$SRR_ID/ -O data/raw/ --split-files
# 2. Assemble immediately (adapt command to assembly strategy)
get_organelle_from_reads.py \
-1 data/raw/${SRR_ID}_1.fastq -2 data/raw/${SRR_ID}_2.fastq \
-F embplant_pt -o data/assembled/${SRR_ID}/ -t 8
# 3. Verify assembly output exists and is non-empty before deleting raw data
#    (a glob quoted inside [ -f ... ] is not expanded, so use find instead)
if [ -n "$(find data/assembled/${SRR_ID} -name '*.fasta' -size +0c 2>/dev/null)" ]; then
rm -rf data/raw/$SRR_ID/
rm -f data/raw/${SRR_ID}*.fastq
echo "$SRR_ID: assembly complete, raw reads removed"
else
echo "$SRR_ID: assembly FAILED — raw reads retained for debugging"
fi
# 4. Log storage after each sample
df -h . | tail -1
done
Critical: In streaming mode, only delete raw reads after confirming the assembly output file exists and is non-empty. Never delete before verifying. Log each deletion in the report.
Assembled sequences from GenBank (not affected by storage mode — these are small):
efetch -db nuccore -id <accession_list> -format fasta > sequences.fasta
File naming convention — enforce strictly:
<Genus>_<species>_<accession>_<marker>.fasta
# e.g. Zingiber_officinale_MN123456_matK.fasta
# Curcuma_longa_SRR9876543_WGS.fastq
Mixed-source downloads are a common source of naming mismatches — standardize at download time, not later.
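One way to enforce the convention at download time is a small helper that builds the filename from the organism string and accession. `standardize_name` is a hypothetical helper, and it assumes the organism field starts with "Genus species" as returned in GenBank/SRA metadata.

```shell
# Build <Genus>_<species>_<accession>_<marker>.fasta from an organism string.
standardize_name() {  # standardize_name "<Organism>" <accession> <marker>
    genus_species=$(printf '%s' "$1" | awk '{print $1 "_" $2}')
    printf '%s_%s_%s.fasta\n' "$genus_species" "$2" "$3"
}
```

Usage: `standardize_name "Zingiber officinale" MN123456 matK` yields a name matching the convention regardless of which database the record came from.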
On any failure → route to debug.
reports/data-acquisition_YYYY-MM-DD.md — narrative report:
# Data Acquisition Report
Date: YYYY-MM-DD
## Data Landscape Survey
[Summary of what was found in GenBank and SRA: total records,
data types available, notable gaps or richly sampled lineages]
## Coverage Matrix
[Taxa × markers/data-type table]
## Sampling Plans Proposed
[Each plan: taxon count, marker count, missing data %, route, trade-off]
## Plan(s) Selected
[Which plan(s) the researcher chose and why]
## Search Queries Used
[Exact queries run against each database]
## Inclusion / Exclusion Decisions
[Sequences included or excluded with reasoning]
## Storage Assessment
- Available storage at start: [X GB]
- Estimated raw SRA data size: [X GB]
- Download mode selected: bulk / streaming — justification
## Streaming Mode Log (if applicable)
| SRR ID | Assembly output | Raw reads deleted | Notes |
|--------|----------------|-----------------|-------|
## Data Quality Notes
[Suspect sequences, misidentified taxa, truncated records flagged]
## Software Versions
[Entrez Direct version, SRA Toolkit version]
## Next Module
assembly (if SRA raw reads) / alignment (if assembled)
reports/data-acquisition_YYYY-MM-DD.tex — LaTeX accession table:
\begin{table}[h]
\caption{Sequence data used in this study}
\begin{tabular}{lllllll}
\hline
Taxon & Accession/Run & Database & Data type & Marker/Library & Length/Reads & Download date \\
\hline
Zingiber officinale & MN123456 & GenBank & Assembled & matK & 873 bp & 2026-04-13 \\
Curcuma longa & SRR9876543 & SRA & WGS & — & 4.2M reads & 2026-04-13 \\
\hline
\end{tabular}
\end{table}
One row per accession/run. Spans all plans — include a Plan column if multiple plans were run.
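The table body can be generated rather than typed. This is a sketch assuming a tab-separated manifest with the seven columns above (taxon, accession/run, database, data type, marker/library, length/reads, download date); the manifest file and `tsv_to_latex_rows` helper are assumptions, not module outputs.

```shell
# Emit LaTeX tabular rows from a 7-column tab-separated manifest;
# rows with the wrong column count are skipped rather than emitted broken.
tsv_to_latex_rows() {
    awk -F'\t' 'NF == 7 { printf "%s & %s & %s & %s & %s & %s & %s \\\\\n", $1, $2, $3, $4, $5, $6, $7 }' "$1"
}
```

Usage: `tsv_to_latex_rows manifest.tsv` produces the rows between the `\hline` markers; the surrounding `table`/`tabular` scaffolding stays hand-written.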
Pre-built scripts for this module are in scripts/data/. Load when needed:
| Script | Purpose |
|---|---|
| download_genbank.sh | Survey GenBank and download sequences per marker; enforces Genus_species_accession_marker.fasta naming |
| download_sra.sh | Storage-aware SRA download; auto-selects bulk vs. streaming mode; calls assembly script in streaming mode |
Usage examples:
# Survey only (no download)
bash scripts/data/download_genbank.sh \
-g "Zingiberaceae" -m "matK,rbcL,ITS" -o data/genbank -s
# Download GenBank sequences (≤300 per marker)
bash scripts/data/download_genbank.sh \
-g "Zingiberaceae" -m "matK,rbcL,ITS,psbA" -o data/genbank -n 300
# SRA download (auto mode — detects bulk vs. streaming)
bash scripts/data/download_sra.sh \
-l sra_list.txt -o data/raw
| Mistake | Fix |
|---|---|
| Downloading before surveying | Survey first — data landscape shapes which plan is even possible |
| Hardcoding a fixed number of taxa or markers | Let the coverage matrix reveal natural thresholds; don't pre-decide plan shapes |
| Mixing naming conventions from different databases | Standardize to Genus_species_accession_marker at download time |
| Treating missing data as a binary pass/fail | Estimate % per plan — some missing data is acceptable and expected |
| Downloading SRA reads without checking library strategy | Confirm WGS/genome skimming vs. amplicon vs. RNA-seq before routing to assembly |
| Forgetting to record Entrez Direct and SRA Toolkit versions | Run esearch -version and fasterq-dump --version; log both in report |
| Skipping storage estimation before bulk SRA download | Large WGS datasets can exceed hundreds of GB; always estimate first |
| Deleting raw reads before verifying assembly output | Check file exists and is non-empty before rm; a failed assembly with deleted reads cannot be recovered |
| Using bulk mode when storage is borderline | Apply the 1.5× buffer conservatively; streaming mode is safer and produces the same result |