Query and download biological sequence data from NCBI using the datasets and dataformat CLI tools. Use when downloading genomes, genes, or viral sequences, looking up gene/genome metadata, or converting NCBI data to TSV/Excel.
<essential_principles>
Two CLI tools work together:

- datasets -- queries and downloads data from NCBI (genomes, genes, viruses, taxonomy)
- dataformat -- converts JSON Lines metadata into TSV or Excel

Critical rules:

- Always pass --as-json-lines when piping datasets summary to dataformat. Without it, the pipe fails.
- Quote taxon names that contain spaces: 'Drosophila melanogaster'
- Accession prefixes: GCF_ = RefSeq, GCA_ = GenBank
- Pass --api-key for 10 req/sec.
- Downloads are ZIP packages; after unzipping, ncbi_dataset/data/ contains sequences and assembly_data_report.jsonl metadata.

Before any workflow, check that the CLI tools are installed:

    which datasets && which dataformat

If either is missing, read references/installation.md and walk the user through installation before proceeding. Do not skip this step.
</essential_principles>
<quick_start>
Genome metadata to TSV:

    datasets summary genome taxon 'Drosophila melanogaster' --reference --as-json-lines | \
      dataformat tsv genome --fields accession,assminfo-name,organism-name,annotinfo-name

Download a reference genome with annotations:

    datasets download genome taxon 'Drosophila melanogaster' --reference \
      --include genome,gff3,protein --filename dmel_ref.zip

Gene lookup:

    datasets summary gene symbol Ubx --taxon 'Drosophila melanogaster' --as-json-lines | \
      dataformat tsv gene --fields symbol,gene-id,description,gene-type,chromosomes,annotation-genomic-range-accession,annotation-genomic-range-range-start,annotation-genomic-range-range-stop,annotation-genomic-range-range-orientation,synonyms
</quick_start>
<intake>
What would you like to do?

1. Download genome, gene, or virus data
2. Look up metadata (summary, search)
3. Format output as TSV or Excel
4. Batch download (large or many genomes)
5. Download data for a Drosophila cytological band
6. Something else

Wait for response before proceeding.
</intake>
<routing>
| Response | Workflow |
|----------|----------|
| 1, "download", "get genome", "get gene" | `workflows/download.md` |
| 2, "look up", "summary", "metadata", "info", "search" | `workflows/summarize.md` |
| 3, "format", "tsv", "excel", "convert", "dataformat" | `workflows/format-output.md` |
| 4, "batch", "dehydrated", "large", "many genomes" | `workflows/batch-download.md` |
| 5, "cytoband", "cytological", "band", "polytene" | `workflows/cytoband-download.md` |
| 6, other | Clarify intent, then route |

Intent-based routing (if the user provides clear intent without selecting a menu option): go straight to the matching workflow, one of `workflows/download.md`, `workflows/summarize.md`, `workflows/format-output.md`, `workflows/batch-download.md`, or `workflows/cytoband-download.md`.

After reading the workflow, follow it exactly.
</routing>
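
The keyword routing above can be sketched as a small shell function. This is only an illustration: `route_intent` is a hypothetical helper, not part of the datasets CLI, and the choice to test the more specific keywords (cytoband, batch) before the generic "download" is an assumption about how ambiguous requests should be resolved.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a free-text request to a workflow file using the
# keywords from the routing table. Not part of the NCBI tooling.
route_intent() {
  local request
  # Lowercase the request so matching is case-insensitive.
  request=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$request" in
    # Assumption: specific intents win over the generic "download" keyword.
    *cytoband*|*cytological*|*band*|*polytene*)
      echo "workflows/cytoband-download.md" ;;
    *batch*|*dehydrated*|*large*|*"many genomes"*)
      echo "workflows/batch-download.md" ;;
    *format*|*tsv*|*excel*|*convert*|*dataformat*)
      echo "workflows/format-output.md" ;;
    *"look up"*|*summary*|*metadata*|*info*|*search*)
      echo "workflows/summarize.md" ;;
    *download*|*"get genome"*|*"get gene"*)
      echo "workflows/download.md" ;;
    *)
      # No keyword matched: ask the user to clarify before routing.
      echo "clarify" ;;
  esac
}
```

For example, `route_intent "Download the fly genome"` prints `workflows/download.md`, while a request with no recognised keyword prints `clarify`.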
<reference_index>
All domain knowledge in references/:
- CLI Reference: commands.md (full command tree, flags, subcommands)
- Field Names: fields.md (dataformat field names for --fields flag)
- Installation: installation.md (install methods for macOS, Linux, conda)
- Drosophila Cytobands: dmel-cytobands.tsv (593 letter-level bands) and dmel-cytobands-subdivisions.tsv (4,007 individual polytene bands) → R6 coordinates. Try subdivisions first.
</reference_index>
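
The pre-flight check from the essential principles, which falls back to installation.md when a tool is missing, might look like the sketch below. The function names `missing_tools` and `preflight` are hypothetical; only `command -v` and the two tool names come from standard shell and the text above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of any requested tools not on PATH.
missing_tools() {
  local tool
  local missing=""
  for tool in "$@"; do
    # command -v exits non-zero when the tool cannot be found.
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  # Strip the leading space; prints an empty line when nothing is missing.
  echo "${missing# }"
}

# Hypothetical wrapper: stop before any workflow if either CLI tool is absent.
preflight() {
  local missing
  missing=$(missing_tools datasets dataformat)
  if [ -n "$missing" ]; then
    echo "Missing: $missing. Read references/installation.md before continuing." >&2
    return 1
  fi
}
```

Calling `preflight || exit 1` at the top of a script would enforce the "do not skip this step" rule.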
<workflows_index>
| Workflow | Purpose |
|---|---|
| download.md | Download genome, gene, or virus data as ZIP packages |
| summarize.md | Query metadata without downloading (summary to stdout) |
| format-output.md | Convert JSON Lines to TSV or Excel with dataformat |
| batch-download.md | Large-scale downloads using dehydrated workflow |
| cytoband-download.md | Download GenBank data for a Drosophila cytological band |
</workflows_index>
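
As a rough sketch of what the dehydrated workflow in batch-download.md involves, the helper below prints the three commands rather than running them, so the plan can be reviewed first. `dehydrated_plan`, the accession file name, and the output directory are hypothetical; the workflow file itself remains the authority on exact flags.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) the steps of a dehydrated download.
# $1 = text file with one genome accession per line, $2 = output directory.
dehydrated_plan() {
  local accessions_file="$1"
  local out_dir="$2"
  local zip="$out_dir.zip"
  # Step 1: fetch a small package with metadata and file pointers only.
  echo "datasets download genome accession --inputfile $accessions_file --dehydrated --filename $zip"
  # Step 2: unpack the package.
  echo "unzip $zip -d $out_dir"
  # Step 3: download the actual sequence files into the unpacked directory.
  echo "datasets rehydrate --directory $out_dir"
}
```

Running `dehydrated_plan accessions.txt fly_genomes` lists the commands in order; piping the output to `sh` would execute them once they look right.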
<success_criteria>
ncbi_dataset/data/