Prepare protein datasets for training
Set up the environment:
source /home/yeopjin/orcd/pool/init_protein_llm.sh
List available datasets:
python src/data/download.py --dataset list
Download datasets (choose as needed):
# IPD PDB sample (recommended for training)
python src/data/download.py --dataset ipd_pdb_sample --output_dir ./data
# Swiss-Prot sequences
python src/data/download.py --dataset swissprot --output_dir ./data
# Mol-Instructions (HuggingFace)
python -c "from datasets import load_dataset; load_dataset('zjunlp/Mol-Instructions', 'Protein')"
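Once loaded, instruction datasets like Mol-Instructions are typically instruction/input/output triples. A minimal helper for rendering one record as a training prompt might look like this; the field names and prompt template here are assumptions for illustration, not the repo's actual schema:

```python
def format_instruction(record: dict) -> str:
    """Render an instruction-tuning record as a single prompt string.

    Assumes the record carries 'instruction', 'input', and 'output'
    keys, as instruction datasets commonly do; adjust to the actual
    schema of the downloaded split.
    """
    parts = [f"### Instruction:\n{record['instruction']}"]
    if record.get("input"):
        parts.append(f"### Input:\n{record['input']}")
    parts.append(f"### Response:\n{record['output']}")
    return "\n\n".join(parts)


example = {
    "instruction": "Describe the function of this protein.",
    "input": "MKTAYIAKQR",
    "output": "A short hypothetical description.",
}
print(format_instruction(example))
```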
Verify downloads:
ls -lh ./data/
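Beyond `ls`, a quick programmatic pass over `./data` can catch truncated downloads. A minimal sketch (the size threshold and warning format are illustrative, not part of the repo):

```python
import os

def summarize_downloads(root: str, min_bytes: int = 1) -> dict:
    """Walk `root` and map each file's relative path to its size in
    bytes, warning about suspiciously small (possibly truncated) files."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            sizes[os.path.relpath(path, root)] = os.path.getsize(path)
    small = [p for p, s in sizes.items() if s < min_bytes]
    if small:
        print(f"Warning: possibly truncated files: {small}")
    return sizes

sizes = summarize_downloads("./data")
for path, size in sorted(sizes.items()):
    print(f"{size:>12,d}  {path}")
```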
Test the dataloaders (in a Python session):
from src.data import get_pdb_dataloader
dl = get_pdb_dataloader("./data/pdb_2021aug02_sample", batch_size=4)
batch = next(iter(dl))
print(f"Sequences: {len(batch['sequence'])}")
print(f"Coords shape: {batch['coords'].shape}")
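The fixed `coords` shape above implies the loader pads variable-length structures to a common length when collating a batch. A NumPy sketch of that step, assuming zero-padding and a boolean validity mask (the real `get_pdb_dataloader` may use a different convention):

```python
import numpy as np

def pad_coords(coords_list, pad_value=0.0):
    """Stack per-structure (L_i, 3) coordinate arrays into one
    (B, L_max, 3) batch, returning a boolean mask that marks
    real (non-padded) positions."""
    max_len = max(c.shape[0] for c in coords_list)
    batch = np.full((len(coords_list), max_len, 3), pad_value, dtype=np.float32)
    mask = np.zeros((len(coords_list), max_len), dtype=bool)
    for i, c in enumerate(coords_list):
        batch[i, : c.shape[0]] = c
        mask[i, : c.shape[0]] = True
    return batch, mask

# Two toy structures of length 2 and 4:
batch, mask = pad_coords([np.zeros((2, 3)), np.ones((4, 3))])
print(batch.shape, mask.sum(axis=1))  # (2, 4, 3) [2 4]
```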
Preprocess for training (if needed):
python scripts/preprocess_data.py \
--input_dir ./data/pdb_2021aug02_sample \
--output_dir ./data/processed \
--max_length 512 \
--max_resolution 3.0
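The two flags suggest simple length and resolution filters: drop structures whose sequence exceeds the model context and those solved at worse (higher Å) than 3.0 Å resolution. A sketch of that selection logic; the record schema is an assumption, and the actual `preprocess_data.py` may do more (e.g. chain splitting or deduplication):

```python
def passes_filters(record: dict, max_length: int = 512,
                   max_resolution: float = 3.0) -> bool:
    """Keep a structure only if its sequence fits the model context
    and its crystallographic resolution is good enough (lower Å is
    better, so we keep values <= the cutoff)."""
    if len(record["sequence"]) > max_length:
        return False
    resolution = record.get("resolution")
    if resolution is None or resolution > max_resolution:
        return False
    return True

records = [
    {"sequence": "MKT" * 10, "resolution": 2.1},  # kept
    {"sequence": "A" * 600, "resolution": 1.5},   # too long
    {"sequence": "MKT", "resolution": 3.5},       # resolution too coarse
]
kept = [r for r in records if passes_filters(r)]
print(len(kept))  # 1
```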