Python Introduction for Bioinformatics
| Strength | What it means for bioinformatics |
|---|---|
| Readable syntax | Accessible to biologists without a CS background |
| Rich ecosystem | Libraries for sequence analysis, statistics, plotting, ML |
| Community | Large bioinformatics community; maintained packages |
| Interoperability | Easy to call external tools (BLAST, samtools, HMMER) |
| Rapid prototyping | Test a hypothesis quickly, then scale up |
Count nucleotides in a DNA string like "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC":
# Define the example sequence
dna = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"

# Using .count()
a_count = dna.count("A")
c_count = dna.count("C")
g_count = dna.count("G")
t_count = dna.count("T")
print(f" A: {a_count}")
print(f" C: {c_count}")
print(f" G: {g_count}")
print(f" T: {t_count}")
print(f" Total: {a_count + c_count + g_count + t_count}")
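The total printed above equals len(dna) only when the sequence contains nothing but A, C, G and T. As a minimal sketch of how that could be checked (the helper name `count_unexpected` and the short test string containing `N` are illustrative assumptions, not part of the original exercise):

```python
def count_unexpected(seq, alphabet="ACGT"):
    """Return how many characters in seq are not in the given alphabet."""
    # Sum the per-base counts, then compare with the full length.
    expected = sum(seq.count(base) for base in alphabet)
    return len(seq) - expected


dna = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
print(count_unexpected(dna))        # characters outside A, C, G, T in the example
print(count_unexpected("ACGTNNA"))  # hypothetical read containing N calls
```

A non-zero result suggests ambiguous bases (such as N), lower-case letters, or stray whitespace that the four `.count()` calls would silently skip.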
Key concepts:
- len(dna): number of characters in a string
- dna.count("A"): number of occurrences of a substring
- f"...": f-strings embed variables directly, e.g. f"GC = {gc:.2f}%"

# Using a dictionary
counts = {"A": 0, "C": 0, "G": 0, "T": 0}
for nucleotide in dna:
    if nucleotide in counts:
        counts[nucleotide] += 1
for nuc, count in counts.items():
    percentage = count / len(dna) * 100
    print(f" {nuc}: {count:>3} ({percentage:.1f}%)")
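The same tally can also be written with `collections.Counter` from the standard library. This is an alternative sketch, not the tutorial's own method. Note one difference: `Counter` counts every character it sees, so the loop below reads only the four bases explicitly.

```python
from collections import Counter

dna = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"

# Counter tallies every character in a single pass over the string.
counts = Counter(dna)

for nuc in "ACGT":
    # A missing key returns 0 instead of raising KeyError.
    count = counts[nuc]
    percentage = count / len(dna) * 100
    print(f" {nuc}: {count:>3} ({percentage:.1f}%)")
```

For short sequences the two approaches behave the same; `Counter` mainly saves the explicit initialisation and membership test.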
GC content is the fraction of nucleotides in a sequence that are G or C, expressed here as a percentage:
gc_count = dna.count("G") + dna.count("C")
gc_content = gc_count / len(dna) * 100
print(f"GC content: {gc_content:.2f}%")
if gc_content < 40:
    print("AT-rich (low GC).")
elif gc_content > 60:
    print("GC-rich.")
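The calculation above can be wrapped in reusable functions. The sketch below is one possible arrangement, not a definitive implementation. The names `gc_content` and `classify_gc`, the upper-casing step, the empty-sequence check, and the "moderate GC" label for values between the two thresholds are assumptions added for illustration. The 40% and 60% cut-offs come from the code above.

```python
def gc_content(seq):
    """Return the GC content of seq as a percentage between 0 and 100."""
    if not seq:
        # Assumption: refuse empty input rather than divide by zero.
        raise ValueError("sequence is empty")
    seq = seq.upper()  # assumption: treat lower-case bases the same way
    return (seq.count("G") + seq.count("C")) / len(seq) * 100


def classify_gc(percent):
    """Label a GC percentage using the 40% and 60% thresholds."""
    if percent < 40:
        return "AT-rich (low GC)"
    if percent > 60:
        return "GC-rich"
    return "moderate GC"  # assumed label for the 40-60% range


dna = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
gc = gc_content(dna)
print(f"GC content: {gc:.2f}% ({classify_gc(gc)})")
```

Returning the label instead of printing it keeps the classification easy to reuse, for example when looping over many sequences.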