Name: Exploratory Data Analysis
Author: wu-yc


User: "Analyze data.fastq"
→ Extension: .fastq
→ Category: bioinformatics_genomics
→ Format: FASTQ Format (sequence data with quality scores)
→ Reference: references/bioinformatics_genomics_formats.md
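The detection flow above can be sketched as a lookup from extension to category and reference file. This is a minimal illustration, not the logic of `scripts/eda_analyzer.py`: the function name `detect_file_type` and the table `EXTENSION_CATEGORIES` are hypothetical, the four mappings are the ones that appear in this document's examples, and the `references/<category>_formats.md` naming is assumed to hold for every category.

```python
from pathlib import Path

# Illustrative subset only: these four pairs come from the examples in this
# document; the real skill covers many more extensions.
EXTENSION_CATEGORIES = {
    ".fastq": "bioinformatics_genomics",
    ".pdb": "chemistry_molecular",
    ".csv": "general_scientific",
    ".nd2": "microscopy_imaging",
}


def detect_file_type(filepath):
    """Return (extension, category, reference_path); unknown types give None."""
    extension = Path(filepath).suffix.lower()
    category = EXTENSION_CATEGORIES.get(extension)
    # Assumption: every category follows the references/<category>_formats.md
    # naming seen for the two reference files named in this document.
    reference = f"references/{category}_formats.md" if category else None
    return extension, category, reference


print(detect_file_type("data.fastq"))
```

Note that `Path.suffix` returns only the last suffix, so a compressed file such as `reads.fastq.gz` would need extra handling.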

# The script automatically:
# 1. Detects file type
# 2. Loads reference information
# 3. Performs format-specific analysis
# 4. Generates markdown report

python scripts/eda_analyzer.py <filepath> [output.md]

### .pdb - Protein Data Bank
**Description:** Standard format for 3D structures of biological macromolecules
**Typical Data:** Atomic coordinates, residue information, secondary structure
**Use Cases:** Protein structure analysis, molecular visualization, docking
**Python Libraries:**
- `Biopython`: `Bio.PDB`
- `MDAnalysis`: `MDAnalysis.Universe('file.pdb')`
**EDA Approach:**
- Structure validation (bond lengths, angles)
- B-factor distribution
- Missing residues detection
- Ramachandran plots
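Two of the EDA steps listed for `.pdb` (B-factor distribution and missing residue detection) can be sketched without third-party libraries by reading the fixed-width `ATOM` records directly. This is a hedged illustration rather than the skill's own code: `summarize_pdb` is a hypothetical helper, it ignores `HETATM` records and insertion codes, and a gap in residue numbering is only a hint that residues may be missing. Structure validation and Ramachandran plots are better left to `Bio.PDB` or `MDAnalysis`.

```python
import statistics


def summarize_pdb(lines):
    """Summarize B-factors and residue-numbering gaps from PDB ATOM records."""
    b_factors = []
    residues = {}  # chain ID -> set of residue sequence numbers
    for line in lines:
        if not line.startswith("ATOM"):
            continue
        chain = line[21]                       # column 22: chain identifier
        res_seq = int(line[22:26])             # columns 23-26: residue number
        b_factors.append(float(line[60:66]))   # columns 61-66: B-factor
        residues.setdefault(chain, set()).add(res_seq)

    # Residue numbers absent between the first and last residue of each chain.
    gaps = {}
    for chain, numbers in residues.items():
        expected = set(range(min(numbers), max(numbers) + 1))
        gaps[chain] = sorted(expected - numbers)

    return {
        "atom_count": len(b_factors),
        "b_factor_mean": statistics.mean(b_factors) if b_factors else None,
        "b_factor_max": max(b_factors, default=None),
        "numbering_gaps": gaps,
    }
```

A file can be passed directly as an iterable of lines, for example `with open('file.pdb') as handle: summary = summarize_pdb(handle)`.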

Search by extension: Use grep or a regular expression to locate the section for the specific format

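import re

with open('references/chemistry_molecular_formats.md', 'r') as f:
    content = f.read()

# Match the "### .pdb" section up to the next "### " heading or end of file.
# \b stops longer extensions that merely start with ".pdb" from matching.
pattern = r'^### \.pdb\b.*?(?=^### |\Z)'
match = re.search(pattern, content, re.IGNORECASE | re.DOTALL | re.MULTILINE)
section = match.group(0) if match else None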

Extract relevant sections: Don't load entire reference files into context unnecessarily
Cache format info: If analyzing multiple files of the same type, reuse the format information
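One way to follow both practices is to extract only the matching section and memoize the result. The sketch below assumes reference entries start with a `### <extension>` heading, as in the `.pdb` entry shown earlier; `load_format_section` is a hypothetical helper name, not part of the skill's scripts.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def load_format_section(reference_path, extension):
    """Return the '### <extension>' section of a reference file, or None."""
    with open(reference_path, "r", encoding="utf-8") as f:
        content = f.read()
    # Lazily match from the heading to the next "### " heading or end of file.
    pattern = rf"^### {re.escape(extension)}\b.*?(?=^### |\Z)"
    match = re.search(pattern, content, re.IGNORECASE | re.DOTALL | re.MULTILINE)
    return match.group(0).strip() if match else None
```

Repeated calls with the same path and extension return the cached string without rereading the file. The cache does not notice edits to the reference file, so `load_format_section.cache_clear()` would be needed after changing it.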

# User provides: "Analyze reads.fastq"

# 1. Detect file type
extension = '.fastq'
category = 'bioinformatics_genomics'

# 2. Read reference info
# Search references/bioinformatics_genomics_formats.md for "### .fastq"

# 3. Perform analysis
from Bio import SeqIO
sequences = list(SeqIO.parse('reads.fastq', 'fastq'))
# Calculate: read count, length distribution, quality scores, GC content

# 4. Generate report
# Include: format description, analysis results, QC recommendations

# 5. Save as: reads_eda_report.md
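The metrics named in step 3 (read count, length distribution, quality scores, GC content) can be illustrated with a dependency-free sketch. It assumes plain four-line FASTQ records and Phred+33 quality encoding, and `fastq_summary` is a hypothetical helper; it is not a substitute for the Biopython parsing used in the example above.

```python
import io
import statistics


def fastq_summary(handle):
    """Summarize four-line FASTQ records read from a text handle."""
    lengths, mean_quals = [], []
    gc_bases = total_bases = 0
    while True:
        header = handle.readline()
        if not header:
            break                      # end of file
        if not header.strip():
            continue                   # skip blank lines between records
        seq = handle.readline().strip()
        handle.readline()              # '+' separator line
        qual = handle.readline().strip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"Malformed FASTQ record: {header.strip()!r}")
        lengths.append(len(seq))
        # Assumption: Phred+33 encoding (quality = ASCII code - 33).
        if qual:
            mean_quals.append(statistics.mean(ord(c) - 33 for c in qual))
        gc_bases += sum(seq.upper().count(base) for base in "GC")
        total_bases += len(seq)
    return {
        "read_count": len(lengths),
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
        "mean_length": statistics.mean(lengths) if lengths else 0,
        "mean_quality": statistics.mean(mean_quals) if mean_quals else None,
        "gc_percent": 100 * gc_bases / total_bases if total_bases else None,
    }


# Tiny in-memory example with two made-up reads.
sample = io.StringIO("@r1\nGATTACA\n+\nIIIIIII\n@r2\nGGCC\n+\n!!!!\n")
print(fastq_summary(sample))
```

With Biopython records, the per-read equivalents are `len(record)` and `record.letter_annotations["phred_quality"]`. For large files, iterating over `SeqIO.parse(...)` instead of wrapping it in `list(...)` avoids holding every read in memory.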

# User provides: "Explore experiment_results.csv"

# 1. Detect: .csv → general_scientific

# 2. Load reference for CSV format

# 3. Analyze
import pandas as pd
df = pd.read_csv('experiment_results.csv')
# Dimensions, dtypes, missing values, statistics, correlations

# 4. Generate report with:
# - Data structure
# - Missing value patterns
# - Statistical summaries
# - Correlation matrix
# - Outlier detection results

# 5. Save report
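The report items listed in step 4 might be computed along the following lines. This is a sketch under stated assumptions: `summarize_dataframe` is a hypothetical helper, and the 1.5 × IQR rule is one common outlier heuristic chosen here for illustration; the source does not say which outlier method the skill uses.

```python
import io

import pandas as pd


def summarize_dataframe(df):
    """Collect structure, missing values, statistics, and outlier counts."""
    numeric = df.select_dtypes(include="number")
    # Assumption: flag values more than 1.5 * IQR outside the Q1..Q3 range.
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    too_low = numeric.lt(q1 - 1.5 * iqr, axis="columns")
    too_high = numeric.gt(q3 + 1.5 * iqr, axis="columns")
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
        "statistics": numeric.describe(),
        "correlations": numeric.corr(),
        "outlier_counts": (too_low | too_high).sum().to_dict(),
    }


# Small made-up table standing in for experiment_results.csv.
csv_text = "a,b,label\n1,10,x\n2,20,y\n3,,z\n4,40,x\n100,50,y\n"
summary = summarize_dataframe(pd.read_csv(io.StringIO(csv_text)))
print(summary["missing"], summary["outlier_counts"])
```

Missing values compare as `False` in both bounds checks, so they are never counted as outliers.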

# User provides: "Analyze cells.nd2"

# 1. Detect: .nd2 → microscopy_imaging (Nikon format)

# 2. Read reference for ND2 format
# Learn: multi-dimensional (XYZCT), requires nd2reader

# 3. Analyze
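from nd2reader import ND2Reader
with ND2Reader('cells.nd2') as images:
    # Extract: dimensions, channels, timepoints, metadata
    sizes = images.sizes        # axis sizes, e.g. x, y, z, c, t
    metadata = images.metadata  # includes channel names and pixel size
    # Calculate: intensity statistics, frame info
    first_frame = images[0]
    intensity_stats = (first_frame.min(), first_frame.max(), first_frame.mean())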

# 4. Generate report with:
# - Image dimensions (XY, Z-stacks, time, channels)
# - Channel wavelengths
# - Pixel size and calibration
# - Recommendations for image analysis

# 5. Save report
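The final two steps shared by all three examples (generate the report, then save it) can be sketched as below. The `<name>_eda_report.md` naming follows the FASTQ example; the section headings and the helper names `report_path_for` and `build_report` are assumptions for illustration, since the skill's actual report template is not reproduced here.

```python
from pathlib import Path


def report_path_for(data_file):
    """Derive '<stem>_eda_report.md' next to the input file."""
    path = Path(data_file)
    return path.with_name(f"{path.stem}_eda_report.md")


def build_report(data_file, format_description, results, recommendations):
    """Assemble a small markdown report from already-computed values."""
    lines = [
        f"# EDA Report: {Path(data_file).name}",
        "",
        "## Format",
        "",
        format_description,
        "",
        "## Analysis Results",
        "",
    ]
    lines += [f"- {key}: {value}" for key, value in results.items()]
    lines += ["", "## Recommendations", ""]
    lines += [f"- {item}" for item in recommendations]
    return "\n".join(lines) + "\n"


# Placeholder values only; real numbers come from the analysis step.
report = build_report(
    "reads.fastq",
    "FASTQ Format (sequence data with quality scores)",
    {"read_count": 2},
    ["Inspect per-base quality before alignment"],
)
print(report_path_for("reads.fastq"))
print(report)
```

Writing the file is then a single call such as `report_path_for('reads.fastq').write_text(report, encoding='utf-8')`.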


Overview

When to Use This Skill

Supported File Categories

1. Chemistry and Molecular Formats (60+ extensions)

2. Bioinformatics and Genomics Formats (50+ extensions)

3. Microscopy and Imaging Formats (45+ extensions)

4. Spectroscopy and Analytical Chemistry Formats (35+ extensions)

5. Proteomics and Metabolomics Formats (30+ extensions)

6. General Scientific Data Formats (30+ extensions)

Workflow

Step 1: File Type Detection

Step 2: Load Format-Specific Information

Step 3: Perform Data Analysis

Step 4: Generate Comprehensive Report

Required Sections:

Template Location

Step 5: Save Report

Detailed Format References

Reference File Structure

Best Practices

Reading Reference Files

Data Analysis

Report Generation

Examples

Example 1: Analyzing a FASTQ file

Example 2: Analyzing a CSV dataset

Example 3: Analyzing microscopy data

Troubleshooting

Missing Libraries
