Process CRISPR screening data to identify essential genes and hit candidates. Performs quality control, statistical analysis (RRA), and hit calling for pooled CRISPR screens including viability screens and drug resistance/sensitivity studies.
Analyze pooled CRISPR screening data to identify essential genes, drug resistance/sensitivity candidates, and screen quality metrics. Supports Robust Rank Aggregation (RRA) analysis, quality control assessment, and hit identification for functional genomics studies.
Key Capabilities:
✅ Use this skill when:
❌ Do NOT use when:
Related Skills:
- Upstream: crispr-grna-designer, fastqc-report-interpreter
- Downstream: go-kegg-enrichment, pathway-visualization, hit-validation-planner

Upstream Skills:
- crispr-grna-designer: Design sgRNA libraries before screening; validate library composition
- fastqc-report-interpreter: Assess sequencing quality before CRISPR screen analysis
- alignment-quality-checker: Verify sgRNA alignment rates and mapping quality

Downstream Skills:
- go-kegg-enrichment: Perform pathway enrichment on identified hit genes
- pathway-visualization: Visualize hits in pathway contexts
- hit-validation-planner: Design follow-up experiments for candidate genes
- gene-essentiality-predictor: Compare screen results with known essential gene databases

Complete Workflow:
Library Design (crispr-grna-designer) → Transduction → Sequencing → fastqc-report-interpreter → crispr-screen-analyzer → go-kegg-enrichment → Hit Validation
Assess CRISPR screen quality using established metrics including Gini index, read depth, and sgRNA dropout rates.
from scripts.main import CRISPRScreenAnalyzer
# Initialize analyzer with count matrix and sample annotations
analyzer = CRISPRScreenAnalyzer(
    counts_file="sgrna_counts.txt",
    samplesheet="samples.csv"
)
# Calculate QC metrics
qc_results = analyzer.qc_metrics()
# Review key metrics
print("Quality Control Metrics:")
print(f"Total reads per sample:")
for sample, reads in qc_results['total_reads'].items():
    print(f" {sample}: {reads:,} reads")
print(f"\nGini index (library representation):")
for sample, gini in qc_results['gini_index'].items():
    status = "✅ Good" if gini < 0.3 else "⚠️ Check" if gini < 0.4 else "❌ Poor"
    print(f" {sample}: {gini:.3f} {status}")
print(f"\nZero-count sgRNAs (potential dropout):")
for sample, zeros in qc_results['zero_count_sgrnas'].items():
    pct = (zeros / len(analyzer.counts)) * 100
    print(f" {sample}: {zeros} ({pct:.1f}%)")
QC Metrics Explained:
| Metric | Target Range | Interpretation |
|---|---|---|
| Gini Index | <0.3 | Measures library evenness; lower = more uniform |
| Total Reads | >10M per sample | Sufficient depth for statistical power |
| Zero-count sgRNAs | <5% | Acceptable dropout; higher indicates library loss |
| Read Distribution | Log-normal | Should follow expected distribution |
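To make the Gini index in the table above concrete, here is a minimal sketch that computes it from one sample's raw sgRNA counts with the sorted-rank formula. The function name `gini_index` and the example counts are assumptions for illustration only; the analyzer's own `qc_metrics()` may compute it differently (some tools, for example, work on log-transformed counts).

```python
def gini_index(counts):
    """Gini index of non-negative counts (0 = perfectly even library)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Sorted-rank formula: G = 2 * sum(rank * x) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical counts for two tiny four-sgRNA libraries
even = [100, 100, 100, 100]   # uniform representation
skewed = [0, 0, 0, 400]       # a single sgRNA holds every read
print(f"even:   {gini_index(even):.3f}")
print(f"skewed: {gini_index(skewed):.3f}")
```

A perfectly even library scores 0, and the score rises toward 1 as reads concentrate in fewer sgRNAs, which is why lower values are preferred.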
Best Practices:
Common Issues and Solutions:
Issue: High Gini index (>0.4)
Issue: Excessive zero-count sgRNAs (>10%)
Calculate log2 fold changes between treatment and control conditions to identify enriched or depleted sgRNAs.
from scripts.main import CRISPRScreenAnalyzer
analyzer = CRISPRScreenAnalyzer("counts.txt", "samples.csv")
# Define sample groups
control_samples = ["Control_1", "Control_2", "Control_3"]
treatment_samples = ["Drug_1", "Drug_2", "Drug_3"]
# Calculate log fold changes
lfc = analyzer.calculate_lfc(control_samples, treatment_samples)
# Analyze distribution
print("Log Fold Change Statistics:")
print(f" Mean: {lfc.mean():.3f}")
print(f" Std: {lfc.std():.3f}")
print(f" Max: {lfc.max():.3f}")
print(f" Min: {lfc.min():.3f}")
# Identify extreme changes
strong_depletion = lfc[lfc < -2] # Strong negative selection
strong_enrichment = lfc[lfc > 2] # Strong positive selection
print(f"\nStrongly depleted sgRNAs: {len(strong_depletion)}")
print(f"Strongly enriched sgRNAs: {len(strong_enrichment)}")
LFC Calculation:
lfc = log2((treatment_mean + 1) / (control_mean + 1))
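The formula above can be applied to a single sgRNA as in the sketch below, which uses a pseudocount of 1 on the replicate means. The helper `log2_fold_change` and the counts are hypothetical. The sketch also assumes the counts have already been normalized for sequencing depth, since the source does not say whether `calculate_lfc` does this internally.

```python
import math


def log2_fold_change(control_counts, treatment_counts, pseudocount=1.0):
    """log2((treatment_mean + pseudocount) / (control_mean + pseudocount))."""
    control_mean = sum(control_counts) / len(control_counts)
    treatment_mean = sum(treatment_counts) / len(treatment_counts)
    return math.log2((treatment_mean + pseudocount) / (control_mean + pseudocount))


# Hypothetical depth-normalized counts for one sgRNA, three replicates per group
depleted = log2_fold_change([300, 310, 290], [70, 80, 72])
unchanged = log2_fold_change([300, 310, 290], [300, 310, 290])
print(f"depleted sgRNA LFC:  {depleted:.2f}")
print(f"unchanged sgRNA LFC: {unchanged:.2f}")
```

The pseudocount keeps the ratio defined when an sgRNA has zero reads in one group, at the cost of shrinking fold changes for low-count sgRNAs.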
Interpretation:
| LFC Range | Interpretation | Biological Meaning |
|---|---|---|
| LFC < -2 | Strong depletion | Essential gene or drug sensitivity |
| LFC -2 to -1 | Moderate depletion | Moderate effect |
| LFC -1 to 1 | No change | No significant effect |
| LFC 1 to 2 | Moderate enrichment | Moderate resistance |
| LFC > 2 | Strong enrichment | Resistance gene or suppressor |
Best Practices:
Common Issues and Solutions:
Issue: Skewed LFC distribution
Issue: Extreme outliers
Identify significantly enriched or depleted sgRNAs. As described in this section, the analysis converts each sgRNA's LFC to a z-score, derives a p-value from it, and applies FDR correction for multiple testing.
from scripts.main import CRISPRScreenAnalyzer
analyzer = CRISPRScreenAnalyzer("counts.txt", "samples.csv")
# Calculate LFC first
lfc = analyzer.calculate_lfc(
    control_samples=["Ctrl_1", "Ctrl_2"],
    treatment_samples=["Treat_1", "Treat_2"]
)
# Perform RRA analysis
results = analyzer.rra_analysis(lfc, fdr_threshold=0.05)
# Review top hits
print("Top 10 Most Significant sgRNAs:")
top_hits = results.nsmallest(10, 'fdr')
print(top_hits[['sgrna', 'lfc', 'pvalue', 'fdr']].to_string(index=False))
# Summary statistics
print(f"\nTotal sgRNAs tested: {len(results)}")
print(f"Significant at FDR < 0.05: {sum(results['fdr'] < 0.05)}")
print(f"Significant depletions: {sum((results['fdr'] < 0.05) & (results['lfc'] < 0))}")
print(f"Significant enrichments: {sum((results['fdr'] < 0.05) & (results['lfc'] > 0))}")
RRA Analysis Steps:
z = (lfc - mean) / std
Statistical Output:
| Column | Description | Usage |
|---|---|---|
| sgrna | sgRNA identifier | Mapping to genes |
| lfc | Log fold change | Effect size |
| pvalue | Raw p-value | Statistical significance |
| fdr | Adjusted p-value (FDR) | Multiple testing correction |
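The z-score step and the `pvalue` and `fdr` columns described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the analyzer's `rra_analysis` code: it assumes a two-sided test against a standard normal null and Benjamini-Hochberg adjustment, and it follows the z-score description in this section rather than a rank-aggregation procedure. `zscore_fdr` and the LFC values are hypothetical, and `statistics.NormalDist` requires Python 3.8 or newer.

```python
from statistics import NormalDist, mean, stdev


def zscore_fdr(lfcs):
    """Return (z-scores, two-sided p-values, Benjamini-Hochberg FDRs)."""
    mu, sigma = mean(lfcs), stdev(lfcs)
    z = [(x - mu) / sigma for x in lfcs]
    # Two-sided p-value under a standard normal null
    p = [2 * (1 - NormalDist().cdf(abs(v))) for v in z]
    # Benjamini-Hochberg: scale each p by n / rank, then enforce
    # monotonicity by walking from the largest p-value down
    n = len(p)
    order = sorted(range(n), key=lambda i: p[i])
    fdr = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p[i] * n / rank)
        fdr[i] = running_min
    return z, p, fdr


# Hypothetical LFCs: nine near-neutral sgRNAs and one strongly depleted sgRNA
lfcs = [0.1, -0.2, 0.05, -0.1, 0.15, -3.5, 0.0, 0.2, -0.05, 0.1]
z, p, fdr = zscore_fdr(lfcs)
for x, zi, pi, qi in zip(lfcs, z, p, fdr):
    print(f"lfc={x:+.2f}  z={zi:+.2f}  p={pi:.4f}  fdr={qi:.4f}")
```

Because the mean and standard deviation are estimated from all sgRNAs, a screen with many true hits inflates `std` and weakens every z-score; estimating the null from non-targeting controls is a common alternative.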
Best Practices:
Common Issues and Solutions:
Issue: No significant hits despite visible effects
Issue: Too many significant hits
Apply statistical and biological thresholds to identify candidate genes for follow-up validation.
from scripts.main import CRISPRScreenAnalyzer
analyzer = CRISPRScreenAnalyzer("counts.txt", "samples.csv")
lfc = analyzer.calculate_lfc(["Ctrl_1", "Ctrl_2"], ["Treat_1", "Treat_2"])
results = analyzer.rra_analysis(lfc)
# Identify hits with multiple thresholds
threshold_configs = [
    {"fdr": 0.05, "lfc": 1.0, "name": "Standard"},
    {"fdr": 0.01, "lfc": 1.5, "name": "Stringent"},
    {"fdr": 0.1, "lfc": 0.5, "name": "Permissive"}
]
for config in threshold_configs:
    hits = analyzer.identify_hits(
        results,
        fdr_threshold=config['fdr'],
        lfc_threshold=config['lfc']
    )
    depletions = hits[hits['lfc'] < 0]
    enrichments = hits[hits['lfc'] > 0]
    print(f"\n{config['name']} (FDR<{config['fdr']}, |LFC|>{config['lfc']}):")
    print(f" Total hits: {len(hits)}")
    print(f" Depletions: {len(depletions)}")
    print(f" Enrichments: {len(enrichments)}")
# Save hits for downstream analysis
standard_hits = analyzer.identify_hits(results, fdr_threshold=0.05, lfc_threshold=1.0)
standard_hits.to_csv("hits_standard.csv", index=False)
Hit Classification:
| Category | Criteria | Biological Interpretation |
|---|---|---|
| Essential | FDR<0.05, LFC<-1 | Required for cell viability |
| Drug Sensitive | FDR<0.05, LFC<-1 | Synthetic lethal with treatment |
| Drug Resistant | FDR<0.05, LFC>1 | Confers resistance to treatment |
| Suppressor | FDR<0.05, LFC>1 | Suppresses phenotype of interest |
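The criteria in the table reduce to two checks per sgRNA: an FDR cutoff and a signed LFC cutoff. The sketch below applies them to a few rows. `classify_sgrna` and the example values are hypothetical and are not part of the analyzer's API. Note that "Essential" and "Drug Sensitive" share the same numeric criteria, as do "Drug Resistant" and "Suppressor"; which label applies depends on the screen design, so the sketch only reports the direction.

```python
def classify_sgrna(lfc, fdr, fdr_threshold=0.05, lfc_threshold=1.0):
    """Label one sgRNA as 'depleted', 'enriched', or 'not_significant'."""
    if fdr >= fdr_threshold or abs(lfc) <= lfc_threshold:
        return "not_significant"
    return "depleted" if lfc < 0 else "enriched"


# Hypothetical (sgrna, lfc, fdr) rows
rows = [
    ("sg1", -2.4, 0.001),  # passes both thresholds, negative direction
    ("sg2", 1.8, 0.03),    # passes both thresholds, positive direction
    ("sg3", -1.5, 0.20),   # large effect but fails the FDR cutoff
    ("sg4", 0.4, 0.01),    # significant but effect size too small
]
labels = {name: classify_sgrna(lfc, fdr) for name, lfc, fdr in rows}
print(labels)
```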
Best Practices:
Common Issues and Solutions:
Issue: Single sgRNA hits
Issue: Off-target effects dominating
Aggregate sgRNA-level results to gene-level statistics for biological interpretation.
import pandas as pd
from scripts.main import CRISPRScreenAnalyzer
analyzer = CRISPRScreenAnalyzer("counts.txt", "samples.csv")
lfc = analyzer.calculate_lfc(["Ctrl_1", "Ctrl_2"], ["Treat_1", "Treat_2"])
results = analyzer.rra_analysis(lfc)
# Add gene annotations (example mapping)
sgrna_to_gene = pd.read_csv("library_annotation.csv") # sgRNA, Gene columns
results_with_gene = results.merge(sgrna_to_gene, on='sgrna')
# Aggregate to gene level
gene_results = results_with_gene.groupby('Gene').agg({
    'lfc': 'mean',      # Average LFC across sgRNAs
    'pvalue': 'min',    # Best p-value
    'fdr': 'min',       # Best FDR
    'sgrna': 'count'    # Number of sgRNAs
}).rename(columns={'sgrna': 'sgrna_count'})
# Filter genes with multiple sgRNAs
gene_results = gene_results[gene_results['sgrna_count'] >= 2]
# Identify gene-level hits
gene_hits = gene_results[
    (gene_results['fdr'] < 0.05) &
    (abs(gene_results['lfc']) > 1.0)
]
print(f"Gene-level hits: {len(gene_hits)}")
print("\nTop 10 hits:")
print(gene_hits.nsmallest(10, 'fdr')[['lfc', 'pvalue', 'fdr', 'sgrna_count']])
Gene Aggregation Methods:
| Method | Description | Best For |
|---|---|---|
| Mean LFC | Average across sgRNAs | General hit calling |
| Best FDR | Most significant sgRNA | Conservative approach |
| Second-best | Second most significant | Reduces outlier effects |
| STARS/RRA | Rank-based aggregation | Standard CRISPR analysis |
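To show how the first three methods in the table can disagree, the following sketch computes the mean LFC, the best p-value, and the second-best p-value per gene. It is a standard-library illustration; `aggregate_by_gene`, the gene names, and the numbers are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_gene(rows, min_sgrnas=2):
    """rows: iterable of (gene, lfc, pvalue). Returns {gene: summary dict}."""
    grouped = defaultdict(list)
    for gene, lfc, pvalue in rows:
        grouped[gene].append((lfc, pvalue))
    summary = {}
    for gene, guides in grouped.items():
        if len(guides) < min_sgrnas:
            continue  # skip genes supported by too few sgRNAs
        pvalues = sorted(p for _, p in guides)
        summary[gene] = {
            "mean_lfc": mean(lfc for lfc, _ in guides),
            "best_p": pvalues[0],
            # Fall back to the best p-value when only one sgRNA is present
            "second_best_p": pvalues[1] if len(pvalues) > 1 else pvalues[0],
            "n_sgrnas": len(guides),
        }
    return summary


# Hypothetical sgRNA-level rows: GENE_A is consistent, GENE_B has one outlier guide
rows = [
    ("GENE_A", -2.0, 0.001), ("GENE_A", -1.0, 0.01), ("GENE_A", -3.0, 0.0001),
    ("GENE_B", -3.0, 0.0001), ("GENE_B", 0.1, 0.8), ("GENE_B", -0.1, 0.9),
    ("GENE_C", -2.5, 0.002),
]
gene_summary = aggregate_by_gene(rows)
for gene, stats in gene_summary.items():
    print(gene, stats)
```

In this example both genes share the same best p-value, but the second-best p-value separates the consistent gene from the one driven by a single sgRNA.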
Best Practices:
Common Issues and Solutions:
Issue: Discordant sgRNAs for same gene
Compare CRISPR screen results across multiple treatment conditions or time points.
from scripts.main import CRISPRScreenAnalyzer
analyzer = CRISPRScreenAnalyzer("counts.txt", "samples.csv")
# Define multiple comparisons
comparisons = {
    "Drug_A": {
        "control": ["DMSO_1", "DMSO_2"],
        "treatment": ["DrugA_1", "DrugA_2"]
    },
    "Drug_B": {
        "control": ["DMSO_1", "DMSO_2"],
        "treatment": ["DrugB_1", "DrugB_2"]
    },
    "Combination": {
        "control": ["DMSO_1", "DMSO_2"],
        "treatment": ["Combo_1", "Combo_2"]
    }
}
# Analyze all conditions
all_results = {}
for comp_name, samples in comparisons.items():
    lfc = analyzer.calculate_lfc(samples['control'], samples['treatment'])
    results = analyzer.rra_analysis(lfc)
    hits = analyzer.identify_hits(results)
    all_results[comp_name] = {
        'lfc': lfc,
        'results': results,
        'hits': hits
    }
    print(f"{comp_name}: {len(hits)} hits")
# Find common hits across conditions
# (compare sgRNA identifiers rather than row positions)
common_hits = set(all_results['Drug_A']['hits']['sgrna'])
for comp in ['Drug_B', 'Combination']:
    common_hits &= set(all_results[comp]['hits']['sgrna'])
print(f"\nCommon hits across all conditions: {len(common_hits)}")
# Compare LFC correlations between conditions
lfc_drugA = all_results['Drug_A']['lfc']
lfc_drugB = all_results['Drug_B']['lfc']
correlation = lfc_drugA.corr(lfc_drugB)
print(f"\nCorrelation between Drug A and Drug B: {correlation:.3f}")
Multi-Condition Analysis:
| Comparison Type | Question Addressed | Interpretation |
|---|---|---|
| Drug vs Control | What genes mediate drug response? | Resistance/sensitivity mechanisms |
| Condition A vs B | Differential genetic dependencies | Context-specific essentiality |
| Time-course | How does genetic dependency change? | Temporal dynamics |
| Cell line comparison | Cell-type specific dependencies | Lineage-specific vulnerabilities |
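For the comparison types above, a first-pass summary is often which hits are shared by every condition and which are unique to one. The sketch below does this with plain sets. `compare_hit_sets` and the hit identifiers are hypothetical; in practice the sets would come from the `sgrna` (or gene) column of each condition's hit table.

```python
def compare_hit_sets(hits_by_condition):
    """hits_by_condition: {condition: set of sgRNA or gene IDs}.

    Returns (hits common to all conditions, {condition: hits unique to it}).
    """
    all_sets = list(hits_by_condition.values())
    common = set.intersection(*all_sets) if all_sets else set()
    specific = {}
    for condition, hits in hits_by_condition.items():
        # Union of the hits from every other condition
        others = set().union(
            *(h for c, h in hits_by_condition.items() if c != condition)
        )
        specific[condition] = hits - others
    return common, specific


# Hypothetical hit sets for three comparisons
hits_by_condition = {
    "Drug_A": {"G1", "G2", "G3"},
    "Drug_B": {"G2", "G3", "G4"},
    "Combination": {"G3", "G5"},
}
common, specific = compare_hit_sets(hits_by_condition)
print("common:", sorted(common))
for condition, hits in specific.items():
    print(f"{condition} only:", sorted(hits))
```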
Best Practices:
Common Issues and Solutions:
Issue: High variability between replicates
From count matrix to hit identification:
# Step 1: Run QC assessment
python scripts/main.py --counts sgrna_counts.txt --samples samples.csv --output qc_results
# Step 2: Perform differential analysis
python scripts/main.py \
--counts sgrna_counts.txt \
--samples samples.csv \
--control "Ctrl_1,Ctrl_2,Ctrl_3" \
--treatment "Drug_1,Drug_2,Drug_3" \
--output drug_screen \
--fdr 0.05
# Step 3: Review results
head -20 drug_screen_sgrna_results.csv
Python API Usage:
from scripts.main import CRISPRScreenAnalyzer
import pandas as pd
def analyze_crispr_screen(
    counts_file: str,
    samplesheet: str,
    control_samples: list,
    treatment_samples: list,
    output_prefix: str,
    fdr_threshold: float = 0.05,
    lfc_threshold: float = 1.0
) -> dict:
    """
    Complete CRISPR screen analysis workflow.
    """
    # Initialize analyzer
    analyzer = CRISPRScreenAnalyzer(counts_file, samplesheet)
    print(f"Loaded {analyzer.counts.shape[0]} sgRNAs x {analyzer.counts.shape[1]} samples")
    # Quality control
    print("\n1. Quality Control Assessment...")
    qc = analyzer.qc_metrics()
    # Check QC status
    qc_pass = all(gini < 0.4 for gini in qc['gini_index'].values())
    if not qc_pass:
        print("⚠️ Warning: High Gini index detected - check library representation")
    # Calculate fold changes
    print("\n2. Calculating log fold changes...")
    lfc = analyzer.calculate_lfc(control_samples, treatment_samples)
    # Statistical analysis
    print("\n3. Running RRA analysis...")
    results = analyzer.rra_analysis(lfc, fdr_threshold)
    # Identify hits
    print("\n4. Identifying significant hits...")
    hits = analyzer.identify_hits(results, fdr_threshold, lfc_threshold)
    # Categorize hits
    depletions = hits[hits['lfc'] < 0]
    enrichments = hits[hits['lfc'] > 0]
    # Save results
    results.to_csv(f"{output_prefix}_sgrna_results.csv", index=False)
    hits.to_csv(f"{output_prefix}_hits.csv", index=False)
    # Compile summary
    summary = {
        'total_sgrnas': len(results),
        'significant_hits': len(hits),
        'depletions': len(depletions),
        'enrichments': len(enrichments),
        'qc_metrics': qc,
        'output_files': {
            'full_results': f"{output_prefix}_sgrna_results.csv",
            'hits': f"{output_prefix}_hits.csv"
        }
    }
    # Print summary
    print(f"\n{'='*60}")
    print("ANALYSIS SUMMARY")
    print(f"{'='*60}")
    print(f"Total sgRNAs: {summary['total_sgrnas']}")
    print(f"Significant hits (FDR<{fdr_threshold}, |LFC|>{lfc_threshold}): {summary['significant_hits']}")
    print(f" - Depletions: {summary['depletions']}")
    print(f" - Enrichments: {summary['enrichments']}")
    print(f"\nResults saved:")
    print(f" - {summary['output_files']['full_results']}")
    print(f" - {summary['output_files']['hits']}")
    print(f"{'='*60}")
    return summary
# Execute workflow
results = analyze_crispr_screen(
    counts_file="sgrna_counts.txt",
    samplesheet="samples.csv",
    control_samples=["Ctrl_1", "Ctrl_2", "Ctrl_3"],
    treatment_samples=["Drug_1", "Drug_2", "Drug_3"],
    output_prefix="drug_resistance_screen",
    fdr_threshold=0.05,
    lfc_threshold=1.0
)
Expected Output Files:
analysis_results/
├── drug_resistance_screen_sgrna_results.csv # All sgRNA statistics
├── drug_resistance_screen_hits.csv # Significant hits only
└── qc_report.txt # Quality control summary
Scenario: Identify genes essential for cell survival by comparing T0 (transduction) vs T14 (14 days post-transduction).
{
"screen_type": "viability",
"comparison": "T14_vs_T0",
"expected_depletions": "Essential genes (ribosomal, splicing, etc.)",
"expected_enrichments": "None (unless suppressors of toxicity)",
"positive_controls": ["RPL30", "RPS19", "PCNA"],
"negative_controls": ["LacZ", "NTC"],
"analysis_parameters": {
"fdr_threshold": 0.05,
"lfc_threshold": 1.0,
"gene_aggregation": "mean"
}
}
Workflow:
Output Example:
Essential Gene Screen Results:
Total sgRNAs tested: 65,383
Significantly depleted: 3,847 sgRNAs (FDR<0.05, LFC<-1)
Top Essential Genes:
RPL30: mean LFC = -4.2, 5/5 sgRNAs significant
RPS19: mean LFC = -3.8, 4/5 sgRNAs significant
PCNA: mean LFC = -3.5, 5/5 sgRNAs significant
QC Metrics:
Gini index: 0.25 (excellent library representation)
Read depth: 25M per sample (sufficient)
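A viability screen is usually sanity-checked against its controls before any hits are trusted: positive controls should be clearly depleted and negative controls should sit near zero. The sketch below illustrates that check on a gene-to-mean-LFC mapping. `check_controls` is a hypothetical helper, the positive-control values are taken from the example output above, and the `LacZ` and `NTC` values are invented for illustration.

```python
from statistics import median


def check_controls(gene_lfc, positive, negative, lfc_threshold=1.0):
    """Summarise control behaviour from a {gene: mean LFC} mapping."""
    pos_found = [g for g in positive if g in gene_lfc]
    neg_found = [g for g in negative if g in gene_lfc]
    depleted = [g for g in pos_found if gene_lfc[g] < -lfc_threshold]
    return {
        "positive_depleted": depleted,
        "positive_failed": [g for g in pos_found if g not in depleted],
        "negative_median_lfc": (
            median(gene_lfc[g] for g in neg_found) if neg_found else None
        ),
        # Controls listed in the config but absent from the results
        "missing": [g for g in positive + negative if g not in gene_lfc],
    }


gene_lfc = {"RPL30": -4.2, "RPS19": -3.8, "PCNA": -3.5, "LacZ": 0.1, "NTC": -0.05}
report = check_controls(
    gene_lfc,
    positive=["RPL30", "RPS19", "PCNA"],
    negative=["LacZ", "NTC"],
)
print(report)
```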
Scenario: Identify genes whose knockout confers resistance to a cytotoxic drug (e.g., vemurafenib in BRAF-mutant melanoma).
{
"screen_type": "drug_resistance",
"treatment": "vemurafenib (2 μM)",
"control": "DMSO",
"duration": "14 days",
"expected_depletions": "Drug sensitizers, synthetic lethal",
"expected_enrichments": "Drug resistance genes",
"known_resistance_genes": ["NRAS", "MAP2K1", "MED12"],
"analysis_parameters": {
"fdr_threshold": 0.05,
"lfc_threshold": 1.0,
"focus": "enrichments"
}
}
Workflow:
Output Example:
Drug Resistance Screen Results (Vemurafenib):
Significant enrichments: 156 sgRNAs (FDR<0.05, LFC>1)
Top Resistance Genes:
NRAS: mean LFC = +2.8, 4/5 sgRNAs enriched
MAP2K1: mean LFC = +2.5, 5/5 sgRNAs enriched
MED12: mean LFC = +2.1, 3/5 sgRNAs enriched
Validation recommended:
- Test individual sgRNAs in dose-response assay
- Confirm resistance phenotype with cell viability assay
- Check for known resistance mechanisms
Scenario: Identify genes that, when knocked out, sensitize cells to drug treatment (synthetic lethal interactions).
{
"screen_type": "drug_sensitivity",
"treatment": "PARP inhibitor (olaparib)",
"control": "DMSO",
"cell_line": "BRCA1-mutant ovarian cancer",
"expected_depletions": "DNA repair genes (synthetic lethal)",
"expected_enrichments": "Drug resistance mechanisms",
"known_synthetic_lethal": ["PARP1", "BRCA2", "PALB2"],
"analysis_parameters": {
"fdr_threshold": 0.05,
"lfc_threshold": 1.0,
"focus": "depletions"
}
}
Workflow:
Output Example:
Synthetic Lethality Screen (Olaparib in BRCA1-mutant):
Significant depletions: 234 sgRNAs (FDR<0.05, LFC<-1)
Top Synthetic Lethal Hits:
BRCA2: mean LFC = -3.2, 5/5 sgRNAs depleted
PALB2: mean LFC = -2.8, 4/5 sgRNAs depleted
RAD51C: mean LFC = -2.5, 5/5 sgRNAs depleted
Biological Interpretation:
- Depleted hits are dominated by homologous recombination genes
- Consistent with known synthetic lethal interactions
- Potential combination therapy targets identified
Scenario: Compare genetic dependencies between two cell lines to identify lineage-specific vulnerabilities.
{
"screen_type": "comparative",
"comparison": "Melanoma_vs_Lung_cancer",
"cell_lines": ["A375", "SKMEL28", "A549", "H1299"],
"analysis_type": "differential_essentiality",
"expected_lineage_specific": {
"melanoma": ["MITF", "SOX10", "TYR"],
"lung": ["NKX2-1", "TP63"]
},
"analysis_parameters": {
"fdr_threshold": 0.05,
"lfc_threshold": 1.0,
"replicate_requirement": 2
}
}
Workflow:
Output Example:
Comparative Screen: Melanoma vs Lung Cancer
Melanoma-specific essential: 127 genes
Lung-specific essential: 203 genes
Common essential: 1,847 genes
Top Melanoma-Specific Dependencies:
MITF: LFC diff = -4.5 (essential in melanoma, not lung)
SOX10: LFC diff = -3.8
TYR: LFC diff = -3.2
Top Lung-Specific Dependencies:
NKX2-1: LFC diff = -3.9
TP63: LFC diff = -3.1
Therapeutic Implications:
- Lineage-specific targets identified
- Potential for tumor-type selective therapy
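The "LFC diff" values in the output above can be read as the mean gene-level LFC in one lineage minus the mean in the other. The sketch below computes that difference under this assumed definition, which the source does not spell out. `lfc_difference` is a hypothetical helper and every LFC value is invented for illustration.

```python
from statistics import mean


def lfc_difference(lfc_by_line, group_a, group_b):
    """Per-gene mean LFC in group_a minus mean LFC in group_b."""
    # Only compare genes measured in every cell line
    genes = set.intersection(
        *(set(lfc_by_line[line]) for line in group_a + group_b)
    )
    return {
        gene: mean(lfc_by_line[line][gene] for line in group_a)
        - mean(lfc_by_line[line][gene] for line in group_b)
        for gene in genes
    }


# Hypothetical gene-level LFCs per cell line
lfc_by_line = {
    "A375":    {"MITF": -4.6, "NKX2-1": 0.1,  "RPL30": -4.0},
    "SKMEL28": {"MITF": -4.4, "NKX2-1": -0.1, "RPL30": -4.2},
    "A549":    {"MITF": 0.0,  "NKX2-1": -3.8, "RPL30": -4.1},
    "H1299":   {"MITF": 0.0,  "NKX2-1": -4.0, "RPL30": -4.1},
}
diff = lfc_difference(lfc_by_line, ["A375", "SKMEL28"], ["A549", "H1299"])
for gene, value in sorted(diff.items(), key=lambda kv: kv[1]):
    print(f"{gene}: LFC diff = {value:+.1f}")
```

A strongly negative difference marks a dependency specific to the first group, a strongly positive one marks a dependency specific to the second, and a pan-essential gene such as the `RPL30` example lands near zero.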
Pre-Analysis Checks:
During Analysis:
Post-Analysis Verification:
Before Validation or Publication:
Experimental Design Issues:
❌ Insufficient sequencing depth → Poor statistical power, missed hits
❌ Library bottleneck → Gini index >0.4, skewed representation
❌ Inadequate replicates → High variance, irreproducible results
❌ Wrong time point → Too early (no selection) or too late (extensive dropout)
Analysis Issues:
❌ Ignoring QC metrics → Analyzing poor quality data
❌ Incorrect sample assignment → Control/treatment mix-up
❌ Single sgRNA hits → Potential off-target effects
❌ Over-reliance on p-values → Many false positives with large library
Interpretation Issues:
❌ Ignoring cell number effects → Different growth rates confound results
❌ Off-target effects dominating → False positive hits
❌ Pan-essential vs selective → Misclassifying broadly essential genes
❌ Not validating hits → Publishing false positives
Technical Issues:
❌ Batch effects → Confounding by library prep or sequencing batch
❌ Contamination → Cross-sample contamination affects quantification
❌ Reference genome mismatch → sgRNAs not mapping correctly
❌ Incomplete annotation → sgRNAs missing gene mapping
Problem: No significant hits despite strong biological effect
Problem: Too many significant hits (1000s)
Problem: High Gini index (>0.4)
Problem: Known essential genes not identified
Problem: Discordant sgRNAs for same gene
Problem: Batch effects between replicates
Problem: Negative controls showing significant effects
Available in references/ directory:
External Resources:
Located in scripts/ directory:
main.py - CRISPR screen analysis engine with QC, RRA, and hit identification
| Screen Type | Comparison | Expected Hits | Typical Duration |
|---|---|---|---|
| Viability | T14 vs T0 | Essential genes depleted | 10-14 days |
| Drug Resistance | Drug vs DMSO | Resistance genes enriched | 14-21 days |
| Drug Sensitivity | Drug vs DMSO | Sensitizers depleted | 14-21 days |
| Comparative | Cell A vs Cell B | Lineage-specific dependencies | 10-14 days |
| Sensitizer | Drug A+B vs Drug A | Combination targets | 10-14 days |
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| --counts, -c | string | - | Yes | sgRNA count matrix file |
| --samples, -s | string | - | Yes | Sample annotation file |
| --control | string | - | No | Control samples (comma-separated) |
| --treatment, -t | string | - | No | Treatment samples (comma-separated) |
| --output, -o | string | - | No | Output directory |
| --fdr | float | 0.05 | No | FDR threshold |
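For readers wiring these options into their own wrapper, the parameter table maps onto `argparse` as in the sketch below. This is an illustrative reconstruction from the table only, not the actual parser in `scripts/main.py`; `build_parser` and `split_samples` are hypothetical names.

```python
import argparse


def build_parser():
    """Parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="CRISPR screen analysis")
    parser.add_argument("--counts", "-c", required=True, help="sgRNA count matrix file")
    parser.add_argument("--samples", "-s", required=True, help="Sample annotation file")
    parser.add_argument("--control", help="Control samples (comma-separated)")
    parser.add_argument("--treatment", "-t", help="Treatment samples (comma-separated)")
    parser.add_argument("--output", "-o", help="Output directory")
    parser.add_argument("--fdr", type=float, default=0.05, help="FDR threshold")
    return parser


def split_samples(value):
    """Turn 'Ctrl1,Ctrl2' into ['Ctrl1', 'Ctrl2']; a missing value gives []."""
    return [s.strip() for s in value.split(",") if s.strip()] if value else []


# Parse an explicit argument list so the example runs without a command line
args = build_parser().parse_args(
    ["--counts", "counts.txt", "--samples", "samples.csv",
     "--control", "Ctrl1,Ctrl2", "--fdr", "0.01"]
)
print(args.counts, split_samples(args.control), args.fdr)
```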
# Analyze CRISPR screen data
python scripts/main.py --counts sgrna_counts.txt --samples samplesheet.csv
# With specific control and treatment
python scripts/main.py --counts counts.txt --samples samples.csv --control "Ctrl1,Ctrl2" --treatment "Treat1,Treat2"
# Custom FDR threshold
python scripts/main.py --counts counts.txt --samples samples.csv --fdr 0.01 --output ./results
| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python script executed locally | Low |
| Network Access | No external API calls | Low |
| File System Access | Read count files, write results | Low |
| Data Exposure | Processes genomic screening data | Medium |
| PHI Risk | May contain cell line genetic info | Low |
# Python 3.7+
numpy
pandas
scipy
Last Updated: 2026-02-09
Skill ID: 183
Version: 2.0 (K-Dense Standard)