Identify broad ChIP-seq enrichment regions from treatment/control BED signal intervals using overlap-aware fold-enrichment, interval merging, and output normalization. Use when treatment and control coordinates are shifted and a strict 4-column broad peak file is required.
chrom start end signal) into broad peaks.chrom\tstart\tend\tpeak_signal) with no header/comments.Inspect input compatibility before computing enrichment.
Check row counts and first-coordinate alignment; do not assume row-wise matching.
Parse treatment and control intervals into chromosome-grouped, start-sorted lists. Ignore malformed lines and zero/negative-length intervals.
Compute control background per treatment interval using overlap weighting.
For each treatment interval, scan same-chromosome control intervals:
treatment >= 3 * control_background).Keep enriched treatment intervals and sort by chromosome/start.
Merge adjacent enriched intervals using the allowed gap (e.g., 500 bp).
While merging:
Filter merged peaks by final thresholds.
Keep only peaks with:
end - start >= min_widthmerged_signal >= min_peak_signalWrite final output exactly as required.
Using exact coordinate keys between treatment/control.
Evidence: successful runs showed treatment/control were not aligned and even had different row counts. Exact matching can misclassify shifted overlaps as missing control.
Depending on unavailable external tools (e.g., bedtools) without fallback.
Evidence: one run failed mid-pipeline with bedtools: command not found, producing empty intermediates.
Not checking for empty intermediates after command failure.
Evidence: downstream merge step ran on an empty file and produced 0 output peaks until recomputed.
Skipping output normalization constraints (format/order/thresholds).
Tests explicitly check 4 columns, sorted order, minimum width/signal, no headers/comments, fold enrichment, and chromosome validity.
Run lightweight structural and rule checks before submission:
# 1) File exists and non-empty
test -s /workspace/broad_peaks.txt
# 2) 4 fields, numeric coords/signal, thresholds, sorted
awk 'BEGIN{ok=1; pc=""; ps=-1}
NF!=4 {print "BAD_FIELDS",NR,$0; ok=0}
$2!~/^[0-9]+$/ || $3!~/^[0-9]+$/ || $4!~/^[0-9]+(\.[0-9]+)?$/ {print "BAD_TYPES",NR,$0; ok=0}
($3-$2)<1000 {print "BAD_WIDTH",NR,$0; ok=0}
($4+0)<100 {print "BAD_SIGNAL",NR,$0; ok=0}
(pc!="" && ($1<pc || ($1==pc && $2<ps))) {print "BAD_SORT",NR,$0; ok=0}
{pc=$1; ps=$2}
END{if(ok) print "VALID"}' /workspace/broad_peaks.txt
If task tests are available, run them (e.g., pytest /tests/test_outputs.py -rA) to confirm fold-enrichment and expected-region assertions.