Reproduce and isolate a model latency gap between two OpenVINO commits: find the nodes with the largest time difference between commits, map the hot OpenVINO nodes to oneDNN verbose output, convert the verbose lines into a benchdnn repro, and validate the gap at the oneDNN level.
Use this guide when the same model has different latency on two commits and you want a deterministic oneDNN-level repro.
Tools: benchmark_app, onednn_verbose lines, benchdnn.
Inputs:
- model: full path to the model xml (or a model dir accepted by your benchmark_app build), for example /mnt/disk1/xiuchuan/oneDNN_perf_bug/model.
- ref_commit: known-good commit hash.
- bad_commit: known-bad commit hash.
Outputs:
- gap_analysis.txt: summary of the latency difference, hotspot nodes, and oneDNN perf difference.
- benchdnn_command.txt: benchdnn command files for both commits.
# Base OpenVINO repo and two worktrees (recommended)
export OV_REF_REPO=/path/to/ref
export OV_BAD_REPO=/path/to/bad
export REF_COMMIT=<ref_commit>
export BAD_COMMIT=<bad_commit>
# Checkout paths used throughout this guide
export BR_A="$OV_REF_REPO/openvino"
export BR_B="$OV_BAD_REPO/openvino"
# Results directory and target device
export OUT=/path/to/results
export DEVICE=CPU
mkdir -p "$OUT/A" "$OUT/B"
mkdir -p "$OV_REF_REPO"
cd "$OV_REF_REPO"
git clone https://github.com/openvinotoolkit/openvino.git
cd openvino
git reset --hard "$REF_COMMIT"
git submodule update --init --recursive
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo
make -j$(nproc)
# Repeat the clone/checkout/build steps in $OV_BAD_REPO with $BAD_COMMIT
export MODEL_XML=<model>
Run benchmark_app in latency mode, using the exact same command on both branches.
# COMMIT A
cd "$OV_REF_REPO"
source venv/ov/bin/activate
benchmark_app -m "$MODEL_XML" -d "$DEVICE" -hint latency -t 10 -niter 1000 \
${SHAPE:+-shape "$SHAPE"} \
> "$OUT/A/bench_latency.log" 2>&1
deactivate
# COMMIT B
cd "$OV_BAD_REPO"
source venv/ov/bin/activate
benchmark_app -m "$MODEL_XML" -d "$DEVICE" -hint latency -t 10 -niter 1000 \
${SHAPE:+-shape "$SHAPE"} \
> "$OUT/B/bench_latency.log" 2>&1
deactivate
Extract latency summary:
grep -E "Latency|Average|Median|percentile" "$OUT/A/bench_latency.log" | tee "$OUT/A/latency_summary.txt"
grep -E "Latency|Average|Median|percentile" "$OUT/B/bench_latency.log" | tee "$OUT/B/latency_summary.txt"
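To quantify the gap, the two summaries can be compared directly. A minimal sketch; the "Median:" line format (value in the second-to-last whitespace field) is an assumption about benchmark_app's summary output, so adjust the pattern to what your build actually prints:

```shell
# Pull the median latency (ms) out of a benchmark_app log.
# Assumes a summary line like "[ INFO ]    Median:   12.34 ms".
median_ms() { awk '/Median:/ { print $(NF-1); exit }' "$1"; }

# Print both medians plus the absolute and relative gap.
latency_gap() {
  awk -v a="$(median_ms "$1")" -v b="$(median_ms "$2")" \
    'BEGIN { printf "A=%.2f ms  B=%.2f ms  gap=%.2f ms (%+.1f%%)\n", a, b, b - a, (b - a) / a * 100 }'
}

# Usage:
#   latency_gap "$OUT/A/bench_latency.log" "$OUT/B/bench_latency.log"
```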
Enable OpenVINO performance counters and runtime verbose in both branches.
# COMMIT A (repeat same in B)
cd "$OV_REF_REPO"
source venv/ov/bin/activate
benchmark_app -m "$MODEL_XML" -d "$DEVICE" -hint latency -t 30 -niter 300 -pc \
${SHAPE:+-shape "$SHAPE"} \
> "$OUT/A/pc_node.log" 2>&1
Collect top PC nodes (longest first):
grep -E "^\[ INFO \] +[0-9]+ +[0-9.]+ +[0-9.]+ +.*" "$OUT/A/pc_node.log" > "$OUT/A/pc_nodes.txt"
grep -E "^\[ INFO \] +[0-9]+ +[0-9.]+ +[0-9.]+ +.*" "$OUT/B/pc_node.log" > "$OUT/B/pc_nodes.txt"
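A small helper can rank the extracted lines for you. This is a sketch that assumes pc_nodes.txt lines look like "[ INFO ] <idx> <realTime ms> <cpuTime ms> <node ...>", i.e. realTime is whitespace field 5; change the sort key if your benchmark_app build prints a different column order:

```shell
# List the N slowest nodes (by the assumed realTime column, field 5).
# top_nodes is a hypothetical helper name, not part of benchmark_app.
top_nodes() { sort -k5,5gr "$1" | head -n "${2:-3}"; }

# Usage:
#   top_nodes "$OUT/A/pc_nodes.txt" 3 | tee "$OUT/target_nodes.txt"
```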
Pick target nodes (for example the top 3 by total time) and record their op names in $OUT/target_nodes.txt.
Capture onednn_verbose for the found nodes. First capture the full oneDNN verbose output, then filter it down to the primitive kinds/shapes that correspond to the target nodes.
# COMMIT A (repeat same in B)
cd "$OV_REF_REPO"
source venv/ov/bin/activate
OV_CPU_VERBOSE=1 ONEDNN_VERBOSE=all \
benchmark_app -m "$MODEL_XML" -d "$DEVICE" -hint latency -t 30 -niter 300 -pc \
${SHAPE:+-shape "$SHAPE"} \
> "$OUT/A/pc_verbose.log" 2>&1
# Full logs already in pc_verbose.log. Extract oneDNN lines:
grep '^onednn_verbose,' "$OUT/A/pc_verbose.log" > "$OUT/A/onednn_full.log"
grep '^onednn_verbose,' "$OUT/B/pc_verbose.log" > "$OUT/B/onednn_full.log"
# Example: focus on matmul/ip/reorder that usually dominate LLM workloads
grep -E 'onednn_verbose,(exec|create),cpu,(matmul|inner_product|reorder),' "$OUT/A/onednn_full.log" > "$OUT/A/onednn_focus.log"
grep -E 'onednn_verbose,(exec|create),cpu,(matmul|inner_product|reorder),' "$OUT/B/onednn_full.log" > "$OUT/B/onednn_focus.log"
If you already know exact shape fragments from verbose, filter tighter, e.g.:
grep 'mb1ic4096oc4096' "$OUT/A/onednn_focus.log" > "$OUT/A/onednn_target.log"
grep 'mb1ic4096oc4096' "$OUT/B/onednn_focus.log" > "$OUT/B/onednn_target.log"
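To see which primitives actually dominate, the filtered verbose can be aggregated per unique descriptor. A sketch assuming standard onednn_verbose CSV lines with the execution time in ms as the last comma field (true for mainstream oneDNN versions, but verify against your logs); `sum_primitives` is a hypothetical helper name:

```shell
# Sum exec time per unique primitive descriptor, slowest first.
# Only ",exec," lines are counted; the trailing time field is stripped to form the key.
sum_primitives() {
  awk -F, '/^onednn_verbose/ && /,exec,/ { t = $NF; sub(/,[^,]*$/, ""); total[$0] += t }
           END { for (k in total) printf "%10.3f ms  %s\n", total[k], k }' "$1" | sort -gr
}

# Usage (run for A and B, then diff the two summaries):
#   sum_primitives "$OUT/A/onednn_focus.log" | head -n 10
```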
In the OpenVINO tree, oneDNN lives at src/plugins/intel_cpu/thirdparty/onednn. Build Release + benchdnn:
# Branch A
cd "$BR_A/src/plugins/intel_cpu/thirdparty/onednn"
cmake -S . -B build-release \
-DCMAKE_BUILD_TYPE=Release \
-DDNNL_BUILD_TESTS=ON \
-DDNNL_BUILD_EXAMPLES=OFF
cmake --build build-release --target benchdnn -j"$(nproc)"
# Branch B
cd "$BR_B/src/plugins/intel_cpu/thirdparty/onednn"
cmake -S . -B build-release \
-DCMAKE_BUILD_TYPE=Release \
-DDNNL_BUILD_TESTS=ON \
-DDNNL_BUILD_EXAMPLES=OFF
cmake --build build-release --target benchdnn -j"$(nproc)"
Use oneDNN’s official converter script in the same branch:
# Branch A
cd "$OUT/A"
python3 "$BR_A/src/plugins/intel_cpu/thirdparty/onednn/scripts/verbose_converter/verbose_converter.py" \
-i onednn_focus.log \
-o onednn_cases.cmd
# Branch B
cd "$OUT/B"
python3 "$BR_B/src/plugins/intel_cpu/thirdparty/onednn/scripts/verbose_converter/verbose_converter.py" \
-i onednn_focus.log \
-o onednn_cases.cmd
This generates benchdnn-compatible command fragments grouped by driver.
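Since the two branches may emit slightly different case sets, it helps to keep only the cases present in both converter outputs, so A and B run identical commands. A sketch; `common_cases` is a hypothetical helper name:

```shell
# Intersect two case files: sort each, keep only lines common to both.
common_cases() {
  sort -u "$1" > "/tmp/_cases_a.$$"
  sort -u "$2" > "/tmp/_cases_b.$$"
  comm -12 "/tmp/_cases_a.$$" "/tmp/_cases_b.$$"
  rm -f "/tmp/_cases_a.$$" "/tmp/_cases_b.$$"
}

# Usage:
#   common_cases "$OUT/A/onednn_cases.cmd" "$OUT/B/onednn_cases.cmd" > "$OUT/common_cases.cmd"
```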
Run the equivalent benchdnn cases on both branches and compare throughput/time. onednn_cases.cmd may contain many commands; each comparison must run the identical command on both branches.
# COMMIT A
# Pick one case line (CMD_A) from "$OUT/A/onednn_cases.cmd", or run the whole
# file with --batch="$OUT/A/onednn_cases.cmd"
cd "$BR_A/src/plugins/intel_cpu/thirdparty/onednn"
./build-release/tests/benchdnn/benchdnn --mode=P $CMD_A
# COMMIT B
# Pick the same case line (CMD_B) from "$OUT/B/onednn_cases.cmd"
cd "$BR_B/src/plugins/intel_cpu/thirdparty/onednn"
./build-release/tests/benchdnn/benchdnn --mode=P $CMD_B
Note: the case passed to benchdnn must be identical on both branches. If the benchdnn numbers show an obvious gap, the regression reproduces at the oneDNN level, and that primitive (verbose line/command) is the root cause of the performance drop.
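If each benchdnn run is tee'd to a log, the average times can be pulled out for a side-by-side comparison. A sketch assuming benchdnn's default perf output template, where the line starts with "perf," and the average time in ms is the second-to-last comma field; verify the field position against your benchdnn version, and note `perf_avg_ms` is a hypothetical helper name:

```shell
# Extract the average time (ms) from the first perf line of a saved
# benchdnn --mode=P log.
perf_avg_ms() { grep '^perf,' "$1" | head -n1 | awk -F, '{ print $(NF-1) }'; }

# Usage:
#   echo "A: $(perf_avg_ms "$OUT/A/benchdnn.log") ms  B: $(perf_avg_ms "$OUT/B/benchdnn.log") ms"
```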
Finally, summarize the findings in gap_analysis.txt: $OUT/A/latency_summary.txt vs $OUT/B/latency_summary.txt, the hotspot nodes from $OUT/target_nodes.txt, and the benchdnn timings for both commits.