Name: Skill: experiment-reproduction-and-result-verification
Author: Dingxingdi

1. Capability Definition & Real Case

Professional Definition: The ability to run and monitor an implemented research codebase, then validate and assess whether it actually reproduces the empirical results or intermediate execution outcomes required by a benchmark or rubric.
Dimension Hierarchy: Research Reproduction Engineering->Reproduction and Evaluation->experiment-reproduction-and-result-verification

[Case 1]

Initial Environment: A repository includes implementation code, a reproduce.sh entrypoint, documentation, and a rubric describing what outcomes count as successful replication. Running the script generates logs, tables, and plots.
Real Question: Execute the reproduction pipeline and determine whether the target experimental results have been successfully reproduced.
Real Trajectory: Run reproduce.sh in a clean environment, inspect reproduce.log and generated outputs, compare observed artifacts against the required result criteria, and record which parts fully match, partially match, or fail.
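
The comparison and recording steps of this trajectory can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual grading logic: the function names (classify_result, verify), the metric names, and both tolerance thresholds are hypothetical, and it assumes the observed metrics have already been parsed from reproduce.log or the generated tables into a plain dictionary.

```python
import math
from typing import Dict, Optional


def classify_result(
    observed: Optional[float],
    target: float,
    rel_tol: float = 0.05,      # assumed threshold for a full match
    partial_tol: float = 0.15,  # assumed threshold for a partial match
) -> str:
    """Label one metric as 'match', 'partial', or 'fail'."""
    # A missing or NaN value means the pipeline never produced this result.
    if observed is None or math.isnan(observed):
        return "fail"
    # Relative error against the target; guard against a zero target.
    rel_err = abs(observed - target) / max(abs(target), 1e-12)
    if rel_err <= rel_tol:
        return "match"
    if rel_err <= partial_tol:
        return "partial"
    return "fail"


def verify(observed: Dict[str, float], rubric: Dict[str, float]) -> Dict[str, str]:
    """Compare every metric the rubric requires against what was observed."""
    return {
        name: classify_result(observed.get(name), target)
        for name, target in rubric.items()
    }


# Hypothetical rubric targets and observed values, for illustration only.
rubric = {"accuracy": 0.90, "f1": 0.80, "auc": 0.95}
observed = {"accuracy": 0.91, "f1": 0.70}  # "auc" was never generated

report = verify(observed, rubric)
print(report)  # {'accuracy': 'match', 'f1': 'partial', 'auc': 'fail'}
```

Iterating over the rubric rather than over the observed outputs means a metric that the run never produced is recorded as a failure instead of being silently skipped. A real rubric may call for absolute tolerances, confidence intervals, or qualitative checks on plots, so the thresholds here should be treated as placeholders.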
