Review a DataFusion Comet pull request for Spark compatibility and implementation correctness. Provides guidance to a reviewer rather than posting comments directly.
Review Comet PR #$ARGUMENTS
Fetch the PR details to understand the scope:
```shell
gh pr view $ARGUMENTS --repo apache/datafusion-comet --json title,body,author,isDraft,state,files
```
Before forming your review:
```shell
# View existing comments on a PR
gh pr view $ARGUMENTS --repo apache/datafusion-comet --comments
```
Read the changed files and understand the area of the codebase being modified:
```shell
# View the diff
gh pr diff $ARGUMENTS --repo apache/datafusion-comet
```
For expression PRs, check how similar expressions are implemented in the codebase. Look at the serde files in spark/src/main/scala/org/apache/comet/serde/ and Rust implementations in native/spark-expr/src/.
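The codebase survey above can be partly mechanized with grep. A sketch, demoed on stand-in files so it is self-contained — in a real review, run the same grep against the Comet checkout's `spark/src/main/scala/org/apache/comet/serde/` directory (the expression name `Substring` and the object names below are illustrative):

```shell
# Build a stand-in serde directory for the demo; a real review would grep
# the actual Comet checkout instead.
mkdir -p /tmp/comet-serde-demo
cat > /tmp/comet-serde-demo/strings.scala <<'EOF'
object CometSubstring extends CometExpressionSerde
EOF
cat > /tmp/comet-serde-demo/math.scala <<'EOF'
object CometHex extends CometExpressionSerde
EOF
# Find which serde file already handles a similar expression (here: Substring),
# then read it as a reference pattern for the PR under review.
grep -rl "Substring" /tmp/comet-serde-demo
```

Reading a neighboring implementation first makes it much easier to spot when a PR deviates from the established serde conventions.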
For any PR that adds or modifies an expression, you must read the Spark source code to understand the canonical behavior. This is the authoritative reference for what Comet must match.
Clone or update the Spark repo:
```shell
# Clone if not already present (use /tmp to avoid polluting the workspace)
if [ ! -d /tmp/spark ]; then
  git clone --depth 1 https://github.com/apache/spark.git /tmp/spark
fi
```
Find the expression implementation in Spark:
```shell
# Search for the expression class (e.g., for "Conv", "Hex", "Substring")
find /tmp/spark/sql/catalyst/src/main/scala -name "*.scala" | xargs grep -l "case class <ExpressionName>"
```
Read the Spark implementation carefully. Pay attention to:

- The `eval` and `doGenCode`/`nullSafeEval` methods. These define the exact behavior.
- The `inputTypes` and `dataType` fields. These define which types Spark accepts and what it returns.
- Null handling. Is `nullable = true`? Does `nullSafeEval` handle nulls implicitly?
- Any `require` assertions.
- ANSI mode behavior (`SQLConf.get.ansiEnabled` or `failOnError`).

Read the Spark tests for the expression:
```shell
# Find test files
find /tmp/spark/sql -name "*.scala" -path "*/test/*" | xargs grep -l "<ExpressionName>"
```
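Branches on ANSI mode are worth surfacing mechanically before the line-by-line read. A sketch, using a stand-in file so it runs anywhere — in a real review, point the grep at the clone in `/tmp/spark` (the `Conv` signature below is illustrative, not copied from Spark):

```shell
# Stand-in for a Spark expression source file; grep the real clone under
# /tmp/spark/sql/catalyst/src/main/scala/ in an actual review.
cat > /tmp/spark-expr-demo.scala <<'EOF'
case class Conv(numExpr: Expression, fromBaseExpr: Expression, toBaseExpr: Expression,
    ansiEnabled: Boolean = SQLConf.get.ansiEnabled) extends TernaryExpression
EOF
# Lines that branch on ANSI mode or failOnError are the most likely points
# of divergence between Spark and Comet.
grep -n "ansiEnabled\|failOnError" /tmp/spark-expr-demo.scala
```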
Compare the Spark behavior against the Comet implementation in the PR. Identify:
- Any behavior that differs from Spark and should be flagged or marked `Incompatible`

Suggest additional tests for any edge cases or type combinations covered in Spark's tests that are missing from the PR's tests.
This is the most critical aspect of Comet reviews. Comet must produce identical results to Spark.
For expression PRs, verify against the Spark source you read in step 2:
- Check edge cases
- Verify all data types are handled (compare against `inputTypes` in the Spark source)
- Check for ANSI mode differences
- Confirm any known differences are marked `Incompatible`

Always verify PRs follow the implementation guidelines.
Scala serde (location: `spark/src/main/scala/org/apache/comet/serde/`):

- Conversion goes through `exprToProtoInternal`
- `getSupportLevel` reflects true compatibility:
  - `Compatible()` - matches Spark exactly
  - `Incompatible(Some("reason"))` - differs in documented ways
  - `Unsupported(Some("reason"))` - cannot be implemented
- New expressions belong in the appropriate category file (`datetime.scala`, `strings.scala`, `arithmetic.scala`, etc.) and are registered in `QueryPlanSerde.scala`

Rust implementation (location: `native/spark-expr/src/`):
- Result types

Expression tests should use the SQL file-based framework (`CometSqlFileTestSuite`) where possible. This framework automatically runs each query through both Spark and Comet and compares results; no Scala code is needed. Only fall back to Scala tests in `CometExpressionSuite` when the SQL framework cannot express the test, such as complex DataFrame setup, programmatic data generation, or non-expression tests.
SQL file test location: `spark/src/test/resources/sql-tests/expressions/<category>/`

Categories include: `aggregate/`, `array/`, `string/`, `math/`, `struct/`, `map/`, `datetime/`, `hash/`, etc.
SQL file structure:
```sql
-- Create test data
statement
CREATE TABLE test_crc32(col string, a int, b float) USING parquet

statement
INSERT INTO test_crc32 VALUES ('Spark', 10, 1.5), (NULL, NULL, NULL), ('', 0, 0.0)

-- Default mode: verifies native Comet execution + result matches Spark
query
SELECT crc32(col) FROM test_crc32

-- spark_answer_only: compares results without requiring native execution
query spark_answer_only
SELECT crc32(cast(a as string)) FROM test_crc32

-- tolerance: allows numeric variance for floating-point results
query tolerance=0.0001
SELECT cos(v) FROM test_trig

-- expect_fallback: asserts fallback to Spark occurs
query expect_fallback(unsupported expression)
SELECT unsupported_func(v) FROM test_table

-- expect_error: verifies both engines throw matching exceptions
query expect_error(ARITHMETIC_OVERFLOW)
SELECT 2147483647 + 1

-- ignore: skip queries with known bugs (include GitHub issue link)
query ignore(https://github.com/apache/datafusion-comet/issues/NNNN)
SELECT known_buggy_expr(v) FROM test_table
```
Running SQL file tests:
```shell
# All SQL file tests
./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite" -Dtest=none

# Specific test file (substring match)
./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite crc32" -Dtest=none
```
CRITICAL: Verify all test requirements (regardless of framework):
- Null input handling (e.g., `SELECT expression(NULL)`)
- Literal arguments must be tested with constant folding disabled, so the expression actually executes natively:

```scala
withSQLConf(SQLConf.OPTIMIZER_EXCLUDED_RULES.key ->
    "org.apache.spark.sql.catalyst.optimizer.ConstantFolding") {
  checkSparkAnswerAndOperator("SELECT func(literal)")
}
```
For PRs that add new expressions, performance is not optional. The whole point of Comet is to be faster than Spark. If a new expression is not faster, it may not be worth adding.
Check that the PR includes microbenchmark results. The PR description should contain benchmark numbers comparing Comet vs Spark for the new expression. If benchmark results are missing, flag this as a required addition.
Look for a microbenchmark implementation. Expression benchmarks live in spark/src/test/scala/org/apache/spark/sql/benchmark/. Check whether the PR adds a benchmark for the new expression.
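One quick way to verify the benchmark requirement is to scan the PR's changed-file list for the benchmark directory. A sketch over a stand-in file list — in a real review, the list would come from `gh pr view $ARGUMENTS --repo apache/datafusion-comet --json files`, and all file names below are hypothetical:

```shell
# Stand-in for the PR's changed-file list.
cat > /tmp/pr-files-demo.txt <<'EOF'
native/spark-expr/src/math_funcs/hex.rs
spark/src/main/scala/org/apache/comet/serde/math.scala
spark/src/test/scala/org/apache/spark/sql/benchmark/CometHexBenchmark.scala
EOF
# Flag the PR if no file under the benchmark directory was touched.
if grep -q "org/apache/spark/sql/benchmark/" /tmp/pr-files-demo.txt; then
  echo "benchmark present"
else
  echo "benchmark missing - flag as required"
fi
```

This only confirms a benchmark file exists; the reviewer still needs to check that the PR description includes the actual numbers.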
Review the benchmark results if provided. Review the Rust implementation for performance concerns.
If benchmark results show Comet is slower than Spark, flag this clearly. The PR should explain why the regression is acceptable or include a plan to optimize.
Always check the CI status and summarize any test failures in your review.
```shell
# View CI check status
gh pr checks $ARGUMENTS --repo apache/datafusion-comet

# View failed check details
gh pr checks $ARGUMENTS --repo apache/datafusion-comet --failed
```
Check whether the PR requires updates to user-facing documentation in docs/:
- Compatibility guide (`docs/source/user-guide/compatibility.md`): new expressions or operators should be listed, and incompatible behaviors should be documented.
- Configs (`docs/source/user-guide/configs.md`): new config options should be documented.
- Expressions (`docs/source/user-guide/expressions.md`): new expressions should be added.

If the PR adds a new expression or operator but does not update the relevant docs, flag this as something that needs to be addressed.
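A grep over the docs tree catches the most common omission: the new expression name missing entirely. A self-contained sketch with a stand-in docs tree — in a real review, run it against `docs/source/user-guide/` in the Comet checkout, substituting the PR's expression name for `crc32`:

```shell
# Stand-in docs tree for the demo.
mkdir -p /tmp/comet-docs-demo/user-guide
echo "| crc32 | yes |" > /tmp/comet-docs-demo/user-guide/expressions.md
# Case-insensitive search for the expression name across the user guide.
if grep -rqi "crc32" /tmp/comet-docs-demo/user-guide; then
  echo "expression mentioned in docs"
else
  echo "docs update missing - flag in review"
fi
```

A hit only proves the name appears somewhere, so still skim the matching section to confirm the entry is accurate (especially any `Incompatible` caveats).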
- New expression tests should use the SQL file framework (`CometSqlFileTestSuite`) rather than Scala test suites like `CometExpressionSuite`. Suggest migration if the PR adds Scala tests for expressions that could use SQL files instead.
- After changes to the `common` module, it must be rebuilt with `./mvnw install -pl common -DskipTests`.
- `getSupportLevel`: edge cases should be marked as `Incompatible`.

Present your review as guidance for the reviewer.
Write reviews that sound human and conversational, not formulaic.
IMPORTANT: Never post comments or reviews on the PR directly. This skill is for providing guidance to a human reviewer. Present all findings and suggested comments to the user. The user will decide what to post.