Optimize BigQuery SQL query through iterative improvement
You are a program whose mission is to optimize a BigQuery SQL query. Here are the steps you should follow:
1. Check whether progress.csv exists.
2. If progress.csv does not exist, this is the first step of the optimization process. In this case:
   - Run the original query to establish a baseline, and record its runtime and cost in progress.csv as the first entry.
   - Save its execution plan to <query_name>.plan.json.
   - Compute identifying statistics and save them to <query_name>.stats.json by querying the results destination table.
3. If progress.csv exists, read the progress.csv file to:
   - Review the optimizations attempted so far and propose the next one.
   - Run the optimized query and append its result to progress.csv.
   - Save its execution plan to <query_name>_step<step_number>.plan.json.

The program accepts the following arguments:

--query <path>: Path to a .sql file containing the query to optimize. The file name (without extension) is used as the query name for job IDs and artifact files.
--budget <float>: The maximum amount (in USD) to spend on query execution during the optimization process. Default is 100 USD.
--iterations <int>: The maximum number of optimization steps to perform. Default is 10.
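For illustration only, these arguments could be parsed as follows in Python; the parse_args helper is hypothetical, not part of the workflow itself:

```python
import argparse
from pathlib import Path

def parse_args(argv=None):
    """Parse the optimizer's command-line arguments (sketch)."""
    p = argparse.ArgumentParser(description="Optimize a BigQuery SQL query iteratively.")
    p.add_argument("--query", required=True, help="Path to the .sql file to optimize.")
    p.add_argument("--budget", type=float, default=100.0, help="Max spend in USD (default: 100).")
    p.add_argument("--iterations", type=int, default=10, help="Max optimization steps (default: 10).")
    args = p.parse_args(argv)
    # The file name (without extension) becomes the query name.
    args.query_name = Path(args.query).stem
    return args
```

For example, parse_args(["--query", "process_steps_impact.sql"]) yields query_name "process_steps_impact" with the default budget and iteration count.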
Optimization progress is tracked in a CSV file called progress.csv. Each row in the file represents a single optimization step, and contains the following columns:

- step: The step number of the optimization process.
- runtime: The runtime of the optimized query in this step, measured in seconds.
- cost: The cost of the optimized query in this step, estimated in USD.
- description: A brief description of the optimization that was performed in this step.
- query: The SQL query after the optimization step was applied.

All queries are run as asynchronous jobs using bq query --nosync, and results are written to session-scoped temp tables using CREATE TEMP TABLE ... AS. This avoids "response too large" errors (which occur even with --nosync for large result sets) and gives us control over job IDs and budget limits.
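For illustration, the progress.csv bookkeeping could be sketched in Python; append_step and spent_so_far are hypothetical helper names:

```python
import csv
import os

FIELDS = ["step", "runtime", "cost", "description", "query"]

def append_step(path, step, runtime, cost, description, query):
    """Append one optimization step to progress.csv, creating it with a header if needed."""
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({"step": step, "runtime": runtime, "cost": cost,
                         "description": description, "query": query})

def spent_so_far(path):
    """Sum the cost column to track how much of the budget has been used."""
    if not os.path.exists(path):
        return 0.0
    with open(path) as f:
        return sum(float(row["cost"]) for row in csv.DictReader(f))
```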
All queries within an optimization run share a BigQuery session. Results stored as temp tables can be queried directly for computing statistics without re-running the original query.
Create a session at the start of an optimization run:
SESSION_ID=$(bq query --nouse_legacy_sql --format=json --create_session \
--location=<DATA_LOCATION> \
--label=qthun:session \
'SELECT 1' | grep 'In session:' | awk '{print $3}')
The --location must match the location of the dataset being queried (e.g., EU, US).
Pass the session ID to all subsequent queries using --session_id:
--session_id=$SESSION_ID
Benefits of using a session:

- CREATE TEMP TABLE requires a session (or a multi-statement script), so the temp-table pattern below depends on it.
- Temp tables persist for the lifetime of the session, so each step's results can be sampled and summarized without re-running the query.
- All jobs in an optimization run are grouped under one session, keeping the run's temp tables isolated from other runs.
Derive the query name from the file path (e.g., process_steps_impact.sql → process_steps_impact). Use it to build predictable job IDs.
Wrap the query in CREATE TEMP TABLE ... AS to store results in a named session temp table. This avoids "response too large" errors and makes results easily queryable for stats.
QUERY_NAME="process_steps_impact"
STEP=0
bq query \
--nouse_legacy_sql \
--nosync \
--job_id="qthun_${QUERY_NAME}_step${STEP}" \
--label=qthun:${QUERY_NAME} \
--maximum_bytes_billed=<remaining_budget_in_bytes> \
--session_id=$SESSION_ID \
"CREATE OR REPLACE TEMP TABLE qthun_step${STEP} AS $(cat ${QUERY_NAME}.sql)"
Then poll for completion:
bq wait "qthun_${QUERY_NAME}_step${STEP}"
| Flag | Purpose |
|---|---|
| --nosync | Submit the job asynchronously and return immediately; poll for completion with bq wait. |
| --job_id | Predictable ID: qthun_<query_name>_step<N>. Makes jobs easy to reference later. |
| --label | Tag as qthun:<query_name> for cost tracking in billing exports. |
| --maximum_bytes_billed | Fail the query if it would scan more than this many bytes. Use to enforce the remaining budget. Convert USD to bytes: budget_usd / 6.25 * (1024^4). |
| --session_id | Attach to the session. Results are stored in session-scoped destination tables. |
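The budget-to-bytes conversion in the table above can be sketched as follows (Python for illustration; the helper names are hypothetical, and $6.25/TiB is the on-demand rate assumed throughout this document):

```python
TIB = 1024 ** 4
USD_PER_TIB = 6.25  # on-demand rate assumed in this document

def usd_to_max_bytes(budget_usd):
    """Convert a remaining USD budget into a --maximum_bytes_billed value."""
    return int(budget_usd / USD_PER_TIB * TIB)

def bytes_to_usd(bytes_billed):
    """Estimate the cost of a job from its totalBytesBilled."""
    return bytes_billed / TIB * USD_PER_TIB
```

For example, a remaining budget of $12.50 allows exactly 2 TiB to be billed.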
After the job completes, use bq show to get execution statistics:
bq show --format=json -j "qthun_${QUERY_NAME}_step${STEP}"
Extract the relevant fields from the JSON output:
- Runtime: statistics.endTime - statistics.startTime (both in milliseconds since epoch). Compute the difference and convert to seconds.
- Cost: statistics.query.totalBytesBilled. BigQuery on-demand pricing is based on bytes billed. To estimate cost in USD: totalBytesBilled / (1024^4) * 6.25 ($6.25 per TiB).

Since results are stored in named temp tables (qthun_step0, qthun_step1, etc.), you can query them directly by name within the session:
# Sample rows
bq query --nouse_legacy_sql --format=json \
--session_id=$SESSION_ID \
"SELECT * FROM qthun_step${STEP} LIMIT 10"
# Compute stats
bq query --nouse_legacy_sql --format=json \
--session_id=$SESSION_ID \
"SELECT COUNT(*) as row_count, ... FROM qthun_step${STEP}"
The execution plan is included in the job metadata returned by bq show. From the JSON output of bq show --format=json -j <job_id>, extract the statistics.query.queryPlan field. This is an array of query stages, each containing:
- name: Stage name.
- id: Stage ID.
- inputStages: IDs of stages that feed into this one.
- startMs / endMs: Timing for the stage.
- waitMsAvg / waitMsMax: Time spent waiting for slots.
- readMsAvg / readMsMax: Time spent reading input.
- computeMsAvg / computeMsMax: Time spent in computation.
- writeMsAvg / writeMsMax: Time spent writing output.
- recordsRead / recordsWritten: Row counts in and out.
- shuffleOutputBytes / shuffleOutputBytesSpilled: Data shuffled between stages (spill indicates memory pressure).
- steps: Array of sub-operations within the stage (e.g., READ, COMPUTE, WRITE, AGGREGATE), each with a substeps description.

Save the full queryPlan array to <query_name>.plan.json (or <query_name>_step<N>.plan.json for optimization steps). Use the plan to identify bottlenecks such as large shuffles, data skew, spills to disk, or stages with disproportionately high compute time.
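For illustration, a first-pass scan of a saved plan for the bottlenecks mentioned above might look like this in Python; find_bottlenecks and the 50% compute-share threshold are hypothetical choices:

```python
def find_bottlenecks(query_plan, compute_ratio=0.5):
    """Flag stages that spilled shuffle data or dominate total compute time."""
    total_compute = sum(int(s.get("computeMsAvg", 0)) for s in query_plan) or 1
    findings = []
    for stage in query_plan:
        # Spilled shuffle bytes indicate memory pressure in the stage.
        if int(stage.get("shuffleOutputBytesSpilled", 0)) > 0:
            findings.append((stage["name"], "shuffle spill (memory pressure)"))
        # Flag stages taking more than compute_ratio of all compute time.
        if int(stage.get("computeMsAvg", 0)) / total_compute > compute_ratio:
            findings.append((stage["name"], "disproportionate compute time"))
    return findings
```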
If we edit a query to optimize it, it's important not to change its behavior. To ensure this, we compute identifying statistics by querying the session-scoped destination table of each step's results (see "Reading query results" above). These should include, for example, the row count and simple aggregates (counts, sums, min/max) over key columns.

These statistics are saved to a <query_name>.stats.json file. After each optimization step, compare the statistics of the optimized query to those of the original query to ensure they match. If they don't match, the optimization has changed the behavior of the query, which is not acceptable.
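For illustration, the equivalence check could be sketched in Python; stats_match is a hypothetical helper, and the float tolerance is an assumption (aggregates over floating-point columns can differ slightly when execution order changes):

```python
import json

def stats_match(original_path, optimized_path, rel_tol=1e-9):
    """Compare two <query_name>.stats.json files key by key."""
    with open(original_path) as f:
        original = json.load(f)
    with open(optimized_path) as f:
        optimized = json.load(f)
    if original.keys() != optimized.keys():
        return False
    for key in original:
        x, y = original[key], optimized[key]
        if isinstance(x, float) or isinstance(y, float):
            # Allow tiny relative differences for floating-point aggregates.
            if abs(x - y) > rel_tol * max(1.0, abs(x), abs(y)):
                return False
        elif x != y:
            return False
    return True
```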
WITH clauses in BigQuery are not materialized — they act like macros and are inlined everywhere they're referenced. If a CTE is referenced multiple times, this causes duplicate execution of the same stages. When a multiply-referenced CTE is expensive, replace it with a CREATE TEMP TABLE to materialize it once.
BigQuery may not reorder WHERE expressions. Place the most selective condition first in AND clauses to enable short-circuiting. For OR clauses, place the least selective first. Prefer simple equalities on BOOL/INT/FLOAT/DATE columns before expensive operations like LIKE or REGEXP. Replace REGEXP_CONTAINS with LIKE when a simple pattern suffices.
Avoid casting columns in WHERE clauses (e.g., CAST(date_col AS STRING)). Filtering on native types enables partition and cluster pruning at the storage level, rather than requiring slot workers to do the filtering.
Perform aggregations or filtering in subqueries before joining, so that the join operates on fewer rows. This reduces shuffle volume. Especially important when join keys are non-unique on both sides, which causes row explosion.
Place the largest table first in the FROM clause, followed by the smallest, then by decreasing size. BigQuery's optimizer only auto-reorders under specific conditions, so manual ordering is recommended.
Push WHERE filters to apply to both sides of a join, not just one. Check the execution plan to confirm filters are applied as early as possible. If not, use subqueries to pre-filter each table before joining.
Wrapping join columns in functions (e.g., TRIM(), LOWER()) prevents BigQuery from performing join optimizations like snowflake join optimization. Clean data during ingestion/ELT instead. This alone can yield 90%+ improvement in slot time.
Self-joins used for row-dependent relationships (e.g., comparing a row to its previous row) can square the output size. Replace with window functions like LAG(), LEAD(), ROW_NUMBER(), etc.
Aggregate as late and seldom as possible, since aggregation is costly. Exception: if early aggregation drastically reduces the row count before a join, aggregate early. This only works when both sides of the join are already at the same granularity (one row per join key value).
If tables are clustered by a shared key (e.g., account_slug), consider adding that key to join conditions. This can enable co-located joins and reduce shuffles — but test both with and without, as the extra column in the join/shuffle key can increase bytes billed and runtime in some cases.
When adding clustering keys to joins, place the clustering key first in the ON clause (e.g., ON a.account_slug = b.account_slug AND a.id = b.id). In testing, this ordering was ~10% faster than placing it last, likely because BigQuery uses the first condition to guide partition/cluster pruning.
Note: BigQuery may already leverage clustering keys at the scan level even without them in join conditions. Adding them to joins is not always beneficial — always compare runtime and bytes billed before and after.
Feel free to come up with your own optimization techniques, given the execution plan and statistics.