Use when writing BigQuery queries, optimizing query performance, analyzing execution plans, or avoiding common SQL gotchas. Covers parameterized queries, UDFs, scripting, window functions (QUALIFY, ROW_NUMBER, RANK, LEAD/LAG), JSON functions, ARRAY/STRUCT operations, BigQuery-specific features (EXCEPT, REPLACE, SAFE_*), CTE re-execution issues, NOT IN with NULLs, DML performance, Standard vs Legacy SQL, and performance best practices.
Use this skill when writing, debugging, or optimizing BigQuery SQL queries for performance and efficiency.
-- BigQuery has no EXPLAIN statement; estimate bytes scanned with a dry run
bq query --use_legacy_sql=false --dry_run 'SELECT * FROM `project.dataset.table` WHERE condition'
-- Inspect the executed plan afterwards via job metadata (or the console's Execution details tab)
SELECT job_id, total_bytes_processed, total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_id = 'your_job_id';
What the plan shows:
❌ Bad (scans all columns):
SELECT * FROM `project.dataset.large_table`
✅ Good (only needed columns):
SELECT customer_id, amount, date
FROM `project.dataset.large_table`
Impact: BigQuery storage is columnar, so cost scales with the columns you read. Selecting only needed columns can reduce data scanned by 90%+.
❌ Bad (filter after aggregation):
SELECT customer_id, SUM(amount) as total
FROM `project.dataset.orders`
GROUP BY customer_id
HAVING SUM(amount) > 1000
✅ Good (filter before aggregation):
SELECT customer_id, SUM(amount) as total
FROM `project.dataset.orders`
WHERE amount > 100 -- Filter rows early (note: this also changes which rows are summed)
GROUP BY customer_id
HAVING SUM(amount) > 1000
Without partition filter:
-- Scans entire table
SELECT * FROM `project.dataset.orders`
WHERE order_date >= '2024-01-01'
With partition filter:
-- Only scans relevant partitions
SELECT * FROM `project.dataset.orders`
WHERE DATE(order_timestamp) >= '2024-01-01' -- order_timestamp is the partitioning column
Key: Filter on the partition column for automatic partition pruning.
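Pruning assumes the table was created with a matching partition spec; a sketch (table and column names illustrative):

```sql
-- Partition by day on the timestamp column, and require a partition filter
CREATE TABLE `project.dataset.orders`
PARTITION BY DATE(order_timestamp)
OPTIONS (require_partition_filter = TRUE) AS
SELECT * FROM `project.dataset.orders_staging`;
```

With require_partition_filter set, queries that omit the partition filter fail instead of silently scanning the whole table.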
❌ Anti-pattern (one huge query):
SELECT ...
FROM (
SELECT ...
FROM (
SELECT ... -- Deeply nested
)
)
WHERE ...
✅ Good (use CTEs):
WITH base_data AS (
SELECT customer_id, amount, date
FROM `project.dataset.orders`
WHERE date >= '2024-01-01'
),
aggregated AS (
SELECT customer_id, SUM(amount) as total
FROM base_data
GROUP BY customer_id
)
SELECT * FROM aggregated WHERE total > 1000
✅ Better (multi-statement with temp tables):
CREATE TEMP TABLE base_data AS
SELECT customer_id, amount, date
FROM `project.dataset.orders`
WHERE date >= '2024-01-01';
CREATE TEMP TABLE aggregated AS
SELECT customer_id, SUM(amount) as total
FROM base_data
GROUP BY customer_id;
SELECT * FROM aggregated WHERE total > 1000;
Put largest table first:
-- ✅ Large table first
SELECT l.*, s.detail
FROM `project.dataset.large_table` l
JOIN `project.dataset.small_table` s
ON l.id = s.id
Use clustering on JOIN columns:
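A sketch of clustering on the join key (names illustrative; clustering works best on high-cardinality columns):

```sql
CREATE TABLE `project.dataset.large_table_clustered`
PARTITION BY DATE(created_at)
CLUSTER BY id AS
SELECT * FROM `project.dataset.large_table`;
```

Joins and filters on id can then skip non-matching storage blocks.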
Consider ARRAY/STRUCT for 1:many:
-- Instead of JOIN for 1:many relationships
SELECT
order_id,
ARRAY_AGG(STRUCT(product_id, quantity, price)) as items
FROM `project.dataset.order_items`
GROUP BY order_id
BigQuery automatically performs:
Ensure your queries enable these:
bq query \
--use_legacy_sql=false \
--parameter=start_date:DATE:2024-01-01 \
--parameter=end_date:DATE:2024-12-31 \
--parameter=min_amount:FLOAT64:100.0 \
'SELECT *
FROM `project.dataset.orders`
WHERE order_date BETWEEN @start_date AND @end_date
AND amount >= @min_amount'
from google.cloud import bigquery
client = bigquery.Client()
query = """
SELECT customer_id, SUM(amount) as total
FROM `project.dataset.orders`
WHERE order_date >= @start_date
GROUP BY customer_id
"""
job_config = bigquery.QueryJobConfig(
query_parameters=[
bigquery.ScalarQueryParameter("start_date", "DATE", "2024-01-01")
]
)
results = client.query_and_wait(query, job_config=job_config)
Key points:
@ prefix for named parameters in SQL
CLI format: name:TYPE:value, or name::value (type defaults to STRING)
CREATE TEMP FUNCTION CleanEmail(email STRING)
RETURNS STRING
AS (
LOWER(TRIM(email))
);
SELECT CleanEmail(customer_email) as email
FROM `project.dataset.customers`;
CREATE TEMP FUNCTION ParseUserAgent(ua STRING)
RETURNS STRUCT<browser STRING, version STRING>
LANGUAGE js AS r"""
var match = ua.match(/(Chrome|Firefox|Safari)\/(\d+)/);
return {
browser: match ? match[1] : 'Unknown',
version: match ? match[2] : '0'
};
""";
SELECT ParseUserAgent(user_agent).browser as browser
FROM `project.dataset.sessions`;
Limitations:
-- Create once, use many times
CREATE FUNCTION `project.dataset.clean_email`(email STRING)
RETURNS STRING
AS (LOWER(TRIM(email)));
-- Use anywhere
SELECT `project.dataset.clean_email`(email) FROM ...
DECLARE total_orders INT64;
SET total_orders = (SELECT COUNT(*) FROM `project.dataset.orders`);
SELECT total_orders;
LOOP:
DECLARE x INT64 DEFAULT 0;
LOOP
SET x = x + 1;
IF x >= 10 THEN
LEAVE;
END IF;
END LOOP;
WHILE:
DECLARE x INT64 DEFAULT 0;
WHILE x < 10 DO
SET x = x + 1;
END WHILE;
FOR with arrays:
DECLARE ids ARRAY<STRING>;
SET ids = ['id1', 'id2', 'id3'];
FOR item IN (SELECT * FROM UNNEST(ids) as id)
DO
-- Process each id
SELECT id;
END FOR;
Automatic caching (24 hours):
To bypass cache:
bq query --use_cache=false 'SELECT...'
-- LIMIT doesn't reduce data scanned or cost!
SELECT * FROM `project.dataset.huge_table` LIMIT 10
Impact: Still scans entire table. Use WHERE filters instead.
-- Prevents partition pruning
WHERE CAST(date_column AS STRING) = '2024-01-01'
✅ Better:
WHERE date_column = DATE('2024-01-01')
-- Cartesian product = huge result
SELECT * FROM table1 CROSS JOIN table2
Impact: Can generate millions/billions of rows.
-- Runs subquery for each row
SELECT *
FROM orders o
WHERE amount > (SELECT AVG(amount) FROM orders WHERE customer_id = o.customer_id)
✅ Better (use window functions):
SELECT *
FROM (
SELECT *, AVG(amount) OVER (PARTITION BY customer_id) as avg_amount
FROM orders
)
WHERE amount > avg_amount
Problem: When a CTE is referenced multiple times, BigQuery re-executes it each time, billing you multiple times.
❌ Bad (CTE runs 3 times - billed 3x):
WITH expensive_cte AS (
SELECT * FROM `project.dataset.huge_table`
WHERE complex_conditions
AND lots_of_joins
)
SELECT COUNT(*) FROM expensive_cte
UNION ALL
SELECT SUM(amount) FROM expensive_cte
UNION ALL
SELECT MAX(date) FROM expensive_cte;
Impact: If the CTE scans 10 TB, you're billed for 30 TB (10 TB × 3).
✅ Good (use temp table - billed 1x):
CREATE TEMP TABLE expensive_data AS
SELECT * FROM `project.dataset.huge_table`
WHERE complex_conditions
AND lots_of_joins;
SELECT COUNT(*) FROM expensive_data
UNION ALL
SELECT SUM(amount) FROM expensive_data
UNION ALL
SELECT MAX(date) FROM expensive_data;
When CTEs are OK:
When to use temp tables:
Problem: NOT IN returns NOTHING (empty result) if ANY NULL exists in the subquery.
❌ Broken (returns empty if blocked_customers has any NULL):
SELECT * FROM customers
WHERE customer_id NOT IN (
SELECT customer_id FROM blocked_customers -- If ANY NULL, returns 0 rows!
);
Why it fails:
Any comparison with NULL evaluates to UNKNOWN, so x NOT IN (a, NULL) is never TRUE
NOT UNKNOWN is still UNKNOWN, and rows whose predicate is UNKNOWN are filtered out
✅ Solution 1: Use NOT EXISTS (safest):
SELECT * FROM customers c
WHERE NOT EXISTS (
SELECT 1 FROM blocked_customers b
WHERE b.customer_id = c.customer_id
);
✅ Solution 2: Filter NULLs explicitly:
SELECT * FROM customers
WHERE customer_id NOT IN (
SELECT customer_id
FROM blocked_customers
WHERE customer_id IS NOT NULL -- Explicit NULL filter
);
✅ Solution 3: Use LEFT JOIN:
SELECT c.*
FROM customers c
LEFT JOIN blocked_customers b
ON c.customer_id = b.customer_id
WHERE b.customer_id IS NULL;
Best practice: Prefer NOT EXISTS - it's clearer, safer, and often faster.
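The three-valued logic behind this gotcha can be sketched in plain Python, modeling NULL and UNKNOWN as None (a pedagogical sketch, not BigQuery code):

```python
# SQL three-valued logic: NULL/UNKNOWN modeled as None.
# Shows why "x NOT IN (subquery with a NULL)" never returns TRUE.

def sql_eq(a, b):
    """SQL equality: any comparison involving NULL is UNKNOWN (None)."""
    if a is None or b is None:
        return None
    return a == b

def sql_not(x):
    """SQL NOT: NOT UNKNOWN is still UNKNOWN."""
    return None if x is None else (not x)

def sql_in(value, subquery_values):
    """value IN (...) under three-valued logic."""
    saw_unknown = False
    for v in subquery_values:
        eq = sql_eq(value, v)
        if eq is True:
            return True
        if eq is None:
            saw_unknown = True
    # No match: UNKNOWN if any comparison was UNKNOWN, else FALSE
    return None if saw_unknown else False

# customer_id NOT IN (subquery containing a NULL):
result = sql_not(sql_in(1, [2, 3, None]))
# UNKNOWN (None) - the row is filtered out even though 1 is not blocked
```

Because WHERE keeps only rows whose predicate is TRUE, every row evaluating to UNKNOWN is dropped, which is why the result set is empty.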
Problem: BigQuery is optimized for analytics (OLAP), not transactional updates (OLTP). DML statements are slow and expensive.
Why DML is slow in BigQuery:
❌ Very slow (row-by-row updates):
-- Don't do this - takes minutes/hours
UPDATE `project.dataset.orders`
SET status = 'processed'
WHERE order_id = '12345';
-- This is even worse - runs once per row
FOR record IN (SELECT order_id FROM orders_to_update)
DO
UPDATE orders SET status = 'processed' WHERE order_id = record.order_id;
END FOR;
⚠️ Better (batch updates):
UPDATE `project.dataset.orders`
SET status = 'processed'
WHERE order_id IN (SELECT order_id FROM orders_to_update);
✅ Best (recreate table - fastest):
CREATE OR REPLACE TABLE `project.dataset.orders` AS
SELECT
* EXCEPT(status),
CASE
WHEN order_id IN (SELECT order_id FROM orders_to_update)
THEN 'processed'
ELSE status
END AS status
FROM `project.dataset.orders`;
For INSERT/UPSERT - use MERGE:
MERGE `project.dataset.customers` AS target
USING `project.dataset.customer_updates` AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
UPDATE SET name = source.name, updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
INSERT (customer_id, name, created_at)
VALUES (source.customer_id, source.name, CURRENT_TIMESTAMP());
Best practices:
INT64 is the only integer type:
-- All of these are the same: INT64
CREATE TABLE example (
col1 INT64, -- ✅ Explicit
col2 INTEGER, -- Converted to INT64
col3 INT, -- Converted to INT64
col4 SMALLINT, -- Converted to INT64
col5 BIGINT -- Converted to INT64
);
No UUID type - use STRING:
-- PostgreSQL
CREATE TABLE users (id UUID);
-- BigQuery
CREATE TABLE users (id STRING); -- Store UUID as string
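BigQuery can generate the identifier itself with GENERATE_UUID():

```sql
-- Random version-4 UUID, returned as a STRING
SELECT GENERATE_UUID() AS id;
```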
NUMERIC precision limits:
-- NUMERIC: 38 digits precision, 9 decimal places
NUMERIC(38, 9)
-- BIGNUMERIC: 76 digits precision, 38 decimal places
BIGNUMERIC(76, 38)
-- Example
SELECT
CAST('12345678901234567890.123456789' AS NUMERIC) AS num,
CAST('12345678901234567890.123456789' AS BIGNUMERIC) AS bignum;
TIMESTAMP vs DATETIME vs DATE:
-- TIMESTAMP: UTC, timezone-aware
SELECT CURRENT_TIMESTAMP(); -- 2024-01-15 10:30:45.123456 UTC
-- DATETIME: No timezone
SELECT CURRENT_DATETIME(); -- 2024-01-15 10:30:45.123456
-- DATE: Date only
SELECT CURRENT_DATE(); -- 2024-01-15
-- Conversion
SELECT
TIMESTAMP('2024-01-15 10:30:45'), -- Assumes UTC
DATETIME(TIMESTAMP '2024-01-15 10:30:45'), -- Loses timezone
DATE(TIMESTAMP '2024-01-15 10:30:45'); -- Loses time
Type coercion in JOINs:
-- ❌ Implicit cast can prevent optimization
SELECT * FROM table1 t1
JOIN table2 t2
ON t1.id = CAST(t2.id AS STRING); -- Prevents clustering optimization
-- ✅ Match types explicitly
SELECT * FROM table1 t1
JOIN table2 t2
ON t1.id = t2.id; -- Both STRING
Window functions perform calculations across a set of rows related to the current row, without collapsing the result set.
<function> OVER (
[PARTITION BY partition_expression]
[ORDER BY sort_expression]
[window_frame_clause]
)
ROW_NUMBER() - Sequential numbering:
SELECT
customer_id,
order_date,
amount,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS row_num
FROM orders;
Common use: Deduplication
SELECT * EXCEPT(row_num)
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS row_num
FROM customers
)
WHERE row_num = 1;
RANK() - Rank with gaps:
SELECT
product_name,
revenue,
RANK() OVER (ORDER BY revenue DESC) AS rank
FROM products;
-- Results:
-- Product A: $1000, rank 1
-- Product B: $900, rank 2
-- Product C: $900, rank 2 (tie)
-- Product D: $800, rank 4 (gap after tie)
DENSE_RANK() - Rank without gaps:
SELECT
product_name,
revenue,
DENSE_RANK() OVER (ORDER BY revenue DESC) AS dense_rank
FROM products;
-- Results:
-- Product A: $1000, rank 1
-- Product B: $900, rank 2
-- Product C: $900, rank 2 (tie)
-- Product D: $800, rank 3 (no gap)
NTILE() - Divide into N buckets:
SELECT
customer_id,
total_spend,
NTILE(4) OVER (ORDER BY total_spend DESC) AS quartile
FROM customer_totals;
LEAD() and LAG() - Access rows before/after:
-- Time series analysis
SELECT
date,
revenue,
LAG(revenue) OVER (ORDER BY date) AS prev_day_revenue,
LEAD(revenue) OVER (ORDER BY date) AS next_day_revenue,
revenue - LAG(revenue) OVER (ORDER BY date) AS daily_change,
ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY date)) / LAG(revenue) OVER (ORDER BY date), 2) AS pct_change
FROM daily_sales
ORDER BY date;
With PARTITION BY:
-- Per-customer analysis
SELECT
customer_id,
order_date,
amount,
LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_order_amount,
amount - LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS amount_diff
FROM orders;
FIRST_VALUE() and LAST_VALUE():
SELECT
date,
revenue,
FIRST_VALUE(revenue) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_day_revenue,
LAST_VALUE(revenue) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_day_revenue
FROM daily_sales;
NTH_VALUE() - Get Nth value:
SELECT
product_id,
date,
sales,
NTH_VALUE(sales, 2) OVER (PARTITION BY product_id ORDER BY date) AS second_day_sales
FROM product_sales;
SUM/AVG/COUNT as window functions:
SELECT
date,
revenue,
SUM(revenue) OVER (ORDER BY date) AS cumulative_revenue,
AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg_7day,
COUNT(*) OVER (ORDER BY date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS rolling_30day_count
FROM daily_sales;
Running totals:
SELECT
customer_id,
order_date,
amount,
SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS customer_lifetime_value
FROM orders;
QUALIFY filters on window function results - no subquery needed!
❌ Standard SQL (verbose):
SELECT *
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS row_num
FROM orders
)
WHERE row_num = 1;
✅ BigQuery with QUALIFY (clean):
SELECT customer_id, order_date, amount
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) = 1;
More QUALIFY examples:
-- Get top 3 products per category
SELECT category, product_name, revenue
FROM products
QUALIFY RANK() OVER (PARTITION BY category ORDER BY revenue DESC) <= 3;
-- Filter outliers (keep middle 80%)
SELECT *
FROM measurements
QUALIFY NTILE(10) OVER (ORDER BY value) BETWEEN 2 AND 9;
-- Get first order per customer
SELECT customer_id, order_date, amount
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date ASC) = 1;
Control which rows are included in the window:
-- ROWS: Physical row count
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW -- Last 4 rows including current
-- RANGE: Logical range based on the ORDER BY value (BigQuery requires a
-- numeric ORDER BY expression; INTERVAL bounds are not supported)
RANGE BETWEEN 6 PRECEDING AND CURRENT ROW -- e.g. with ORDER BY UNIX_DATE(date): last 7 days
-- Examples:
SELECT
date,
sales,
-- Last 7 rows
AVG(sales) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7rows,
-- Last 7 days (logical; RANGE needs a numeric ORDER BY, so convert the date)
AVG(sales) OVER (ORDER BY UNIX_DATE(date) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7days,
-- All preceding rows
SUM(sales) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum,
-- Centered window (3 before, 3 after)
AVG(sales) OVER (ORDER BY date ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) AS centered_avg
FROM daily_sales;
1. Partition appropriately:
-- ✅ Good: Partitions reduce data scanned
SELECT *
FROM events
WHERE date >= '2024-01-01' -- Partition filter
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp DESC) = 1;
2. Avoid window functions in WHERE:
-- ❌ Wrong: Can't use window functions in WHERE
SELECT * FROM orders
WHERE ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY date) = 1; -- ERROR
-- ✅ Use QUALIFY instead
SELECT * FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY date) = 1;
3. Reuse window definitions:
SELECT
date,
revenue,
ROW_NUMBER() OVER w AS row_num,
RANK() OVER w AS rank,
AVG(revenue) OVER w AS avg_revenue
FROM sales
WINDOW w AS (PARTITION BY category ORDER BY revenue DESC);
BigQuery provides rich functions for parsing, extracting, and manipulating JSON data.
JSON_VALUE() - Extract scalar values (Standard SQL):
SELECT
JSON_VALUE(json_column, '$.user.name') AS user_name,
JSON_VALUE(json_column, '$.user.email') AS email,
CAST(JSON_VALUE(json_column, '$.amount') AS FLOAT64) AS amount,
CAST(JSON_VALUE(json_column, '$.quantity') AS INT64) AS quantity
FROM events;
JSON_QUERY() - Extract objects or arrays:
SELECT
JSON_QUERY(json_column, '$.user') AS user_object,
JSON_QUERY(json_column, '$.items') AS items_array
FROM events;
JSON_EXTRACT() - Legacy, still widely used:
SELECT
JSON_EXTRACT(json_column, '$.user.name') AS user_name,
JSON_EXTRACT_SCALAR(json_column, '$.user.email') AS email -- Returns STRING
FROM events;
JSONPath syntax (BigQuery supports only a subset):
-- Dot notation
'$.user.name'
-- Array index
'$.items[0].product_id'
-- NOT supported: slices ('$.items[0:3]'), wildcards ('$.users[*].name'),
-- and recursive descent ('$..name'). Use JSON_EXTRACT_ARRAY + UNNEST
-- to iterate over array elements instead.
JSON_EXTRACT_ARRAY() - Extract array elements:
SELECT
event_id,
tag
FROM events,
UNNEST(JSON_EXTRACT_ARRAY(tags_json, '$')) AS tag;
JSON_VALUE_ARRAY() - Extract array of scalars:
SELECT
product_id,
tag
FROM products,
UNNEST(JSON_VALUE_ARRAY(tags_json, '$')) AS tag;
Complete example:
-- JSON: {"product_id": "A123", "tags": ["electronics", "sale", "featured"]}
SELECT
JSON_VALUE(product_json, '$.product_id') AS product_id,
tag
FROM products,
UNNEST(JSON_VALUE_ARRAY(product_json, '$.tags')) AS tag;
-- Results:
-- A123, electronics
-- A123, sale
-- A123, featured
TO_JSON_STRING() - Convert to JSON:
SELECT
customer_id,
TO_JSON_STRING(STRUCT(
name,
email,
created_at
)) AS customer_json
FROM customers;
Create JSON objects:
SELECT
TO_JSON_STRING(STRUCT(
'John' AS name,
30 AS age,
['reading', 'hiking'] AS hobbies,
STRUCT('123 Main St' AS street, 'Boston' AS city) AS address
)) AS person_json;
-- Result:
-- {"name":"John","age":30,"hobbies":["reading","hiking"],"address":{"street":"123 Main St","city":"Boston"}}
Aggregate into JSON:
SELECT
customer_id,
TO_JSON_STRING(
ARRAY_AGG(
STRUCT(order_id, amount, date)
ORDER BY date DESC
LIMIT 5
)
) AS recent_orders_json
FROM orders
GROUP BY customer_id;
PARSE_JSON() - Convert string to JSON:
SELECT
JSON_VALUE(PARSE_JSON('{"name":"Alice","age":25}'), '$.name') AS name;
Safe JSON parsing (avoid errors):
SELECT
SAFE.JSON_VALUE(invalid_json, '$.name') AS name -- Returns NULL on error
FROM events;
Nested JSON extraction:
-- JSON structure:
-- {
-- "order": {
-- "id": "ORD123",
-- "items": [
-- {"product": "A", "qty": 2, "price": 10.50},
-- {"product": "B", "qty": 1, "price": 25.00}
-- ]
-- }
-- }
SELECT
JSON_VALUE(data, '$.order.id') AS order_id,
JSON_VALUE(item, '$.product') AS product,
CAST(JSON_VALUE(item, '$.qty') AS INT64) AS quantity,
CAST(JSON_VALUE(item, '$.price') AS FLOAT64) AS price
FROM events,
UNNEST(JSON_EXTRACT_ARRAY(data, '$.order.items')) AS item;
Transform relational data to JSON:
SELECT
category,
TO_JSON_STRING(
STRUCT(
category AS category_name,
COUNT(*) AS product_count,
ROUND(AVG(price), 2) AS avg_price,
ARRAY_AGG(
STRUCT(product_name, price)
ORDER BY price DESC
LIMIT 3
) AS top_products
)
) AS category_summary
FROM products
GROUP BY category;
1. Extract once, reuse:
-- ❌ Bad: Multiple extractions
SELECT
JSON_VALUE(data, '$.user.id'),
JSON_VALUE(data, '$.user.name'),
JSON_VALUE(data, '$.user.email')
FROM events;
-- ✅ Better: Extract object once
WITH extracted AS (
SELECT JSON_QUERY(data, '$.user') AS user_json
FROM events
)
SELECT
JSON_VALUE(user_json, '$.id'),
JSON_VALUE(user_json, '$.name'),
JSON_VALUE(user_json, '$.email')
FROM extracted;
2. Consider STRUCT columns instead of JSON:
-- If schema is known and stable, use STRUCT
CREATE TABLE events (
user STRUCT<id STRING, name STRING, email STRING>,
timestamp TIMESTAMP
);
-- Query with dot notation (faster than JSON extraction)
SELECT user.id, user.name, user.email
FROM events;
3. Materialize frequently accessed JSON fields:
-- BigQuery has no generated columns; materialize the extracted value instead
CREATE OR REPLACE TABLE `project.dataset.events` AS
SELECT *, JSON_VALUE(data, '$.user.id') AS user_id
FROM `project.dataset.events`;
-- Now queries can filter efficiently
SELECT * FROM events WHERE user_id = 'U123';
BigQuery's native support for nested and repeated data allows for powerful denormalization and performance optimization.
Creating arrays:
SELECT
[1, 2, 3, 4, 5] AS numbers,
['apple', 'banana', 'cherry'] AS fruits,
[DATE '2024-01-01', DATE '2024-01-02'] AS dates;
ARRAY_AGG() - Aggregate into array:
SELECT
customer_id,
ARRAY_AGG(order_id ORDER BY order_date DESC) AS order_ids,
ARRAY_AGG(amount) AS order_amounts
FROM orders
GROUP BY customer_id;
With LIMIT:
SELECT
customer_id,
ARRAY_AGG(order_id ORDER BY order_date DESC LIMIT 5) AS recent_order_ids
FROM orders
GROUP BY customer_id;
Basic UNNEST:
SELECT element
FROM UNNEST(['a', 'b', 'c']) AS element;
-- Results:
-- a
-- b
-- c
UNNEST with table:
-- Table: customers
-- customer_id | order_ids
-- 1 | [101, 102, 103]
-- 2 | [201, 202]
SELECT
customer_id,
order_id
FROM customers,
UNNEST(order_ids) AS order_id;
-- Results:
-- 1, 101
-- 1, 102
-- 1, 103
-- 2, 201
-- 2, 202
UNNEST with OFFSET (get array index):
SELECT
item,
idx
FROM UNNEST(['first', 'second', 'third']) AS item WITH OFFSET AS idx;
-- Results:
-- first, 0
-- second, 1
-- third, 2
Creating structs:
SELECT
STRUCT('John' AS name, 30 AS age, 'Engineer' AS role) AS person,
STRUCT('123 Main St' AS street, 'Boston' AS city, '02101' AS zip) AS address;
Querying struct fields:
SELECT
person.name,
person.age,
address.city
FROM (
SELECT
STRUCT('John' AS name, 30 AS age) AS person,
STRUCT('Boston' AS city) AS address
);
Create:
SELECT
customer_id,
ARRAY_AGG(
STRUCT(
order_id,
amount,
order_date,
status
)
ORDER BY order_date DESC
) AS orders
FROM orders
GROUP BY customer_id;
Query:
-- Flatten array of struct (ORDER is a reserved word, so alias as o)
SELECT
customer_id,
o.order_id,
o.amount,
o.order_date
FROM customers,
UNNEST(orders) AS o
WHERE o.status = 'completed';
Filter array elements:
SELECT
customer_id,
ARRAY(
SELECT AS STRUCT order_id, amount
FROM UNNEST(orders) AS o -- ORDER is reserved; use a different alias
WHERE o.status = 'completed'
ORDER BY amount DESC
LIMIT 3
) AS top_completed_orders
FROM customers;
ARRAY_LENGTH():
SELECT
customer_id,
ARRAY_LENGTH(order_ids) AS total_orders
FROM customers;
ARRAY_CONCAT():
SELECT ARRAY_CONCAT([1, 2], [3, 4], [5]) AS combined;
-- Result: [1, 2, 3, 4, 5]
ARRAY_TO_STRING():
SELECT ARRAY_TO_STRING(['apple', 'banana', 'cherry'], ', ') AS fruits;
-- Result: 'apple, banana, cherry'
GENERATE_ARRAY():
SELECT GENERATE_ARRAY(1, 10) AS numbers;
-- Result: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
SELECT GENERATE_ARRAY(0, 100, 10) AS multiples_of_10;
-- Result: [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
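GENERATE_DATE_ARRAY() is the date counterpart, handy for building calendar spines:

```sql
SELECT GENERATE_DATE_ARRAY('2024-01-01', '2024-01-07', INTERVAL 1 DAY) AS days;
-- Result: 7 dates, 2024-01-01 through 2024-01-07
```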
ARRAY_REVERSE():
SELECT ARRAY_REVERSE([1, 2, 3, 4, 5]) AS reversed;
-- Result: [5, 4, 3, 2, 1]
Traditional approach (2 tables, JOIN):
-- Table 1: customers (1M rows)
-- Table 2: orders (10M rows)
SELECT
c.customer_id,
c.name,
o.order_id,
o.amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.customer_id = '12345';
-- Scans: customers (1M) + orders (10M) = 11M rows
-- Join cost: High
Array approach (1 table with ARRAY):
-- Table: customers (1M rows with nested orders array)
SELECT
customer_id,
name,
o.order_id,
o.amount
FROM customers,
UNNEST(orders) AS o -- ORDER is reserved; alias as o
WHERE customer_id = '12345';
-- Scans: customers (1M) only
-- No join cost
-- 50-80% faster for 1:many relationships
When to use ARRAY:
When NOT to use ARRAY:
Traditional normalized:
-- 3 tables, 2 JOINs
SELECT
c.customer_id,
c.name,
o.order_id,
oi.product_id,
oi.quantity
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id;
Denormalized with ARRAY/STRUCT:
-- 1 table, no JOINs
CREATE TABLE customers_denormalized AS
SELECT
c.customer_id,
c.name,
ARRAY_AGG(
STRUCT(
o.order_id,
o.order_date,
o.status,
ARRAY(
SELECT AS STRUCT product_id, quantity, price
FROM order_items
WHERE order_id = o.order_id
) AS items
)
ORDER BY o.order_date DESC
) AS orders
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name;
-- Query (no JOINs!)
SELECT
customer_id,
name,
o.order_id,
i.product_id,
i.quantity
FROM customers_denormalized,
UNNEST(orders) AS o,
UNNEST(o.items) AS i
WHERE customer_id = '12345';
Performance improvement: 3-5x faster for typical queries.
EXCEPT - Exclude columns:
-- Select all except sensitive columns
SELECT * EXCEPT(ssn, password, credit_card)
FROM customers;
-- Combine with WHERE
SELECT * EXCEPT(internal_notes)
FROM orders
WHERE status = 'shipped';
REPLACE - Modify columns:
-- Replace column values
SELECT * REPLACE(UPPER(name) AS name, ROUND(price, 2) AS price)
FROM products;
-- Anonymize data
SELECT * REPLACE('***' AS ssn, '***' AS credit_card)
FROM customers;
Combine EXCEPT and REPLACE:
SELECT * EXCEPT(password) REPLACE(LOWER(email) AS email)
FROM users;
SAFE_CAST() - Returns NULL on error:
-- Regular CAST throws error on invalid input
SELECT CAST('invalid' AS INT64); -- ERROR
-- SAFE_CAST returns NULL
SELECT SAFE_CAST('invalid' AS INT64) AS result; -- NULL
SAFE_DIVIDE() - Returns NULL on division by zero:
SELECT
revenue,
orders,
SAFE_DIVIDE(revenue, orders) AS avg_order_value -- NULL if orders = 0
FROM daily_metrics;
Other SAFE functions:
-- SAFE_SUBTRACT (numeric types; NULL on overflow, not on negative results)
SELECT SAFE_SUBTRACT(-9223372036854775807, 2); -- NULL (INT64 underflow)
-- SAFE_NEGATE
SELECT SAFE_NEGATE(9223372036854775807); -- NULL (overflow)
-- SAFE_ADD
SELECT SAFE_ADD(9223372036854775807, 1); -- NULL (overflow)
Use case: Data quality checks:
SELECT
COUNT(*) AS total_rows,
COUNT(SAFE_CAST(amount AS FLOAT64)) AS valid_amounts,
COUNT(*) - COUNT(SAFE_CAST(amount AS FLOAT64)) AS invalid_amounts
FROM transactions;
-- ✅ Valid: Group by column position
SELECT
customer_id,
DATE(order_date) AS order_date,
SUM(amount) AS total
FROM orders
GROUP BY 1, 2; -- Same as: GROUP BY customer_id, DATE(order_date)
-- ✅ Also valid: Mix names and numbers
SELECT
customer_id,
DATE(order_date) AS order_date,
SUM(amount) AS total
FROM orders
GROUP BY customer_id, 2;
When useful:
System sampling (fast, approximate):
-- Sample ~10% of data (by blocks)
SELECT * FROM large_table
TABLESAMPLE SYSTEM (10 PERCENT);
Use cases:
Note: SYSTEM sampling is block-based, not truly random. For exact percentages, use ROW_NUMBER() with RAND().
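The random-sampling alternative mentioned above can be sketched as (table name illustrative):

```sql
-- Roughly 10% of rows, uniformly at random (note: still scans the full table)
SELECT * FROM `project.dataset.large_table`
WHERE RAND() < 0.10;

-- Exactly 1000 random rows
SELECT * FROM `project.dataset.large_table`
QUALIFY ROW_NUMBER() OVER (ORDER BY RAND()) <= 1000;
```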
PIVOT - Rows to columns:
SELECT *
FROM (
SELECT product, quarter, sales
FROM quarterly_sales
)
PIVOT (
SUM(sales)
FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4')
);
-- Before:
-- product | quarter | sales
-- A | Q1 | 100
-- A | Q2 | 150
-- After:
-- product | Q1 | Q2 | Q3 | Q4
-- A | 100 | 150 | ... | ...
UNPIVOT - Columns to rows:
SELECT *
FROM quarterly_totals
UNPIVOT (
sales FOR quarter IN (Q1, Q2, Q3, Q4)
);
-- Before:
-- product | Q1 | Q2 | Q3 | Q4
-- A | 100 | 150 | 200 | 250
-- After:
-- product | quarter | sales
-- A | Q1 | 100
-- A | Q2 | 150
BigQuery has two SQL dialects:
| Feature | Standard SQL | Legacy SQL |
|---|---|---|
| ANSI Compliance | ✅ Yes | ❌ No |
| Recommended | ✅ Yes | ❌ Deprecated |
| Table Reference | `project.dataset.table` | [project:dataset.table] |
| CTEs (WITH) | ✅ Yes | ❌ No |
| Window Functions | ✅ Full support | ⚠️ Limited |
| ARRAY/STRUCT | ✅ Native | ⚠️ Limited |
| Portability | ✅ High | ❌ BigQuery-only |
How to detect Legacy SQL:
-- Legacy SQL indicators:
-- 1. Square brackets for tables
SELECT * FROM [project:dataset.table]
-- 2. GROUP EACH BY
SELECT customer_id, COUNT(*) FROM orders GROUP EACH BY customer_id
-- 3. FLATTEN
SELECT * FROM FLATTEN([project:dataset.table], field)
-- 4. TABLE_DATE_RANGE
SELECT * FROM TABLE_DATE_RANGE([dataset.table_], TIMESTAMP('2024-01-01'), TIMESTAMP('2024-12-31'))
Standard SQL equivalent:
-- 1. Backticks
SELECT * FROM `project.dataset.table`
-- 2. Regular GROUP BY
SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id
-- 3. UNNEST
SELECT * FROM `project.dataset.table`, UNNEST(field)
-- 4. Partitioned table filter
SELECT * FROM `project.dataset.table`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2024-01-01') AND TIMESTAMP('2024-12-31')
Migration:
# Set default to Standard SQL
bq query --use_legacy_sql=false 'SELECT ...'
# Or in Python
job_config = bigquery.QueryJobConfig(use_legacy_sql=False)
Before running expensive queries, use --dry_run to estimate bytes scanned and cost:
bq query --use_legacy_sql=false --dry_run 'SELECT ...'
-- Check query statistics
SELECT
job_id,
user_email,
total_bytes_processed,
total_slot_ms,
TIMESTAMP_DIFF(end_time, start_time, SECOND) as duration_sec
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 10;
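The total_bytes_processed figure above (or from a --dry_run) converts to a rough on-demand cost estimate; the $6.25/TiB rate below is an assumption, so check current BigQuery pricing for your region and billing model:

```python
def estimate_on_demand_cost(bytes_processed: int, price_per_tib: float = 6.25) -> float:
    """Rough on-demand query cost from bytes scanned.

    price_per_tib is an assumed rate; verify against current
    BigQuery on-demand pricing for your region.
    """
    tib = bytes_processed / (1024 ** 4)  # bytes -> tebibytes
    return round(tib * price_per_tib, 4)

# A query whose dry run reports 1 TiB processed:
print(estimate_on_demand_cost(1024 ** 4))  # → 6.25
```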
Immediate improvements: