Use when writing BigQuery queries, optimizing query performance, analyzing execution plans, or avoiding common SQL gotchas. Covers parameterized queries, UDFs, scripting, window functions (QUALIFY, ROW_NUMBER, RANK, LEAD/LAG), JSON functions, ARRAY/STRUCT operations, BigQuery-specific features (EXCEPT, REPLACE, SAFE_*), CTE re-execution issues, NOT IN with NULLs, DML performance, Standard vs Legacy SQL, and performance best practices.
Use this skill when writing, debugging, or optimizing BigQuery SQL queries for performance and efficiency.
-- BigQuery has no EXPLAIN statement; estimate bytes scanned with a dry run
bq query --use_legacy_sql=false --dry_run 'SELECT * FROM `project.dataset.table` WHERE condition'
-- Inspect the executed plan afterwards via job metadata (or the console's Execution details tab)
SELECT job_id, total_bytes_processed, total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_id = 'your_job_id';
What the plan shows:
❌ Bad (scans all columns):
SELECT * FROM `project.dataset.large_table`
✅ Good (only needed columns):
SELECT customer_id, amount, date
FROM `project.dataset.large_table`
Impact: BigQuery storage is columnar, so cost scales with the columns you read. Selecting only needed columns can reduce data scanned by 90%+.
❌ Bad (filter after aggregation):
SELECT customer_id, SUM(amount) as total
FROM `project.dataset.orders`
GROUP BY customer_id
HAVING SUM(amount) > 1000
✅ Good (filter before aggregation):
SELECT customer_id, SUM(amount) as total
FROM `project.dataset.orders`
WHERE amount > 100 -- Filter rows early (note: this also changes which rows are summed)
GROUP BY customer_id
HAVING SUM(amount) > 1000
Without partition filter:
-- Scans entire table
SELECT * FROM `project.dataset.orders`
WHERE order_date >= '2024-01-01'
With partition filter:
-- Only scans relevant partitions
SELECT * FROM `project.dataset.orders`
WHERE DATE(order_timestamp) >= '2024-01-01' -- order_timestamp is the partitioning column
Key: Filter on the partition column for automatic partition pruning.
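Pruning assumes the table was created with a matching partition spec; a sketch (table and column names illustrative):

```sql
-- Partition by day on the timestamp column, and require a partition filter
CREATE TABLE `project.dataset.orders`
PARTITION BY DATE(order_timestamp)
OPTIONS (require_partition_filter = TRUE) AS
SELECT * FROM `project.dataset.orders_staging`;
```

With require_partition_filter set, queries that omit the partition filter fail instead of silently scanning the whole table.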
❌ Anti-pattern (one huge query):
SELECT ...
FROM (
SELECT ...
FROM (
SELECT ... -- Deeply nested
)
)
WHERE ...
✅ Good (use CTEs):
WITH base_data AS (
SELECT customer_id, amount, date
FROM `project.dataset.orders`
WHERE date >= '2024-01-01'
),
aggregated AS (
SELECT customer_id, SUM(amount) as total
FROM base_data
GROUP BY customer_id
)
SELECT * FROM aggregated WHERE total > 1000
✅ Better (multi-statement with temp tables):
CREATE TEMP TABLE base_data AS
SELECT customer_id, amount, date
FROM `project.dataset.orders`
WHERE date >= '2024-01-01';
CREATE TEMP TABLE aggregated AS
SELECT customer_id, SUM(amount) as total
FROM base_data
GROUP BY customer_id;
SELECT * FROM aggregated WHERE total > 1000;
Put largest table first:
-- ✅ Large table first
SELECT l.*, s.detail
FROM `project.dataset.large_table` l
JOIN `project.dataset.small_table` s
ON l.id = s.id
Use clustering on JOIN columns:
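A sketch of clustering on the join key (names illustrative; clustering works best on high-cardinality columns):

```sql
CREATE TABLE `project.dataset.large_table_clustered`
PARTITION BY DATE(created_at)
CLUSTER BY id AS
SELECT * FROM `project.dataset.large_table`;
```

Joins and filters on id can then skip non-matching storage blocks.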
Consider ARRAY/STRUCT for 1:many:
-- Instead of JOIN for 1:many relationships
SELECT
order_id,
ARRAY_AGG(STRUCT(product_id, quantity, price)) as items
FROM `project.dataset.order_items`
GROUP BY order_id
BigQuery automatically performs:
Ensure your queries enable these:
bq query \
--use_legacy_sql=false \
--parameter=start_date:DATE:2024-01-01 \
--parameter=end_date:DATE:2024-12-31 \
--parameter=min_amount:FLOAT64:100.0 \
'SELECT *
FROM `project.dataset.orders`
WHERE order_date BETWEEN @start_date AND @end_date
AND amount >= @min_amount'
from google.cloud import bigquery
client = bigquery.Client()
query = """
SELECT customer_id, SUM(amount) as total
FROM `project.dataset.orders`
WHERE order_date >= @start_date
GROUP BY customer_id
"""
job_config = bigquery.QueryJobConfig(
query_parameters=[
bigquery.ScalarQueryParameter("start_date", "DATE", "2024-01-01")
]
)
results = client.query_and_wait(query, job_config=job_config)
Key points:
@ prefix for named parameters in SQL
CLI format: name:TYPE:value, or name::value (type defaults to STRING)
CREATE TEMP FUNCTION CleanEmail(email STRING)
RETURNS STRING
AS (
LOWER(TRIM(email))
);
SELECT CleanEmail(customer_email) as email
FROM `project.dataset.customers`;
CREATE TEMP FUNCTION ParseUserAgent(ua STRING)
RETURNS STRUCT<browser STRING, version STRING>
LANGUAGE js AS r"""
var match = ua.match(/(Chrome|Firefox|Safari)\/(\d+)/);
return {
browser: match ? match[1] : 'Unknown',
version: match ? match[2] : '0'
};
""";
SELECT ParseUserAgent(user_agent).browser as browser
FROM `project.dataset.sessions`;
Limitations:
-- Create once, use many times
CREATE FUNCTION `project.dataset.clean_email`(email STRING)
RETURNS STRING
AS (LOWER(TRIM(email)));
-- Use anywhere
SELECT `project.dataset.clean_email`(email) FROM ...
DECLARE total_orders INT64;
SET total_orders = (SELECT COUNT(*) FROM `project.dataset.orders`);
SELECT total_orders;
LOOP:
DECLARE x INT64 DEFAULT 0;
LOOP
SET x = x + 1;
IF x >= 10 THEN
LEAVE;
END IF;
END LOOP;
WHILE:
DECLARE x INT64 DEFAULT 0;
WHILE x < 10 DO
SET x = x + 1;
END WHILE;
FOR with arrays:
DECLARE ids ARRAY<STRING>;
SET ids = ['id1', 'id2', 'id3'];
FOR item IN (SELECT * FROM UNNEST(ids) as id)
DO
-- Process each id
SELECT id;
END FOR;
Automatic caching (24 hours):
To bypass cache:
bq query --use_cache=false 'SELECT...'
-- LIMIT doesn't reduce data scanned or cost!
SELECT * FROM `project.dataset.huge_table` LIMIT 10
Impact: Still scans entire table. Use WHERE filters instead.
-- Prevents partition pruning
WHERE CAST(date_column AS STRING) = '2024-01-01'
✅ Better:
WHERE date_column = DATE('2024-01-01')
-- Cartesian product = huge result
SELECT * FROM table1 CROSS JOIN table2
Impact: Can generate millions/billions of rows.
-- Runs subquery for each row
SELECT *
FROM orders o
WHERE amount > (SELECT AVG(amount) FROM orders WHERE customer_id = o.customer_id)
✅ Better (use window functions):
SELECT *
FROM (
SELECT *, AVG(amount) OVER (PARTITION BY customer_id) as avg_amount
FROM orders
)
WHERE amount > avg_amount
Problem: When a CTE is referenced multiple times, BigQuery re-executes it each time, billing you multiple times.
❌ Bad (CTE runs 3 times - billed 3x):
WITH expensive_cte AS (
SELECT * FROM `project.dataset.huge_table`
WHERE complex_conditions
AND lots_of_joins
)
SELECT COUNT(*) FROM expensive_cte
UNION ALL
SELECT SUM(amount) FROM expensive_cte
UNION ALL
SELECT MAX(date) FROM expensive_cte;
Impact: If the CTE scans 10 TB, you're billed for 30 TB (10 TB × 3).
✅ Good (use temp table - billed 1x):
CREATE TEMP TABLE expensive_data AS
SELECT * FROM `project.dataset.huge_table`
WHERE complex_conditions
AND lots_of_joins;
SELECT COUNT(*) FROM expensive_data
UNION ALL
SELECT SUM(amount) FROM expensive_data
UNION ALL
SELECT MAX(date) FROM expensive_data;
When CTEs are OK:
When to use temp tables:
Problem: NOT IN returns NOTHING (empty result) if ANY NULL exists in the subquery.
❌ Broken (returns empty if blocked_customers has any NULL):
SELECT * FROM customers
WHERE customer_id NOT IN (
SELECT customer_id FROM blocked_customers -- If ANY NULL, returns 0 rows!
);
Why it fails:
Any comparison with NULL evaluates to UNKNOWN, so x NOT IN (a, NULL) is never TRUE
NOT UNKNOWN is still UNKNOWN, and rows whose predicate is UNKNOWN are filtered out
✅ Solution 1: Use NOT EXISTS (safest):
SELECT * FROM customers c
WHERE NOT EXISTS (
SELECT 1 FROM blocked_customers b
WHERE b.customer_id = c.customer_id
);
✅ Solution 2: Filter NULLs explicitly:
SELECT * FROM customers
WHERE customer_id NOT IN (
SELECT customer_id
FROM blocked_customers
WHERE customer_id IS NOT NULL -- Explicit NULL filter
);
✅ Solution 3: Use LEFT JOIN:
SELECT c.*
FROM customers c
LEFT JOIN blocked_customers b
ON c.customer_id = b.customer_id
WHERE b.customer_id IS NULL;
Best practice: Prefer NOT EXISTS - it's clearer, safer, and often faster.
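The three-valued logic behind this gotcha can be sketched in plain Python, modeling NULL and UNKNOWN as None (a pedagogical sketch, not BigQuery code):

```python
# SQL three-valued logic: NULL/UNKNOWN modeled as None.
# Shows why "x NOT IN (subquery with a NULL)" never returns TRUE.

def sql_eq(a, b):
    """SQL equality: any comparison involving NULL is UNKNOWN (None)."""
    if a is None or b is None:
        return None
    return a == b

def sql_not(x):
    """SQL NOT: NOT UNKNOWN is still UNKNOWN."""
    return None if x is None else (not x)

def sql_in(value, subquery_values):
    """value IN (...) under three-valued logic."""
    saw_unknown = False
    for v in subquery_values:
        eq = sql_eq(value, v)
        if eq is True:
            return True
        if eq is None:
            saw_unknown = True
    # No match: UNKNOWN if any comparison was UNKNOWN, else FALSE
    return None if saw_unknown else False

# customer_id NOT IN (subquery containing a NULL):
result = sql_not(sql_in(1, [2, 3, None]))
# UNKNOWN (None) - the row is filtered out even though 1 is not blocked
```

Because WHERE keeps only rows whose predicate is TRUE, every row evaluating to UNKNOWN is dropped, which is why the result set is empty.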
Problem: BigQuery is optimized for analytics (OLAP), not transactional updates (OLTP). DML statements are slow and expensive.
Why DML is slow in BigQuery:
❌ Very slow (row-by-row updates):
-- Don't do this - takes minutes/hours
UPDATE `project.dataset.orders`
SET status = 'processed'
WHERE order_id = '12345';
-- This is even worse - runs once per row
FOR record IN (SELECT order_id FROM orders_to_update)
DO
UPDATE orders SET status = 'processed' WHERE order_id = record.order_id;
END FOR;
⚠️ Better (batch updates):
UPDATE `project.dataset.orders`
SET status = 'processed'
WHERE order_id IN (SELECT order_id FROM orders_to_update);
✅ Best (recreate table - fastest):
CREATE OR REPLACE TABLE `project.dataset.orders` AS
SELECT
* EXCEPT(status),
CASE
WHEN order_id IN (SELECT order_id FROM orders_to_update)
THEN 'processed'
ELSE status
END AS status
FROM `project.dataset.orders`;
For INSERT/UPSERT - use MERGE:
MERGE `project.dataset.customers` AS target
USING `project.dataset.customer_updates` AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
UPDATE SET name = source.name, updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
INSERT (customer_id, name, created_at)
VALUES (source.customer_id, source.name, CURRENT_TIMESTAMP());
Best practices:
INT64 is the only integer type:
-- All of these are the same: INT64
CREATE TABLE example (
col1 INT64, -- ✅ Explicit
col2 INTEGER, -- Converted to INT64
col3 INT, -- Converted to INT64
col4 SMALLINT, -- Converted to INT64
col5 BIGINT -- Converted to INT64
);
No UUID type - use STRING:
-- PostgreSQL
CREATE TABLE users (id UUID);
-- BigQuery
CREATE TABLE users (id STRING); -- Store UUID as string
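BigQuery can generate the identifier itself with GENERATE_UUID():

```sql
-- Random version-4 UUID, returned as a STRING
SELECT GENERATE_UUID() AS id;
```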
NUMERIC precision limits:
-- NUMERIC: 38 digits precision, 9 decimal places
NUMERIC(38, 9)
-- BIGNUMERIC: 76 digits precision, 38 decimal places
BIGNUMERIC(76, 38)
-- Example
SELECT
CAST('12345678901234567890.123456789' AS NUMERIC) AS num,
CAST('12345678901234567890.123456789' AS BIGNUMERIC) AS bignum;
TIMESTAMP vs DATETIME vs DATE:
-- TIMESTAMP: UTC, timezone-aware
SELECT CURRENT_TIMESTAMP(); -- 2024-01-15 10:30:45.123456 UTC
-- DATETIME: No timezone
SELECT CURRENT_DATETIME(); -- 2024-01-15 10:30:45.123456
-- DATE: Date only
SELECT CURRENT_DATE(); -- 2024-01-15
-- Conversion
SELECT
TIMESTAMP('2024-01-15 10:30:45'), -- Assumes UTC
DATETIME(TIMESTAMP '2024-01-15 10:30:45'), -- Loses timezone
DATE(TIMESTAMP '2024-01-15 10:30:45'); -- Loses time
Type coercion in JOINs:
-- ❌ Implicit cast can prevent optimization
SELECT * FROM table1 t1
JOIN table2 t2
ON t1.id = CAST(t2.id AS STRING); -- Prevents clustering optimization
-- ✅ Match types explicitly
SELECT * FROM table1 t1
JOIN table2 t2
ON t1.id = t2.id; -- Both STRING
Window functions perform calculations across a set of rows related to the current row, without collapsing the result set.
<function> OVER (
[PARTITION BY partition_expression]
[ORDER BY sort_expression]
[window_frame_clause]
)
ROW_NUMBER() - Sequential numbering:
SELECT
customer_id,
order_date,
amount,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS row_num
FROM orders;
Common use: Deduplication
SELECT * EXCEPT(row_num)
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS row_num
FROM customers
)
WHERE row_num = 1;
RANK() - Rank with gaps:
SELECT
product_name,
revenue,
RANK() OVER (ORDER BY revenue DESC) AS rank
FROM products;
-- Results:
-- Product A: $1000, rank 1
-- Product B: $900, rank 2
-- Product C: $900, rank 2 (tie)
-- Product D: $800, rank 4 (gap after tie)
DENSE_RANK() - Rank without gaps:
SELECT
product_name,
revenue,
DENSE_RANK() OVER (ORDER BY revenue DESC) AS dense_rank
FROM products;
-- Results:
-- Product A: $1000, rank 1
-- Product B: $900, rank 2
-- Product C: $900, rank 2 (tie)
-- Product D: $800, rank 3 (no gap)
NTILE() - Divide into N buckets:
SELECT
customer_id,
total_spend,
NTILE(4) OVER (ORDER BY total_spend DESC) AS quartile
FROM customer_totals;
LEAD() and LAG() - Access rows before/after:
-- Time series analysis
SELECT
date,
revenue,
LAG(revenue) OVER (ORDER BY date) AS prev_day_revenue,
LEAD(revenue) OVER (ORDER BY date) AS next_day_revenue,
revenue - LAG(revenue) OVER (ORDER BY date) AS daily_change,
ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY date)) / LAG(revenue) OVER (ORDER BY date), 2) AS pct_change
FROM daily_sales
ORDER BY date;
With PARTITION BY:
-- Per-customer analysis
SELECT
customer_id,
order_date,
amount,
LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_order_amount,
amount - LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS amount_diff
FROM orders;
FIRST_VALUE() and LAST_VALUE():
SELECT
date,
revenue,
FIRST_VALUE(revenue) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_day_revenue,
LAST_VALUE(revenue) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_day_revenue
FROM daily_sales;
NTH_VALUE() - Get Nth value:
SELECT
product_id,
date,
sales,
NTH_VALUE(sales, 2) OVER (PARTITION BY product_id ORDER BY date) AS second_day_sales
FROM product_sales;
SUM/AVG/COUNT as window functions:
SELECT
date,
revenue,
SUM(revenue) OVER (ORDER BY date) AS cumulative_revenue,
AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg_7day,
COUNT(*) OVER (ORDER BY date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS rolling_30day_count
FROM daily_sales;
Running totals:
SELECT
customer_id,
order_date,
amount,
SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS customer_lifetime_value
FROM orders;
QUALIFY filters on window function results - no subquery needed!
❌ Standard SQL (verbose):
SELECT *
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS row_num
FROM orders
)
WHERE row_num = 1;
✅ BigQuery with QUALIFY (clean):
SELECT customer_id, order_date, amount
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) = 1;
More QUALIFY examples:
-- Get top 3 products per category
SELECT category, product_name, revenue
FROM products
QUALIFY RANK() OVER (PARTITION BY category ORDER BY revenue DESC) <= 3;
-- Filter outliers (keep middle 80%)
SELECT *
FROM measurements
QUALIFY NTILE(10) OVER (ORDER BY value) BETWEEN 2 AND 9;
-- Get first order per customer
SELECT customer_id, order_date, amount
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date ASC) = 1;
Control which rows are included in the window:
-- ROWS: Physical row count
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW -- Last 4 rows including current
-- RANGE: Logical range based on the ORDER BY value (BigQuery requires a
-- numeric ORDER BY expression; INTERVAL bounds are not supported)
RANGE BETWEEN 6 PRECEDING AND CURRENT ROW -- e.g. with ORDER BY UNIX_DATE(date): last 7 days
-- Examples:
SELECT
date,
sales,
-- Last 7 rows
AVG(sales) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7rows,
-- Last 7 days (logical; RANGE needs a numeric ORDER BY, so convert the date)
AVG(sales) OVER (ORDER BY UNIX_DATE(date) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7days,
-- All preceding rows
SUM(sales) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum,
-- Centered window (3 before, 3 after)
AVG(sales) OVER (ORDER BY date ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) AS centered_avg
FROM daily_sales;
1. Partition appropriately:
-- ✅ Good: Partitions reduce data scanned
SELECT *
FROM events
WHERE date >= '2024-01-01' -- Partition filter
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp DESC) = 1;
2. Avoid window functions in WHERE:
-- ❌ Wrong: Can't use window functions in WHERE
SELECT * FROM orders
WHERE ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY date) = 1; -- ERROR
-- ✅ Use QUALIFY instead
SELECT * FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY date) = 1;
3. Reuse window definitions:
SELECT
date,
revenue,
ROW_NUMBER() OVER w AS row_num,
RANK() OVER w AS rank,
AVG(revenue) OVER w AS avg_revenue
FROM sales
WINDOW w AS (PARTITION BY category ORDER BY revenue DESC);
BigQuery provides rich functions for parsing, extracting, and manipulating JSON data.
JSON_VALUE() - Extract scalar values (Standard SQL):
SELECT
JSON_VALUE(json_column, '$.user.name') AS user_name,
JSON_VALUE(json_column, '$.user.email') AS email,
CAST(JSON_VALUE(json_column, '$.amount') AS FLOAT64) AS amount,
CAST(JSON_VALUE(json_column, '$.quantity') AS INT64) AS quantity
FROM events;
JSON_QUERY() - Extract objects or arrays:
SELECT
JSON_QUERY(json_column, '$.user') AS user_object,
JSON_QUERY(json_column, '$.items') AS items_array
FROM events;
JSON_EXTRACT() - Legacy, still widely used:
SELECT
JSON_EXTRACT(json_column, '$.user.name') AS user_name,
JSON_EXTRACT_SCALAR(json_column, '$.user.email') AS email -- Returns STRING
FROM events;
JSONPath syntax (BigQuery supports only a subset):
-- Dot notation
'$.user.name'
-- Array index
'$.items[0].product_id'
-- NOT supported: slices ('$.items[0:3]'), wildcards ('$.users[*].name'),
-- and recursive descent ('$..name'). Use JSON_EXTRACT_ARRAY + UNNEST
-- to iterate over array elements instead.
JSON_EXTRACT_ARRAY() - Extract array elements:
SELECT
event_id,
tag
FROM events,
UNNEST(JSON_EXTRACT_ARRAY(tags_json, '$')) AS tag;
JSON_VALUE_ARRAY() - Extract array of scalars:
SELECT
product_id,
tag
FROM products,
UNNEST(JSON_VALUE_ARRAY(tags_json, '$')) AS tag;
Complete example:
-- JSON: {"product_id": "A123", "tags": ["electronics", "sale", "featured"]}
SELECT
JSON_VALUE(product_json, '$.product_id') AS product_id,
tag
FROM products,
UNNEST(JSON_VALUE_ARRAY(product_json, '$.tags')) AS tag;
-- Results:
-- A123, electronics
-- A123, sale
-- A123, featured
TO_JSON_STRING() - Convert to JSON:
SELECT
customer_id,
TO_JSON_STRING(STRUCT(
name,
email,
created_at
)) AS customer_json
FROM customers;
Create JSON objects:
SELECT
TO_JSON_STRING(STRUCT(
'John' AS name,
30 AS age,
['reading', 'hiking'] AS hobbies,
STRUCT('123 Main St' AS street, 'Boston' AS city) AS address
)) AS person_json;
-- Result:
-- {"name":"John","age":30,"hobbies":["reading","hiking"],"address":{"street":"123 Main St","city":"Boston"}}
Aggregate into JSON:
SELECT
customer_id,
TO_JSON_STRING(
ARRAY_AGG(
STRUCT(order_id, amount, date)
ORDER BY date DESC
LIMIT 5
)
) AS recent_orders_json
FROM orders
GROUP BY customer_id;
PARSE_JSON() - Convert string to JSON:
SELECT
JSON_VALUE(PARSE_JSON('{"name":"Alice","age":25}'), '$.name') AS name;
Safe JSON parsing (avoid errors):
SELECT
SAFE.JSON_VALUE(invalid_json, '$.name') AS name -- Returns NULL on error
FROM events;
Nested JSON extraction:
-- JSON structure:
-- {
-- "order": {
-- "id": "ORD123",
-- "items": [
-- {"product": "A", "qty": 2, "price": 10.50},
-- {"product": "B", "qty": 1, "price": 25.00}
-- ]
-- }
-- }
SELECT
JSON_VALUE(data, '$.order.id') AS order_id,
JSON_VALUE(item, '$.product') AS product,
CAST(JSON_VALUE(item, '$.qty') AS INT64) AS quantity,
CAST(JSON_VALUE(item, '$.price') AS FLOAT64) AS price
FROM events,
UNNEST(JSON_EXTRACT_ARRAY(data, '$.order.items')) AS item;
Transform relational data to JSON:
SELECT
category,
TO_JSON_STRING(
STRUCT(
category AS category_name,
COUNT(*) AS product_count,
ROUND(AVG(price), 2) AS avg_price,
ARRAY_AGG(
STRUCT(product_name, price)
ORDER BY price DESC
LIMIT 3
) AS top_products
)
) AS category_summary
FROM products
GROUP BY category;
1. Extract once, reuse:
-- ❌ Bad: Multiple extractions
SELECT
JSON_VALUE(data, '$.user.id'),
JSON_VALUE(data, '$.user.name'),
JSON_VALUE(data, '$.user.email')
FROM events;
-- ✅ Better: Extract object once
WITH extracted AS (
SELECT JSON_QUERY(data, '$.user') AS user_json
FROM events
)
SELECT
JSON_VALUE(user_json, '$.id'),
JSON_VALUE(user_json, '$.name'),
JSON_VALUE(user_json, '$.email')
FROM extracted;
2. Consider STRUCT columns instead of JSON:
-- If schema is known and stable, use STRUCT
CREATE TABLE events (
user STRUCT<id STRING, name STRING, email STRING>,
timestamp TIMESTAMP
);
-- Query with dot notation (faster than JSON extraction)
SELECT user.id, user.name, user.email
FROM events;
3. Materialize frequently accessed JSON fields:
-- BigQuery has no generated columns; materialize the extracted value instead
CREATE OR REPLACE TABLE `project.dataset.events` AS
SELECT *, JSON_VALUE(data, '$.user.id') AS user_id
FROM `project.dataset.events`;
-- Now queries can filter efficiently
SELECT * FROM events WHERE user_id = 'U123';
BigQuery's native support for nested and repeated data allows for powerful denormalization and performance optimization.
Creating arrays:
SELECT
[1, 2, 3, 4, 5] AS numbers,
['apple', 'banana', 'cherry'] AS fruits,
[DATE '2024-01-01', DATE '2024-01-02'] AS dates;
ARRAY_AGG() - Aggregate into array:
SELECT
customer_id,
ARRAY_AGG(order_id ORDER BY order_date DESC) AS order_ids,
ARRAY_AGG(amount) AS order_amounts
FROM orders
GROUP BY customer_id;
With LIMIT:
SELECT
customer_id,
ARRAY_AGG(order_id ORDER BY order_date DESC LIMIT 5) AS recent_order_ids
FROM orders
GROUP BY customer_id;
Basic UNNEST:
SELECT element
FROM UNNEST(['a', 'b', 'c']) AS element;
-- Results:
-- a
-- b
-- c
UNNEST with table:
-- Table: customers
-- customer_id | order_ids
-- 1 | [101, 102, 103]
-- 2 | [201, 202]
SELECT
customer_id,
order_id
FROM customers,
UNNEST(order_ids) AS order_id;
-- Results:
-- 1, 101
-- 1, 102
-- 1, 103
-- 2, 201
-- 2, 202
UNNEST with OFFSET (get array index):
SELECT
item,
idx
FROM UNNEST(['first', 'second', 'third']) AS item WITH OFFSET AS idx;
-- Results:
-- first, 0
-- second, 1
-- third, 2
Creating structs:
SELECT
STRUCT('John' AS name, 30 AS age, 'Engineer' AS role) AS person,
STRUCT('123 Main St' AS street, 'Boston' AS city, '02101' AS zip) AS address;
Querying struct fields:
SELECT
person.name,
person.age,
address.city
FROM (
SELECT
STRUCT('John' AS name, 30 AS age) AS person,
STRUCT('Boston' AS city) AS address
);
Create:
SELECT
customer_id,
ARRAY_AGG(
STRUCT(
order_id,
amount,
order_date,
status
)
ORDER BY order_date DESC
) AS orders
FROM orders
GROUP BY customer_id;
Query:
-- Flatten array of struct (ORDER is a reserved word, so alias as o)
SELECT
customer_id,
o.order_id,
o.amount,
o.order_date
FROM customers,
UNNEST(orders) AS o
WHERE o.status = 'completed';
Filter array elements:
SELECT
customer_id,
ARRAY(
SELECT AS STRUCT order_id, amount
FROM UNNEST(orders) AS o -- ORDER is reserved; use a different alias
WHERE o.status = 'completed'
ORDER BY amount DESC
LIMIT 3
) AS top_completed_orders
FROM customers;
ARRAY_LENGTH():
SELECT
customer_id,
ARRAY_LENGTH(order_ids) AS total_orders
FROM customers;
ARRAY_CONCAT():
SELECT ARRAY_CONCAT([1, 2], [3, 4], [5]) AS combined;
-- Result: [1, 2, 3, 4, 5]
ARRAY_TO_STRING():
SELECT ARRAY_TO_STRING(['apple', 'banana', 'cherry'], ', ') AS fruits;
-- Result: 'apple, banana, cherry'
GENERATE_ARRAY():
SELECT GENERATE_ARRAY(1, 10) AS numbers;
-- Result: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
SELECT GENERATE_ARRAY(0, 100, 10) AS multiples_of_10;
-- Result: [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
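GENERATE_DATE_ARRAY() is the date counterpart, handy for building calendar spines:

```sql
SELECT GENERATE_DATE_ARRAY('2024-01-01', '2024-01-07', INTERVAL 1 DAY) AS days;
-- Result: 7 dates, 2024-01-01 through 2024-01-07
```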
ARRAY_REVERSE():
SELECT ARRAY_REVERSE([1, 2, 3, 4, 5]) AS reversed;
-- Result: [5, 4, 3, 2, 1]
Traditional approach (2 tables, JOIN):
-- Table 1: customers (1M rows)
-- Table 2: orders (10M rows)
SELECT
c.customer_id,
c.name,
o.order_id,
o.amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.customer_id = '12345';
-- Scans: customers (1M) + orders (10M) = 11M rows
-- Join cost: High
Array approach (1 table with ARRAY):
-- Table: customers (1M rows with nested orders array)
SELECT
customer_id,
name,
o.order_id,
o.amount
FROM customers,
UNNEST(orders) AS o -- ORDER is reserved; alias as o
WHERE customer_id = '12345';
-- Scans: customers (1M) only
-- No join cost
-- 50-80% faster for 1:many relationships
When to use ARRAY:
When NOT to use ARRAY:
Traditional normalized:
-- 3 tables, 2 JOINs
SELECT
c.customer_id,
c.name,
o.order_id,
oi.product_id,
oi.quantity
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id;
Denormalized with ARRAY/STRUCT:
-- 1 table, no JOINs
CREATE TABLE customers_denormalized AS
SELECT
c.customer_id,
c.name,
ARRAY_AGG(
STRUCT(
o.order_id,
o.order_date,
o.status,
ARRAY(
SELECT AS STRUCT product_id, quantity, price
FROM order_items
WHERE order_id = o.order_id
) AS items
)
ORDER BY o.order_date DESC
) AS orders
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name;
-- Query (no JOINs!)
SELECT
customer_id,
name,
o.order_id,
i.product_id,
i.quantity
FROM customers_denormalized,
UNNEST(orders) AS o,
UNNEST(o.items) AS i
WHERE customer_id = '12345';
Performance improvement: 3-5x faster for typical queries.
EXCEPT - Exclude columns:
-- Select all except sensitive columns
SELECT * EXCEPT(ssn, password, credit_card)
FROM customers;
-- Combine with WHERE
SELECT * EXCEPT(internal_notes)
FROM orders
WHERE status = 'shipped';
REPLACE - Modify columns:
-- Replace column values
SELECT * REPLACE(UPPER(name) AS name, ROUND(price, 2) AS price)
FROM products;
-- Anonymize data
SELECT * REPLACE('***' AS ssn, '***' AS credit_card)
FROM customers;
Combine EXCEPT and REPLACE:
SELECT * EXCEPT(password) REPLACE(LOWER(email) AS email)
FROM users;
SAFE_CAST() - Returns NULL on error:
-- Regular CAST throws error on invalid input
SELECT CAST('invalid' AS INT64); -- ERROR
-- SAFE_CAST returns NULL
SELECT SAFE_CAST('invalid' AS INT64) AS result; -- NULL
SAFE_DIVIDE() - Returns NULL on division by zero:
SELECT
revenue,
orders,
SAFE_DIVIDE(revenue, orders) AS avg_order_value -- NULL if orders = 0
FROM daily_metrics;
Other SAFE functions:
-- SAFE_SUBTRACT (numeric types; NULL on overflow, not on negative results)
SELECT SAFE_SUBTRACT(-9223372036854775807, 2); -- NULL (INT64 underflow)
-- SAFE_NEGATE
SELECT SAFE_NEGATE(9223372036854775807); -- NULL (overflow)
-- SAFE_ADD
SELECT SAFE_ADD(9223372036854775807, 1); -- NULL (overflow)
Use case: Data quality checks:
SELECT
COUNT(*) AS total_rows,
COUNT(SAFE_CAST(amount AS FLOAT64)) AS valid_amounts,
COUNT(*) - COUNT(SAFE_CAST(amount AS FLOAT64)) AS invalid_amounts
FROM transactions;
-- ✅ Valid: Group by column position
SELECT
customer_id,
DATE(order_date) AS order_date,
SUM(amount) AS total
FROM orders
GROUP BY 1, 2; -- Same as: GROUP BY customer_id, DATE(order_date)
-- ✅ Also valid: Mix names and numbers
SELECT
customer_id,
DATE(order_date) AS order_date,
SUM(amount) AS total
FROM orders
GROUP BY customer_id, 2;
When useful:
System sampling (fast, approximate):
-- Sample ~10% of data (by blocks)
SELECT * FROM large_table
TABLESAMPLE SYSTEM (10 PERCENT);
Use cases:
Note: SYSTEM sampling is block-based, not truly random. For exact percentages, use ROW_NUMBER() with RAND().
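The random-sampling alternative mentioned above can be sketched as (table name illustrative):

```sql
-- Roughly 10% of rows, uniformly at random (note: still scans the full table)
SELECT * FROM `project.dataset.large_table`
WHERE RAND() < 0.10;

-- Exactly 1000 random rows
SELECT * FROM `project.dataset.large_table`
QUALIFY ROW_NUMBER() OVER (ORDER BY RAND()) <= 1000;
```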
PIVOT - Rows to columns:
SELECT *
FROM (
SELECT product, quarter, sales
FROM quarterly_sales
)
PIVOT (
SUM(sales)
FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4')
);
-- Before:
-- product | quarter | sales
-- A | Q1 | 100
-- A | Q2 | 150
-- After:
-- product | Q1 | Q2 | Q3 | Q4
-- A | 100 | 150 | ... | ...
UNPIVOT - Columns to rows:
SELECT *
FROM quarterly_totals
UNPIVOT (
sales FOR quarter IN (Q1, Q2, Q3, Q4)
);
-- Before:
-- product | Q1 | Q2 | Q3 | Q4
-- A | 100 | 150 | 200 | 250
-- After:
-- product | quarter | sales
-- A | Q1 | 100
-- A | Q2 | 150
BigQuery has two SQL dialects:
| Feature | Standard SQL | Legacy SQL |
|---|---|---|
| ANSI Compliance | ✅ Yes | ❌ No |
| Recommended | ✅ Yes | ❌ Deprecated |
| Table Reference | `project.dataset.table` | [project:dataset.table] |
| CTEs (WITH) | ✅ Yes | ❌ No |
| Window Functions | ✅ Full support | ⚠️ Limited |
| ARRAY/STRUCT | ✅ Native | ⚠️ Limited |
| Portability | ✅ High | ❌ BigQuery-only |
How to detect Legacy SQL:
-- Legacy SQL indicators:
-- 1. Square brackets for tables
SELECT * FROM [project:dataset.table]
-- 2. GROUP EACH BY
SELECT customer_id, COUNT(*) FROM orders GROUP EACH BY customer_id
-- 3. FLATTEN
SELECT * FROM FLATTEN([project:dataset.table], field)
-- 4. TABLE_DATE_RANGE
SELECT * FROM TABLE_DATE_RANGE([dataset.table_], TIMESTAMP('2024-01-01'), TIMESTAMP('2024-12-31'))
Standard SQL equivalent:
-- 1. Backticks
SELECT * FROM `project.dataset.table`
-- 2. Regular GROUP BY
SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id
-- 3. UNNEST
SELECT * FROM `project.dataset.table`, UNNEST(field)
-- 4. Partitioned table filter
SELECT * FROM `project.dataset.table`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2024-01-01') AND TIMESTAMP('2024-12-31')
Migration:
# Set default to Standard SQL
bq query --use_legacy_sql=false 'SELECT ...'
# Or in Python
job_config = bigquery.QueryJobConfig(use_legacy_sql=False)
Before running expensive queries, use --dry_run to estimate bytes scanned and cost:
bq query --use_legacy_sql=false --dry_run 'SELECT ...'
-- Check query statistics
SELECT
job_id,
user_email,
total_bytes_processed,
total_slot_ms,
TIMESTAMP_DIFF(end_time, start_time, SECOND) as duration_sec
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 10;
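The total_bytes_processed figure above (or from a --dry_run) converts to a rough on-demand cost estimate; the $6.25/TiB rate below is an assumption, so check current BigQuery pricing for your region and billing model:

```python
def estimate_on_demand_cost(bytes_processed: int, price_per_tib: float = 6.25) -> float:
    """Rough on-demand query cost from bytes scanned.

    price_per_tib is an assumed rate; verify against current
    BigQuery on-demand pricing for your region.
    """
    tib = bytes_processed / (1024 ** 4)  # bytes -> tebibytes
    return round(tib * price_per_tib, 4)

# A query whose dry run reports 1 TiB processed:
print(estimate_on_demand_cost(1024 ** 4))  # → 6.25
```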
Immediate improvements: