Use when analyzing data with Hive/Impala tables, writing SQL for data exploration, or building/deploying Spark ETL jobs on HDFS/YARN. ALWAYS trigger this skill — even if the user does not use these exact words — for any of the following: writing or reviewing a Spark Scala job, migrating SQL from Hive/Impala to Spark, creating or altering Hive tables, inserting data into partitioned tables, joining large tables in Spark SQL, using Spark UDFs, verifying table schema before coding, GROUP BY with text fields, OOM on large tables, INSERT column mismatch or silent data shifts, broadcast join stall or task explosion, DataFrame API being slow, cache() not materializing, metadata not visible after Spark write, date window off-by-one, control character regex not matching, Scala string interpolation bugs in Spark SQL, or any time the user says their Spark job is slow, wrong, or behaving unexpectedly.
| Mode | When | How to behave |
|---|---|---|
| Analysis | Profiling, threshold selection, diversity check, strategy decisions | Run SQL → present numbers → ask user before drawing conclusions |
| Coding | Writing Spark jobs, ETL SQL, table creation | Follow the rules below strictly; never guess types, column order, or table names |
When asked to analyze data or explore thresholds:
The numbers surface trade-offs. The decision belongs to the user.
⚙️ Engine: use Impala for exploration (seconds vs minutes). Switch to Hive only if the query requires features Impala lacks (e.g., LATERAL VIEW EXPLODE).
These rules exist because the resulting bugs are invisible: row counts look correct, no errors thrown, but data is silently wrong or performance collapses.
Never guess field names or types. A single type mismatch silently ruins JOINs.
hive -e "DESCRIBE db.table_name"
hive -e "SHOW CREATE TABLE db.table_name"
If JOIN key types differ across tables, cast explicitly — do not rely on implicit conversion.
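A minimal sketch of the explicit cast (the table and column names here are hypothetical): if orders.user_id is STRING and users.id is BIGINT, cast one side yourself rather than letting the engine pick a common type.

```scala
// Hypothetical tables: orders(user_id STRING), users(id BIGINT).
// An explicit CAST keeps the join-key comparison well-defined; implicit
// coercion may widen both sides (e.g. to DOUBLE) and silently mis-match.
val joined = spark.sql(s"""
  SELECT o.order_id, u.name
  FROM orders o
  JOIN users u
    ON CAST(o.user_id AS BIGINT) = u.id
  WHERE o.dt = '$dt'
""")
```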
Hard-coded names break multi-environment deployment and are impossible to override without recompiling.
// ❌ table name hard-coded in source
val df = spark.sql(s"SELECT * FROM some_db.some_table WHERE dt = '$dt'")
// ✅ receive via CLI args, declare in .job config
val inputTable = cmd.getOptionValue("inputTable")
val df = spark.sql(s"SELECT * FROM $inputTable WHERE dt = '${dt}'")
User-generated text (descriptions, names) often contains ASCII control characters (\x01, \x03, etc. — Hive column delimiters). If such a field is in GROUP BY, Hive/Impala splits one row into many, shifting all downstream field values. Row count still looks normal — the bug is invisible.
Rule: finish all aggregation in CTEs, then append text fields in the outermost SELECT via LEFT JOIN.
-- In Hive/Impala:
WITH core AS (
SELECT key_id, group_concat(cast(id AS string), ',') AS id_list
FROM source_table
GROUP BY key_id -- ⚠️ no free-text fields here
)
SELECT core.*, regexp_replace(t.description, '[\\x00-\\x1f]', '') AS description
FROM core LEFT JOIN text_table t ON core.key_id = t.key_id AND t.dt = '${dt}';
Engine note: group_concat is Hive/Impala syntax. In Spark SQL, use concat_ws(',', collect_list(cast(id AS string))) instead. Mixing engines in one job is a common source of "function not found" errors.
Regex in Spark Scala: inside a Spark Scala s"""...""" string, use [\u0000-\u001f] instead of [\\x00-\\x1f] — Scala processes the backslash before SQL sees it. See references/spark-pitfalls.md §8.1.
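A sketch of the two forms side by side (inputTable and description are placeholder names):

```scala
// ❌ Scala's s-interpolator rewrites \\x to \x before the SQL parser runs,
//    so the character class that reaches the regex engine is not the one
//    you wrote (see spark-pitfalls.md §8.1):
// val bad = s"SELECT regexp_replace(description, '[\\x00-\\x1f]', '') FROM $inputTable"

// ✅ \u escapes are resolved at compile time into the actual control
//    characters, which pass through interpolation unchanged:
val good = s"SELECT regexp_replace(description, '[\u0000-\u001f]', '') AS description FROM $inputTable"
spark.sql(good)
```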
Directly running GROUP BY + group_concat on a billion-row raw table causes OOM. First reduce the set in a CTE, then join the large table.
WITH targets AS (SELECT DISTINCT entity_id FROM ... WHERE conditions),
core AS (
SELECT t.entity_id, group_concat(cast(d.id AS string), ',') AS id_list
FROM targets t
INNER JOIN huge_table d ON t.entity_id = d.entity_id
GROUP BY t.entity_id -- aggregates only the filtered subset
)
SELECT * FROM core; -- this is a read query, not INSERT; see Rule 6 for INSERT patterns
Real project benchmark: DataFrame API (manual cache + repeated .count()) → 367 lines, 45+ Jobs, 3 hours. Spark SQL equivalent → 50 lines, 1 Job, 15 minutes.
Root causes:
- .count() triggers a full Spark Job
- foreachPartition(_ => ()) is a no-op — it does not materialize cache
// ✅
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") // see Rule 5
spark.sql(s"INSERT OVERWRITE TABLE $output PARTITION(dt='${dt}') WITH ... SELECT ...")
If cache() is genuinely needed (result consumed by two independent branches), trigger it with a real Action:
df.count() // ✅ always works; adds one extra Job
df.write.mode("overwrite").format("noop").save() // ✅ Spark 3.0+ built-in noop source; materializes without disk I/O
df.foreachPartition((_: Iterator[Row]) => ()) // ❌ no-op — never use this
Prefer count() unless you have confirmed the noop DataSource is available in your cluster. If uncertain, ask the user.
Remove all debug count()/collect() calls before deploying to production — each one is a full extra Job.
Spark may auto-broadcast a table it considers "small" — if that estimate is wrong, you get task explosion and job timeout. Always set this before any SQL execution when working with tables of unknown or medium-to-large size:
// Set before any SQL runs — has no effect if set after query planning
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
| Table size estimate | Recommendation |
|---|---|
| < 1M rows, known small | Keep default; Spark auto-decides correctly |
| 1M–5M rows, or unsure | Disable first; review Spark UI, re-enable only if safe |
| > 5M rows, or job shows broadcast stall | Always disable |
If you observe a Stage stuck at BroadcastExchange in Spark UI with exploding task counts, disable immediately and rerun. See references/spark-pitfalls.md §3 for details and how to force-broadcast genuinely small tables with SQL hints.
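For the force-broadcast case §3 mentions: Spark SQL's join hints work even with auto-broadcast disabled — a sketch with hypothetical table names:

```scala
// Global auto-broadcast stays off (Rule 5)...
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// ...but a dimension table you have verified is genuinely small can
// still be broadcast explicitly, scoped to this single query:
val df = spark.sql(s"""
  SELECT /*+ BROADCAST(dim) */ f.id, dim.label
  FROM fact_table f
  JOIN dim_table dim ON f.dim_id = dim.id
  WHERE f.dt = '$dt'
""")
```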
CREATE TABLE IF NOT EXISTS does not modify an existing table's schema. INSERT ... SELECT * matches by position. Adding a new column shifts all subsequent columns silently — row count stays the same, data is corrupt.
// ❌
spark.sql(s"INSERT OVERWRITE TABLE $output PARTITION(dt='$dt') SELECT * FROM tmp")
// ✅ always list columns explicitly
spark.sql(s"""INSERT OVERWRITE TABLE $output PARTITION(dt='$dt')
|SELECT col1, col2, col3, new_col FROM tmp""".stripMargin)
When adding a column to an existing table, either:
- DROP TABLE IF EXISTS, then let the job recreate it with the new schema; or
- ALTER TABLE ADD COLUMNS, then update every INSERT statement. After the run, verify the new column has actual values — row count alone cannot detect a column shift.

Dynamic column matching (when schema may vary across runs):
val tableColumns = spark.sql(s"DESCRIBE $output")
.filter("col_name NOT IN ('dt', '') AND col_name NOT LIKE '#%'")
.select("col_name").as[String].collect().toSeq
val dataColumns = result.columns.toSeq
val missingCols = tableColumns.diff(dataColumns)
// ⚠️ order by dataColumns first — NEVER reorder by tableColumns
val selectExprs = dataColumns.map(c => s"`$c`") ++ missingCols.map(c => s"NULL AS `$c`")
result.createOrReplaceTempView("tmp_result")
spark.sql(s"""INSERT OVERWRITE TABLE $output PARTITION(dt='$dt')
|SELECT ${selectExprs.mkString(", ")} FROM tmp_result""".stripMargin)
An INNER JOIN to fetch an optional field (description, tag, title — any field not guaranteed to exist for every row) silently drops non-matching rows. Row count on the primary table looks fine; you only notice the loss if you compare before/after counts explicitly.
-- ❌ rows with no description are silently dropped
... INNER JOIN meta_table meta ON core.id = meta.id
-- ✅ missing metadata becomes NULL; no rows are lost
... LEFT JOIN meta_table meta ON core.id = meta.id AND meta.dt = '${dt}'
Spark writes to HDFS but does not update the Hive Metastore. Querying immediately after a write without refreshing returns stale or empty results.
spark.sql(s"REFRESH TABLE $output") // Spark; in Hive run MSCK REPAIR TABLE, in Impala INVALIDATE METADATA / REFRESH
The restriction is on return types, not parameter types. Seq[String] as a UDF input parameter is completely safe. Only nested Scala collection return types cause NoClassDefFoundError at runtime due to JVM type erasure.
// ✅ Safe input parameter — Seq[String] as an argument is fine
val myUdf = udf((items: Seq[String], size: Int) => {
  // ❌ Unsafe return type Array[Seq[String]] — crashes at runtime:
  //   items.grouped(size).map(_.toSeq).toArray
  // ✅ Safe return: flatten each group into a delimited string
  items.grouped(size).map(_.mkString("|||")).toArray
})
See references/spark-pitfalls.md §5 for the full type-safety table and downstream split reconstruction pattern.
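One way to reconstruct the groups downstream — a sketch, not necessarily the exact pattern in §5 (grouped and group_str are hypothetical names):

```scala
import org.apache.spark.sql.functions.{col, split}

// Each element coming out of the UDF is one "|||"-joined group.
// split() takes a Java regex, so the pipes must be escaped.
val reconstructed = grouped
  .withColumn("group_items", split(col("group_str"), "\\|\\|\\|"))
```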
| Symptom | Root Cause | Rule |
|---|---|---|
| Hard-coded table name in Spark source | Cannot deploy cross-env | Rule 1 |
| New column all NULL / field value shifts | SELECT * + schema change | Rule 6 |
| Dynamic INSERT writes wrong columns | selectExprs ordered by target schema, not data columns | Rule 6 |
| 45+ Jobs, 3-hour runtime | DataFrame API + multiple count() | Rule 4 |
| cache() didn't materialize | foreachPartition(_ => ()) is a no-op | Rule 4 |
| Job timeout, 26k+ tasks | Medium-sized table auto-broadcast | Rule 5 |
| Row explosion, field misalignment | Control characters in GROUP BY field | Rule 2 |
| OOM | Direct aggregation on billion-row table | Rule 3 |
| Silent row loss | INNER JOIN on optional field | Rule 7 |
| Hive/Impala sees no data after write | Metadata not refreshed | Rule 8 |
| JOIN key type mismatch → slow / wrong | Types not verified before coding | Rule 0 |
| Control char regex broken in Spark Scala | [\\x00-\\x1f] in s-string; use [\u0000-\u001f] | spark-pitfalls §8.1 |
| Date window off by one (13d vs 14d) | date_sub(dt, N) + <= dt gives N+1 days | spark-pitfalls §8.3 |
| Spark SQL variable error | $varName near punctuation; use ${varName} | spark-pitfalls §8.2 |
| UDF NoClassDefFoundError | Returning nested Scala collections | Rule 9 |
| DESCRIBE returns junk rows | Missing NOT LIKE '#%' filter on partition info rows | Rule 6 |
| Job slower after "SQL optimization" | Extra back-JOINs cost more than the GROUP BY savings | spark-pitfalls §7.1 |
| Wrong broadcast decision by Catalyst | Missing table statistics; run ANALYZE TABLE | spark-pitfalls §9 |
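For the §8.3 date-window row above, a sketch of an exact N-day window (N = 14; events is a hypothetical table):

```scala
// date_sub(dt, 14) combined with <= dt spans 15 distinct dates, because
// both endpoints are inclusive. For exactly 14 days, subtract N-1:
val df = spark.sql(s"""
  SELECT *
  FROM events
  WHERE dt BETWEEN date_sub('$dt', 13) AND '$dt'  -- exactly 14 days
""")
```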
- Table cleanup: DROP TABLE or an external lifecycle management tool
- Automatic expiration: TBLPROPERTIES extensions — consult your cluster's documentation
- Partition output tables by dt; write one partition per run to support reruns
- Remove debug count()/collect() calls before production deployment
- references/spark-pitfalls.md — detailed root-cause analysis and extended code examples for Rules 1–9
- references/sql-patterns.md — SQL patterns where AI assistants commonly write incorrect code (Rules 2–3 and control character handling)