Migrate PySpark to SCOS

Name: Migrate PySpark to SCOS
Author: Heath-Moose

Migrate PySpark and Databricks workloads to Snowflake SCOS (Snowpark Connect for Spark). Use when: converting Spark code to run on Snowflake, analyzing PySpark compatibility, updating imports to Spark Connect equivalents, or migrating from Databricks. Triggers: migrate pyspark, convert spark, scos migration, spark connect, pyspark compatibility, snowpark connect.

Heath-Moose · 0 stars · March 26, 2026

Category: Database Tools

Skill Content

Migrate a PySpark workload to be compatible with Snowflake SCOS (Snowpark Connect for Spark).

When to Load

[snowpark-connect] Intent Detection: Load after the user indicates migration intent (convert, migrate, update imports, rewrite for SCOS).

Arguments

$ARGUMENTS - Path to the PySpark file or directory to migrate

Prerequisites

uv Package Manager

Check if uv is installed:

uv --version

If not installed:

curl -LsSf https://astral.sh/uv/install.sh | sh

Snowflake Connection

uv run --project <SKILL_DIRECTORY> \
  python <SKILL_DIRECTORY>/scripts/analyze_pyspark.py \
  --path <FILE_OR_DIR> \
  --output-format json > analysis.json

uv run --project <SKILL_DIRECTORY> \
  python -c "
from snowflake.snowpark import Session
session = Session.builder.config('connection_name', 'default').create()
result = session.sql('''
SELECT COUNT(*) as cnt FROM SCOS_MIGRATION.INFORMATION_SCHEMA.CORTEX_SEARCH_SERVICES 
WHERE SERVICE_NAME = 'SCOS_COMPAT_ISSUES_SERVICE'
''').collect()
print('EXISTS' if result[0]['CNT'] > 0 else 'NOT_FOUND')
"

The RAG knowledge base is not set up yet. I need to initialize it once.

Please provide your Snowflake warehouse name for creating the Cortex Search Service:

uv run --project <SKILL_DIRECTORY> \
  python <SKILL_DIRECTORY>/scripts/rag/scos_rag.py --warehouse <USER_PROVIDED_WAREHOUSE>

uv run --project <SKILL_DIRECTORY> \
  python <SKILL_DIRECTORY>/scripts/analyze_pyspark.py \
  --path $ARGUMENTS --output-format json --rag-backend <cortex|remote> > analysis.json

[
  {
    "file": "src/etl/transformations.py",
    "lines": "142-142",
    "code": "combined = df1.unionByName(df2, allowMissingColumns=True)",
    "final_risk": 0.4,
    "root_cause": "unionByName with allowMissingColumns may fail if there are type mismatches between corresponding columns in the two DataFrames",
    "explanation": "This code may fail if the DataFrames have columns with matching names but incompatible types. If schemas are compatible or only missing columns exist, it should work correctly.",
    "fix": "Ensure column types match between DataFrames before union, or explicitly cast columns to compatible types",
    "confidence": "MEDIUM"
  }
]
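
The analysis output above can be grouped by file to seed the file manifest. The following is a minimal sketch, not part of the skill's scripts: the sample records, `issues_per_file`, and `render_manifest` are hypothetical names, and only the `file`, `lines`, `final_risk`, and `fix` fields are taken from the output format shown above.

```python
import json
from collections import Counter

# Hypothetical records in the same shape as analysis.json (fields trimmed for brevity).
ANALYSIS_JSON = """
[
  {"file": "src/etl/transformations.py", "lines": "142-142", "final_risk": 0.4, "fix": "Cast columns before union"},
  {"file": "src/etl/transformations.py", "lines": "201-205", "final_risk": 0.95, "fix": null},
  {"file": "src/etl/loader.py", "lines": "17-17", "final_risk": 0.2, "fix": "Write Parquet instead of ORC"}
]
"""


def issues_per_file(issues):
    """Count analysis issues per file path."""
    return Counter(issue["file"] for issue in issues)


def render_manifest(all_files, counts):
    """One manifest line per file; files without issues are still listed."""
    lines = []
    for path in all_files:
        n = counts.get(path, 0)
        if n:
            lines.append(f"  ✎ {path} - {n} issue{'s' if n != 1 else ''} from analysis")
        else:
            lines.append(f"  · {path} - no issues (still needs import/header updates)")
    return lines


issues = json.loads(ANALYSIS_JSON)
counts = issues_per_file(issues)
manifest = render_manifest(
    ["src/etl/transformations.py", "src/etl/loader.py", "src/config.py"], counts
)
print("\n".join(manifest))
```

Files that never appear in the analysis stay in the manifest, because they still need the import and header updates.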

# Copy a single file:
cp "$ARGUMENTS" "${ARGUMENTS%.py}_scos.py"

# Copy a directory:
cp -r "$ARGUMENTS" "${ARGUMENTS}_scos"

# List the manifest for a single file:
echo "${ARGUMENTS%.py}_scos.py"

# List the manifest for a directory:
find "${ARGUMENTS}_scos" -name "*.py" -type f | sort

File Manifest:
  ✎ src/etl/transformations.py  — 3 issues from analysis
  ✎ src/etl/loader.py           — 1 issue from analysis
  · src/utils/helpers.py         — no issues (still needs import/header updates)
  · src/config.py                — no issues (still needs import/header updates)
  · src/__init__.py              — no issues (still needs import/header updates)

Use the Tool's Fix: If the issue object provides a fix value, use it. It is tailored to the specific error.
Handle RDDs: RDD operations (final_risk near 1.0) are not supported. You MUST rewrite them using DataFrame transformations or SQL expressions. Read references/rdd-conversion.md for detailed conversion rules and examples.
Unsupported Formats: Change file formats if required (e.g., ORC/Avro -> Parquet).
No-Op Operations: Operations such as hint(), repartition(), and coalesce() are silently ignored in SCOS: they have no effect and raise no errors. Leave the code as-is, with no code change and no comment.
No-Op Configs: Spark configs that SCOS does not support (category: "No-Op Config") are likewise silently ignored. Leave them as-is, with no code change and no comment. Common no-op configs include spark.sql.shuffle.partitions, spark.executor.memory, spark.driver.memory, and spark.sql.adaptive.enabled.
Missing Fixes: If fix is null, use the root_cause to determine the best workaround. If unsure, add a TODO comment: # SCOS: TODO - <explanation>.
File Reads: For file read operations (.read.csv, .read.json, .read.parquet, .load), check the path being read:
- Already using Snowflake stage (@STAGE_NAME/... or @~/...): No comment needed, this is optimal.
- External cloud storage (paths starting with s3://, s3a://, gs://, abfs://, wasb://, adl://): Add a performance comment recommending upload to a Snowflake stage.
- Local paths or variables: If the path is a variable, trace it to determine whether it points to external cloud storage. In either case, add a performance comment recommending upload to a Snowflake stage.
```
# SCOS: Performance tip - Consider uploading this file to a Snowflake stage
# for faster processing. Use: session.file.put("local_path", "@STAGE_NAME/path")
df = spark.read.csv("s3://bucket/path/file.csv", header=True)
```
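
The path rules for file reads can be expressed as a small classifier. This is a sketch under stated assumptions: `classify_read_path` and `needs_performance_comment` are hypothetical helpers, and the prefix list is copied from the rules above, so schemes not listed there (for example `abfss://`) fall through to "local" and still receive the comment.

```python
# Prefixes taken from the File Reads rules; nothing else is assumed to be external.
EXTERNAL_PREFIXES = ("s3://", "s3a://", "gs://", "abfs://", "wasb://", "adl://")


def classify_read_path(path):
    """Return 'stage', 'external', or 'local' for a literal read path."""
    if path.startswith("@"):
        return "stage"  # @STAGE_NAME/... or @~/... is already optimal
    if path.startswith(EXTERNAL_PREFIXES):
        return "external"
    return "local"


def needs_performance_comment(path):
    """Only reads from a Snowflake stage skip the performance comment."""
    return classify_read_path(path) != "stage"


for sample in ("@MY_STAGE/raw/file.csv", "s3://bucket/path/file.csv", "/tmp/file.csv"):
    print(sample, classify_read_path(sample), needs_performance_comment(sample))
```

When the path is a variable, resolve it to its literal value first and classify that value.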

Snowflake Connector Pushdown (Recommended): If code uses the Spark Snowflake Connector (.format("snowflake") with .options(...) and .load()), recommend replacing it with SnowflakeSession.sql(). The connector is supported and functional in SCOS, but SnowflakeSession provides a better experience: simpler code, no connector config boilerplate, and direct use of the Snowpark Connect session. Since this is a recommendation (not a required fix), add a comment with the complete suggested replacement code while keeping the original code intact.

BEFORE:

rest_data_info = spark.read \
   .format("snowflake") \
   .options(**sfOptions) \
   .option("sfDatabase", "BRAND_PLK") \
   .option("sfSchema", "STORES") \
   .option("sfWarehouse", "ANALYSIS_PLK") \
   .option("query", f"""
       select store_id as rest_no, full_address as rest_address
       from STORES where status = 'OPEN'
   """) \
   .load()

Comment with suggested replacement:

# SCOS: Recommended improvement - The Snowflake Connector (.format("snowflake")) works
# in SCOS but SnowflakeSession.sql() provides a better experience. Suggested replacement:
#
#   from snowflake.snowpark_connect.snowflake_session import SnowflakeSession
#   snowflake_session = SnowflakeSession(spark)
#   snowflake_session.sql("USE DATABASE BRAND_PLK").collect()
#   snowflake_session.sql("USE SCHEMA STORES").collect()
#   snowflake_session.sql("USE WAREHOUSE ANALYSIS_PLK").collect()
#   rest_data_info = snowflake_session.sql("""
#       select store_id as rest_no, full_address as rest_address
#       from STORES where status = 'OPEN'
#   """)
rest_data_info = spark.read \
   .format("snowflake") \
   .options(**sfOptions) \
   ...
   .load()

Key mapping rules for the suggestion:

Extract the SQL from .option("query", ...) and pass it to snowflake_session.sql()
If .option("dbtable", "TABLE_NAME") is used instead of query, suggest snowflake_session.sql("SELECT * FROM TABLE_NAME")
Map sfDatabase, sfSchema, sfWarehouse options to USE DATABASE/SCHEMA/WAREHOUSE statements
The from snowflake.snowpark_connect.snowflake_session import SnowflakeSession import should appear once per file
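
The mapping rules can be sketched as a function that turns the connector options into the suggested comment lines. It only builds text and never calls Snowflake. `suggest_replacement` and `USE_OPTIONS` are hypothetical names, and the options are assumed to have been collected into a plain dict beforehand.

```python
# Connector option -> USE statement keyword, per the mapping rules above.
USE_OPTIONS = (("sfDatabase", "DATABASE"), ("sfSchema", "SCHEMA"), ("sfWarehouse", "WAREHOUSE"))


def suggest_replacement(options, target="df"):
    """Return the suggestion as a list of '#'-prefixed comment lines."""
    if "query" in options:
        sql = options["query"].strip()
    elif "dbtable" in options:
        sql = f"SELECT * FROM {options['dbtable']}"
    else:
        return []  # neither query nor dbtable: nothing to suggest

    code = [
        "from snowflake.snowpark_connect.snowflake_session import SnowflakeSession",
        "snowflake_session = SnowflakeSession(spark)",
    ]
    for key, keyword in USE_OPTIONS:
        if key in options:
            code.append(f'snowflake_session.sql("USE {keyword} {options[key]}").collect()')
    code.append(f'{target} = snowflake_session.sql("""{sql}""")')

    # A multi-line query spans several lines, so split before prefixing.
    return [f"#   {line}".rstrip() for line in "\n".join(code).splitlines()]


suggestion = suggest_replacement(
    {"sfDatabase": "BRAND_PLK", "sfSchema": "STORES", "dbtable": "STORES"},
    target="rest_data_info",
)
print("\n".join(suggestion))
```

This sketch emits the import line on every call; when a file contains several connector reads, keep the import in the first suggestion only.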

UDF Serialization (ALL UDF patterns: udf(), @udf, @pandas_udf, applyInPandas, mapInPandas, factory-style udf() calls): When the workload uses UDFs that call helper functions, reference module-level variables, or import external modules, these will fail on Snowflake's server-side worker because cloudpickle serializes function references that point to the workload module (which doesn't exist on the server). Read references/udf-dependencies.md (Part 2) for the tiered fix approach:
- Tier 1 (Preferred): Use snowpark.connect.udf.packages for Anaconda packages and snowpark.connect.udf.python.imports for custom modules uploaded to a stage. Import inside the UDF body.
- Tier 2: For UDFs with simple logic (including factory-style udf() calls that return udf(fn, type)), keep all logic self-contained (inline) inside the closure body. Move all imports (import datetime, import ast, etc.), constants, and helper functions inside the UDF function body so cloudpickle captures them by value. Do NOT replace working UDFs with built-in SQL functions; apply the minimal fix to make the closure self-contained.
- Tier 3: For complex UDFs that call many tightly-coupled helper functions in the same file, use the factory function pattern (to capture data in closures) and __module__ = "__main__" patching (to force serialization by value) on the UDF and all helper functions in its call chain.

```python
# Example: Tier 3 — factory + __module__ patching
# helper_a, helper_b, my_config, df, and output_schema are defined elsewhere in the workload.
def make_process_udf(config_dict):
    """Factory captures config in closure."""
    def process_udf(pdf):
        result = helper_a(pdf, config_dict)
        return helper_b(result)
    return process_udf

process_udf = make_process_udf(my_config)
# Patch the UDF and every helper in its call chain so cloudpickle serializes them by value.
for _fn in [process_udf, helper_a, helper_b]:
    _fn.__module__ = "__main__"

result = df.groupby("key").applyInPandas(process_udf, schema=output_schema)
```
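
For comparison, a Tier 2 fix keeps the import, the constant, and the helper inside the function body. The sketch below is illustrative only: `parse_event_date` is a hypothetical function, and the `udf(...)` wrapping is left out so the block runs without a Spark session.

```python
def parse_event_date(raw):
    """Tier 2 style: everything the UDF needs lives inside its own body."""
    import datetime  # imported inside the body, not at module level

    date_format = "%Y-%m-%d"  # constant kept local instead of module-level

    def clean(value):  # helper nested inside instead of defined in the module
        return value.strip()

    if raw is None:
        return None
    return datetime.datetime.strptime(clean(raw), date_format).date().isoformat()


# In the workload this function would then be wrapped as before,
# e.g. udf(parse_event_date, StringType()); that call is omitted here.
print(parse_event_date(" 2024-01-02 "))
```

Because the function no longer refers to anything in the workload module, there is nothing for the server-side worker to look up by name.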

To check Anaconda availability:
```sql
SELECT * FROM INFORMATION_SCHEMA.PACKAGES
WHERE LANGUAGE = 'python' AND PACKAGE_NAME ILIKE '%<package>%';
```

To use PyPI:
```python
spark.conf.set("snowpark.connect.artifact_repository", "snowflake.snowpark.pypi_shared_repository")
spark.conf.set("snowpark.connect.udf.packages", "[package1, package2]")
```

Step 3 Summary:
  Files with fixes applied: N
  Files with no issues:     M
  Total in manifest:        N + M  ← must match Step 2.1 count

Step 4 Progress:
  [x] src/etl/transformations.py  — imports updated, session updated
  [x] src/etl/loader.py           — imports updated, no session creation
  [ ] src/utils/helpers.py         — pending
  [ ] src/config.py                — pending
  [ ] src/__init__.py              — pending (no PySpark imports, header only)

from snowflake import snowpark_connect

spark = snowpark_connect.init_spark_session()

# BEFORE
from pyspark.sql import SparkSession
from databricks.connect import DatabricksSession
from databricks.sdk.runtime import dbutils

# AFTER
from pyspark.sql import SparkSession
# databricks imports removed - not supported in SCOS

"""
SCOS Migration Output
=====================
Source File: [Insert original file path, e.g., $ARGUMENTS/filename.py]
Migrated on: [Insert Current Date, e.g., 2023-10-27]

Changes Overview:
- [Lines 10-12] Replaced legacy SparkSession initialization with snowpark_connect.
- [Lines 45-50] Updated import statements to use Spark Connect equivalents.
- [Lines 88-92] [Description of another fix applied]

Known Limitations:
- [List every # SCOS: TODO item in this file, with line numbers and descriptions]
- [If none, write "None — all issues resolved"]
"""

"""
SCOS Migration Output
=====================
Source File: $ARGUMENTS/filename.py
Migrated on: [Current Date]

Changes Overview:
- No compatibility issues detected. No changes required.

Known Limitations:
- None — all issues resolved
"""
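
Both header variants can be produced from one helper. This is a minimal sketch, assuming changes and TODO items have already been collected as `(line_range, description)` tuples; `build_header` and `NO_LIMITATIONS` are hypothetical names, and the wording is copied from the two templates above.

```python
import datetime

# "\u2014" is the dash character used in the template's "all issues resolved" line.
NO_LIMITATIONS = "None \u2014 all issues resolved"


def build_header(source_path, changes, todos, migrated_on=None):
    """Render the migration header docstring.

    changes and todos are lists of (line_range, description) tuples.
    """
    migrated_on = migrated_on or datetime.date.today()
    lines = [
        '"""',
        "SCOS Migration Output",
        "=====================",
        f"Source File: {source_path}",
        f"Migrated on: {migrated_on.isoformat()}",
        "",
        "Changes Overview:",
    ]
    if changes:
        lines += [f"- [Lines {span}] {desc}" for span, desc in changes]
    else:
        lines.append("- No compatibility issues detected. No changes required.")
    lines += ["", "Known Limitations:"]
    if todos:
        lines += [f"- [Lines {span}] {desc}" for span, desc in todos]
    else:
        lines.append(f"- {NO_LIMITATIONS}")
    lines.append('"""')
    return "\n".join(lines) + "\n"


header = build_header(
    "src/config.py", changes=[], todos=[], migrated_on=datetime.date(2026, 3, 26)
)
print(header)
```

The marker line "SCOS Migration Output" lands on the second line of the file, inside the first five lines.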

Step 5 Progress:
  [x] src/etl/transformations.py  — header added (3 changes, 1 TODO)
  [x] src/etl/loader.py           — header added (1 change)
  [ ] src/utils/helpers.py         — pending
  [ ] src/config.py                — pending
  [ ] src/__init__.py              — pending

Syntax Check: Run a syntax check on ALL files in the manifest to ensure no parse errors were introduced.

# For a single file:
python3 -m py_compile "${ARGUMENTS%.py}_scos.py"

# For a directory (check ALL .py files):
find "${ARGUMENTS}_scos" -name "*.py" -type f -exec python3 -m py_compile {} \;

Per-File Review: For EACH file in the manifest, verify:
- All imports are correct (no mixed pyspark.sql and pyspark.sql.connect for the same classes).
- The snowpark_connect initialization is present (in files that create sessions).
- The migration header docstring is present at the top of the file.
- No critical TODO items remain that block execution.

Completeness Gate: Compare the manifest against the final state. This check is mandatory and MUST pass before proceeding.

# Count files in original vs migrated
echo "Original: $(find "$ARGUMENTS" -name '*.py' -type f | wc -l) files"
echo "Migrated: $(find "${ARGUMENTS}_scos" -name '*.py' -type f | wc -l) files"

# Verify every migrated file has a migration header
find "${ARGUMENTS}_scos" -name "*.py" -type f | sort | while IFS= read -r f; do
  if head -5 "$f" | grep -q "SCOS Migration Output"; then
    echo "✓ $f"
  else
    echo "✗ $f — MISSING MIGRATION HEADER"
  fi
done

If any file is missing its migration header, go back and add it before proceeding. The migration is not complete until every .py file passes this check.
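
Where a shell is not convenient, the syntax check and the header check can be combined in Python. This is a sketch only, not one of the skill's scripts: `verify_file` and `verify_tree` are hypothetical helpers, and the demo files are created in a temporary directory purely for illustration.

```python
import tempfile
from pathlib import Path

HEADER_MARKER = "SCOS Migration Output"


def verify_file(path):
    """Return a list of problems found in one migrated file."""
    problems = []
    text = Path(path).read_text(encoding="utf-8")
    try:
        compile(text, str(path), "exec")  # compile only; nothing is executed
    except SyntaxError as exc:
        problems.append(f"syntax error: {exc.msg}")
    # Same window as the shell check: the marker must be in the first 5 lines.
    if not any(HEADER_MARKER in line for line in text.splitlines()[:5]):
        problems.append("missing migration header")
    return problems


def verify_tree(root):
    """Map each failing .py file (relative path) to its problems."""
    root = Path(root)
    report = {}
    for path in sorted(root.rglob("*.py")):
        problems = verify_file(path)
        if problems:
            report[path.relative_to(root).as_posix()] = problems
    return report


# Demo with two throwaway files: one valid, one broken and missing its header.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "good.py").write_text(
        '"""\nSCOS Migration Output\n"""\nx = 1\n', encoding="utf-8"
    )
    (demo_root / "bad.py").write_text("def broken(:\n    pass\n", encoding="utf-8")
    report = verify_tree(demo_root)

print(report)
```

An empty report means every file compiled and carries the header; any entry means the gate has not passed.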

Migration complete. Would you like to validate the migrated workload
by running it end-to-end with synthetic data?

This will smoke-test the _scos code against a live SCOS session to
verify it runs without errors.

| Unsupported Import | Action |
|---|---|
| `databricks.connect` | Remove; use `snowpark_connect` in the entry point |
| `databricks.sdk.runtime` | Remove |
| `delta.tables` | Remove; Delta format is not supported |
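
Removing these imports can be sketched as a line filter. The helper below is hypothetical and deliberately simple: it handles single-line `import x` and `from x import y` statements only, so parenthesized multi-line imports and `import a, b` lists still need manual review.

```python
import re

# Module prefixes from the table of unsupported imports.
UNSUPPORTED = ("databricks.connect", "databricks.sdk.runtime", "delta.tables")
_IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([\w.]+)")


def strip_unsupported_imports(source):
    """Return (cleaned_source, removed_lines) for single-line import statements."""
    kept, removed = [], []
    for line in source.splitlines():
        match = _IMPORT_RE.match(line)
        module = match.group(1) if match else ""
        if any(module == name or module.startswith(name + ".") for name in UNSUPPORTED):
            removed.append(line.strip())
        else:
            kept.append(line)
    return "\n".join(kept), removed


original = (
    "from pyspark.sql import SparkSession\n"
    "from databricks.connect import DatabricksSession\n"
    "from databricks.sdk.runtime import dbutils\n"
)
cleaned, removed = strip_unsupported_imports(original)
print(cleaned)
print(removed)
```

The removed lines are returned so they can be listed in the Changes Overview of the migration header. Code that used the removed names, such as dbutils calls, still has to be rewritten separately.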

Migrate PySpark to SCOS

When to Load

Arguments

Prerequisites

uv Package Manager

Snowflake Connection

RAG Knowledge Base

Tools

Tool: analyze_pyspark.py

Workflow

Step 0: Setup RAG Resources (One-Time)

Step 1: Analyze the Workload

Step 2: Create Migration Copy and File Manifest

2.1 Build the file manifest

2.2 Map analysis issues to files

Step 3: Apply Fixes from the Analysis output

Issue Processing Checklist

Files with No Issues

Step 4: Update Imports and Session Creation

4.1 Update Session Initialization

4.2 Remove Unsupported Imports

Step 5: Add Migration Header

Step 6: Verify Migration

Step 7: Offer Validation

Success Criteria

Troubleshooting

Output
