Data processing and analysis with pandas, Polars, NumPy, and data pipeline patterns. Trigger when user mentions pandas, polars, numpy, dataframe, data pipeline, ETL, data wrangling, data cleaning, data transformation, CSV processing, parquet, arrow, columnar data, lazy evaluation, groupby, aggregation. Also trigger when user asks about data analysis, time series processing, or choosing between pandas and polars.
```python
import polars as pl

# Read CSV → filter → aggregate → output
result = (
    pl.scan_csv("sales.csv")             # lazy scan
    .filter(pl.col("amount") > 100)      # filter rows
    .group_by("category")                # group
    .agg(
        pl.col("amount").sum().alias("total"),
        pl.len().alias("count"),
    )
    .sort("total", descending=True)
    .collect()                           # triggers execution
)
print(result)
pandas 2.x introduced a PyArrow backend, which substantially improves memory efficiency and type support:
```python
import pandas as pd

# Use the Arrow backend (lower memory use, richer type support)
df = pd.read_csv("data.csv", dtype_backend="pyarrow")

# Copy-on-Write (enabled by default in pandas 3.0)
pd.options.mode.copy_on_write = True
df2 = df[["col_a", "col_b"]]  # no copy yet; copies only on modification
df2["col_a"] = 0              # the copy is triggered here
```
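The Copy-on-Write semantics above can be verified on a tiny in-memory frame (the column names and values below are placeholders):

```python
import pandas as pd

pd.options.mode.copy_on_write = True

df = pd.DataFrame({"col_a": [1, 2], "col_b": [3, 4]})
df2 = df[["col_a", "col_b"]]  # no data copied yet under CoW
df2["col_a"] = 0              # copy happens now; df is untouched

print(df["col_a"].tolist())   # original values preserved
print(df2["col_a"].tolist())  # modified copy
```

Without CoW, `df2` might share memory with `df` and the assignment could raise a `SettingWithCopyWarning` or mutate the original.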
Polars' lazy mode enables query optimization:
```python
import polars as pl

# Lazy API: build a query plan, execute only at the end
lazy_df = (
    pl.scan_parquet("events/*.parquet")
    .with_columns(
        pl.col("timestamp").dt.year().alias("year"),
        (pl.col("price") * pl.col("quantity")).alias("revenue"),
    )
    .filter(pl.col("year") >= 2024)
    .group_by("product_id")
    .agg(pl.col("revenue").sum())
)

# Inspect the optimized query plan
print(lazy_df.explain())

# Execute the query
result = lazy_df.collect()
```
NumPy provides efficient array operations and underpins the entire data ecosystem:
```python
import numpy as np

# Vectorization vs. Python loops: typically 10-100x faster