Database driver selection and bulk insert optimization for DataSurface. Covers psycopg2 vs psycopg3, pyodbc fast_executemany, SQLAlchemy bypasses, and the execute_fast_insert pattern. Includes measured benchmarks.
Database driver choice and insert method have a dramatic impact on DataSurface throughput. The wrong combination can be 10-25x slower. This guide covers what we measured, what we chose, and why.
All benchmarks were measured on the same hardware, inserting 10,000 rows into a staging table:
| Database | Driver | Method | Rows/sec | Notes |
|---|---|---|---|---|
| PostgreSQL | psycopg2 | SA executemany | 5,500 | One round-trip per row |
| PostgreSQL | psycopg3 | SA executemany | 51,700 | Proper batching via pipeline |
| PostgreSQL | psycopg3 | Multi-row VALUES | 169,000 | Fastest but requires string formatting |
| SQL Server | pyodbc | SA text() executemany | 3,263 | SA bypasses fast_executemany |
| SQL Server | pyodbc | Raw cursor fast_executemany=True | 85,235 | Must drop to raw DBAPI cursor |
| MySQL | pymysql | SA executemany | ~2,000 | Pure Python driver, inherently slow |
| MySQL | pymysql | Multi-row VALUES | ~15,000 | Only DB still using VALUES path |
psycopg2's executemany sends one INSERT per row — one network round-trip each. At 5,500 rows/sec over a LAN, this is the bottleneck for any batch larger than a few hundred rows.
psycopg3 uses pipelining: it batches multiple operations into a single network round-trip. SQLAlchemy's executemany with psycopg3 achieves 51,700 rows/sec with no code changes.
```python
# In PostgresDatabase adapter
def get_driver_name(self) -> str:
    return "postgresql+psycopg"  # NOT "postgresql" (psycopg2)
```
Requirements:
```
psycopg[binary]==3.2.9  # NOT psycopg2-binary
```
psycopg3 must be available in the Airflow worker image because DataSurface code is imported at DAG parse time. If using the community Airflow Helm chart, build a custom image:
```dockerfile
FROM apache/airflow:3.1.8
RUN pip install psycopg[binary]==3.2.9
```
pyodbc has a fast_executemany flag that batches parameter sets into a single round-trip. However, SQLAlchemy's text() executemany does not use it — even when fast_executemany=True is set in connect_args.
SQLAlchemy's text().executemany() path calls the DBAPI cursor's executemany without the optimization. You must drop to the raw DBAPI cursor to get the fast path.
DataSurface uses a shared utility execute_fast_insert() in database_operations.py that:
- detects pyodbc connections via `is_pyodbc_connection()`
- for pyodbc: obtains the raw `dbapi_connection.cursor()`, sets `fast_executemany = True`, and executes the batch through that cursor
- for all other drivers: falls back to `connection.execute(text(...), params)`

```python
from datasurface.platforms.yellow.database_operations import execute_fast_insert, is_pyodbc_connection
```
```python
# Usage in merge/ingestion code:
execute_fast_insert(
    connection=connection,
    sql=insert_sql,        # INSERT INTO ... VALUES (:col1, :col2, ...)
    params=list_of_dicts,  # [{"col1": v1, "col2": v2}, ...]
    logger=logger
)
```
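The helper's internals can be sketched roughly as follows. This is a simplified assumption of what `database_operations.py` does, not its actual code; in particular, the `to_qmark()` placeholder-conversion helper and the pyodbc-detection logic are illustrative:

```python
import re
from typing import Any, Dict, List, Tuple

def to_qmark(sql: str, params: List[Dict[str, Any]]) -> Tuple[str, List[tuple]]:
    """Convert named ':col' placeholders to pyodbc's positional '?' style."""
    names = re.findall(r":(\w+)", sql)
    qmark_sql = re.sub(r":\w+", "?", sql)
    rows = [tuple(p[n] for n in names) for p in params]
    return qmark_sql, rows

def execute_fast_insert(connection, sql: str, params: List[Dict[str, Any]], logger=None) -> None:
    """Raw-cursor fast_executemany for pyodbc; plain SQLAlchemy executemany otherwise."""
    # Unwrap the SQLAlchemy Connection to reach the DBAPI connection.
    # (Real detection may differ; this module-name check is a stand-in.)
    raw = getattr(connection, "connection", connection)
    if type(raw).__module__.split(".")[0] == "pyodbc":
        qmark_sql, rows = to_qmark(sql, params)
        cursor = raw.cursor()
        try:
            cursor.fast_executemany = True  # batch all parameter sets into one round-trip
            cursor.executemany(qmark_sql, rows)
        finally:
            cursor.close()
    else:
        # Non-pyodbc drivers: let the driver's native executemany handle batching.
        from sqlalchemy import text
        connection.execute(text(sql), params)
```

The key design point is that `cursor.fast_executemany = True` is set on the *raw* DBAPI cursor before `executemany` runs; setting it anywhere SQLAlchemy's `text()` path can't see it has no effect.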
Multi-row VALUES (`INSERT INTO t VALUES (1,'a'),(2,'b'),...`) requires converting Python values to SQL string literals. This is slower than fast_executemany for pyodbc (85K vs ~40K rows/sec): pyodbc's fast_executemany sends binary parameter arrays, so no string conversion is needed.
pymysql is a pure-Python driver with no fast_executemany equivalent. Its executemany is extremely slow (~2,000 rows/sec). For MySQL, DataSurface uses the multi-row VALUES path via format_sql_value():
```python
# MySQL adapter signals this:
def supports_batch_values_insert(self) -> bool:
    return True  # Use VALUES string building
```
All other database adapters return False (use native driver executemany).
The format_sql_value() function in database_operations.py converts Python types to SQL string literals. It handles:
- `None` → `NULL`
- `str` → escaped and quoted
- `datetime` → ISO-format string, quoted
- `Decimal`, `int`, `float` → numeric literal
- `bytes` → hex literal
- `uuid.UUID` → string, quoted
- `bool` → `1`/`0`

This function is only used for MySQL. All other databases use parameterized queries via their native driver.
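The mappings above can be sketched as follows. This is an illustrative reimplementation of the `format_sql_value()` idea, not DataSurface's actual function, and the `build_values_insert()` helper is a hypothetical consumer of it:

```python
import uuid
from datetime import datetime
from decimal import Decimal
from typing import Any, Dict, List, Sequence

def format_sql_value(value: Any) -> str:
    """Convert a Python value to a MySQL-compatible SQL literal string."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # must precede the numeric check: bool subclasses int
        return "1" if value else "0"
    if isinstance(value, (int, float, Decimal)):
        return str(value)
    if isinstance(value, datetime):
        return f"'{value.isoformat()}'"
    if isinstance(value, bytes):
        return "0x" + value.hex()  # hex literal
    if isinstance(value, uuid.UUID):
        return f"'{value}'"
    # Strings: escape embedded single quotes by doubling them, then quote.
    return "'" + str(value).replace("'", "''") + "'"

def build_values_insert(table: str, columns: Sequence[str], rows: List[Dict[str, Any]]) -> str:
    """Build one multi-row INSERT ... VALUES statement from formatted literals."""
    tuples = ", ".join(
        "(" + ", ".join(format_sql_value(r[c]) for c in columns) + ")" for r in rows
    )
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES {tuples}"
```

Note the `bool` check before the numeric one: since `bool` is a subclass of `int` in Python, the order of `isinstance` checks matters.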
| Database | SQLAlchemy Driver | Bulk Insert Method | Adapter Flag |
|---|---|---|---|
| PostgreSQL | postgresql+psycopg | Native executemany (psycopg3 pipeline) | supports_batch_values_insert = False |
| SQL Server | mssql+pyodbc | Raw cursor fast_executemany=True | supports_batch_values_insert = False |
| MySQL | mysql+pymysql | Multi-row VALUES via format_sql_value | supports_batch_values_insert = True |
| Oracle | oracle+oracledb | Native executemany | supports_batch_values_insert = False |
| DB2 | db2+ibm_db | Native executemany | supports_batch_values_insert = False |
| Snowflake | snowflake | Native executemany | supports_batch_values_insert = False |
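Putting the adapter flag together, path selection in an ingestion step might look like the sketch below. `bulk_insert`, the adapter shape, and the default `repr` formatter are illustrative assumptions, not DataSurface's actual API; the function returns the SQL it would run so the dispatch logic is visible:

```python
from typing import Any, Callable, Dict, List, Sequence

def bulk_insert(
    adapter: Any,
    table: str,
    columns: Sequence[str],
    rows: List[Dict[str, Any]],
    format_value: Callable[[Any], str] = repr,  # stand-in for format_sql_value()
) -> str:
    """Return the SQL statement the chosen insert path would execute."""
    if adapter.supports_batch_values_insert():
        # MySQL path: one multi-row VALUES statement built from string literals.
        tuples = ", ".join(
            "(" + ", ".join(format_value(r[c]) for c in columns) + ")" for r in rows
        )
        return f"INSERT INTO {table} ({', '.join(columns)}) VALUES {tuples}"
    # Everyone else: a parameterized statement handed to the driver's
    # native executemany (psycopg3 pipeline, pyodbc fast_executemany, etc.).
    placeholders = ", ".join(f":{c}" for c in columns)
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
```

The flag keeps string building quarantined to the one driver that needs it; every other database stays on parameterized queries.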
text().executemany() bypasses pyodbc's fast_executemany. Always benchmark.