Join two datasets with automatic strategy selection (joinp vs join vs sqlp)
Join two tabular data files on common columns.
Cowork note: If relative paths don't resolve, call
mcp__qsv__qsv_get_working_dirandmcp__qsv__qsv_set_working_dirto sync the working directory.
| Scenario | Best Tool | Why |
|---|---|---|
| Standard equi-join | mcp__qsv__qsv_joinp | Polars engine, fastest |
| Non-equi join (>, <, BETWEEN) | mcp__qsv__qsv_sqlp | SQL supports complex conditions |
| Cross join / cartesian | mcp__qsv__qsv_sqlp | CROSS JOIN syntax |
| Memory-constrained | mcp__qsv__qsv_command with command: "join" | Streaming, lower memory |
| Fuzzy/approximate match |
mcp__qsv__qsv_joinp with asof: true |
| Nearest-match join |
Index both files: Run mcp__qsv__qsv_index on both files for fast random access.
Inspect both files: Run mcp__qsv__qsv_headers on both files to identify column names. Determine which columns to join on.
Profile join columns: Run mcp__qsv__qsv_stats with cardinality: true, stats_jsonl: true on both files. Check the cardinality of join columns to determine optimal table order.
Choose strategy:
joinp: smaller cardinality table should be on the right for best performancemcp__qsv__qsv_sqlpmcp__qsv__qsv_joinp with asof: trueExecute join: Use mcp__qsv__qsv_joinp for standard joins:
joinp
columns1: "id"
input1: "file1.csv"
columns2: "id"
input2: "file2.csv"
# Join type: omit for inner (default), or set one of:
# left: true, full: true, cross: true
Or use mcp__qsv__qsv_sqlp for complex joins:
SELECT a.*, b.col1, b.col2
FROM file1 a
JOIN file2 b ON a.id = b.id AND a.date BETWEEN b.start_date AND b.end_date
For ASOF (nearest-match) joins, use mcp__qsv__qsv_joinp with asof: true:
joinp
columns1: "date"
input1: "events.csv"
columns2: "date"
input2: "reference.csv"
asof: true
strategy: "backward"
allow_exact_matches: true
strategy: "backward" (default) — match to the last right row with key < left keystrategy: "forward" — match to the first right row with key > left keystrategy: "nearest" — match to the numerically closest row (supports tolerance parameter)left_by/right_by parameters to restrict matching within subgroups (e.g., per jurisdiction)allow_exact_matches: true to include equal keys (<=, >=); default is strict inequality (<, >)Clean up result: Use mcp__qsv__qsv_select to remove duplicate join columns or unnecessary columns from the result.
Verify: Run mcp__qsv__qsv_count on the result. Compare with input counts to validate join behavior:
Before executing a join, read .stats.csv for both files and validate:
| Check | Stats Column | Red Flag | Action |
|---|---|---|---|
| Type match | type | Join columns have different types (e.g., Integer vs String) | Cast one column before joining: sqlp with CAST(col AS INTEGER) |
| Null density | nullcount, sparsity | sparsity > 0.3 on join column | Nulls don't match — expect unmatched rows; consider filtering nulls first |
| Value overlap | min, max | Non-overlapping ranges across files | No rows will match — verify correct join column |
| Skew detection | mode, mode_count | One value dominates (mode_count > 50% of rows) | Join will be heavily skewed many-to-one; verify this is expected |
| Uniqueness | uniqueness_ratio | Both files have uniqueness_ratio < 1.0 on join column | Many-to-many join risk — expect row explosion; verify with mcp__qsv__qsv_count after |
| Outlier keys | outliers_percentage | outliers_percentage > 5% on numeric join column | Outlier keys may not match across files; consider trimming first |
| Type | joinp Flag | SQL | Behavior |
|---|---|---|---|
| Inner | (default) | JOIN | Only matching rows |
| Left | --left | LEFT JOIN | All left + matching right |
| Full outer | --full | FULL OUTER JOIN | All rows from both |
| Cross | --cross | CROSS JOIN | Cartesian product |
| Left Anti | --left-anti | NOT IN / NOT EXISTS | Left rows without match |
| Left Semi | --left-semi | EXISTS | Left rows with match (no right cols) |
| ASOF | --asof | (use joinp) | Nearest-key match (temporal/numeric) |
joinp uses the Polars engine and is significantly faster than join for large filesjoinp optimize join executioncolumns1: "col1,col2"columns1: "id", columns2: "customer_id"joinp handles null values in join columns (nulls don't match by default)--try-parsedates — no need to pass it explicitly--left_by and --right_by (e.g., match nearest date per jurisdiction)--tolerance option (nearest strategy only) limits how far the nearest match can be: use duration strings for dates (1d, 30d, 365d) or positive integers for numeric keys--no-sort is set