Skill File

Data Join

Name: Data Join
Author: dathere

Join two datasets with automatic strategy selection (joinp vs join vs sqlp)

dathere3,600 starsApr 14, 2026

Occupation: Data Scientists
Categories: Data Analysis

Skill Content

Join two tabular data files on common columns.

Cowork note: If relative paths don't resolve, call mcp__qsv__qsv_get_working_dir and mcp__qsv__qsv_set_working_dir to sync the working directory.

Strategy Selection

Scenario	Best Tool	Why
Standard equi-join	`mcp__qsv__qsv_joinp`	Polars engine, fastest
Non-equi join (>, <, BETWEEN)	`mcp__qsv__qsv_sqlp`	SQL supports complex conditions
Cross join / cartesian	`mcp__qsv__qsv_sqlp`	`CROSS JOIN` syntax
Memory-constrained	`mcp__qsv__qsv_command` with `command: "join"`	Streaming, lower memory
Fuzzy/approximate match

Related Skills

Data Join | Skills Pool

Index both files: Run mcp__qsv__qsv_index on both files for fast random access.
Inspect both files: Run mcp__qsv__qsv_headers on both files to identify column names. Determine which columns to join on.
Profile join columns: Run mcp__qsv__qsv_stats with cardinality: true, stats_jsonl: true on both files. Check the cardinality of join columns to determine optimal table order.
Choose strategy:
- If cardinality of join column in file1 > file2, put file1 on the left
- For joinp: smaller cardinality table should be on the right for best performance
- If join condition is complex (non-equi), use mcp__qsv__qsv_sqlp
- If join involves date/time matching where exact dates won't align (e.g., quarterly to monthly, event dates to nearest reporting period), use mcp__qsv__qsv_joinp with asof: true
Execute join: Use mcp__qsv__qsv_joinp for standard joins:
```
joinp
  columns1: "id"
  input1: "file1.csv"
  columns2: "id"
  input2: "file2.csv"
  # Join type: omit for inner (default), or set one of:
  # left: true, full: true, cross: true
```
Or use mcp__qsv__qsv_sqlp for complex joins:
```
SELECT a.*, b.col1, b.col2
FROM file1 a
JOIN file2 b ON a.id = b.id AND a.date BETWEEN b.start_date AND b.end_date
```
For ASOF (nearest-match) joins, use mcp__qsv__qsv_joinp with asof: true:
```
joinp
  columns1: "date"
  input1: "events.csv"
  columns2: "date"
  input2: "reference.csv"
  asof: true
  strategy: "backward"
  allow_exact_matches: true
```
- strategy: "backward" (default) — match to the last right row with key < left key
- strategy: "forward" — match to the first right row with key > left key
- strategy: "nearest" — match to the numerically closest row (supports tolerance parameter)
- Add left_by/right_by parameters to restrict matching within subgroups (e.g., per jurisdiction)
- Add allow_exact_matches: true to include equal keys (<=, >=); default is strict inequality (<, >)
Clean up result: Use mcp__qsv__qsv_select to remove duplicate join columns or unnecessary columns from the result.
Verify: Run mcp__qsv__qsv_count on the result. Compare with input counts to validate join behavior:
- Inner join: result <= min(left, right)
- Left join: result >= left count
- Full outer: result >= max(left, right)
- ASOF: result = left count (every left row gets a match or null, like a left join)

Check	Stats Column	Red Flag	Action
Type match	`type`	Join columns have different types (e.g., Integer vs String)	Cast one column before joining: `sqlp` with `CAST(col AS INTEGER)`
Null density	`nullcount`, `sparsity`	sparsity > 0.3 on join column	Nulls don't match — expect unmatched rows; consider filtering nulls first
Value overlap	`min`, `max`	Non-overlapping ranges across files	No rows will match — verify correct join column
Skew detection	`mode`, `mode_count`	One value dominates (mode_count > 50% of rows)	Join will be heavily skewed many-to-one; verify this is expected
Uniqueness	`uniqueness_ratio`	Both files have uniqueness_ratio < 1.0 on join column	Many-to-many join risk — expect row explosion; verify with `mcp__qsv__qsv_count` after
Outlier keys	`outliers_percentage`	outliers_percentage > 5% on numeric join column	Outlier keys may not match across files; consider trimming first

Type	`joinp` Flag	SQL	Behavior
Inner	(default)	`JOIN`	Only matching rows
Left	`--left`	`LEFT JOIN`	All left + matching right
Full outer	`--full`	`FULL OUTER JOIN`	All rows from both
Cross	`--cross`	`CROSS JOIN`	Cartesian product
Left Anti	`--left-anti`	`NOT IN` / `NOT EXISTS`	Left rows without match
Left Semi	`--left-semi`	`EXISTS`	Left rows with match (no right cols)
ASOF	`--asof`	(use joinp)	Nearest-key match (temporal/numeric)

Data Join

Strategy Selection

Data Join

Strategy Selection

Steps

Join Column Validation Checklist

Join Types

Notes

Visualization Expert

Data Analyst

Huggingface Hub

Multi Reviewer Patterns

Dbt Transformation Patterns

Startup Financial Modeling