Compute derived, aggregated, or transformed tables from existing datasets. Use this skill when the user needs to "compute monthly scores", "aggregate by month", "create a summary table", "derive risk scores", "compute percentile ranks", "roll up events", "create benchmarks from raw data", "add a computed column", or bridge the gap between raw generated tables and downstream analytics. Works on xlsx, csv, or json input. Claude writes the computation logic; the script handles data I/O.
Derive aggregated, scored, or transformed tables from existing datasets. This skill bridges the
gap between raw synthetic data (produced by synthdata-generate) and the derived metrics that
downstream tools expect — monthly rollups, composite scores, percentile ranks, segment summaries.
```
pip install openpyxl pandas numpy --break-system-packages
```
Ask the user:
Run the script in inspect mode to understand what we're working with:
```
python3 scripts/compute.py --inspect --input data.xlsx
```
This prints each table's name, column names with dtypes, row count, and a 3-row sample. Use this to understand the schema before writing computation logic.
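Conceptually, inspect mode boils down to printing a schema plus a small head sample for each table. A minimal sketch of that idea (illustrative only, not the actual `scripts/compute.py` implementation; the function name `describe_tables` is hypothetical):

```python
import pandas as pd

def describe_tables(tables: dict) -> str:
    """Render each table's name, row count, column dtypes, and a 3-row sample."""
    lines = []
    for name, df in tables.items():
        lines.append(f"Table: {name} ({len(df)} rows)")
        for col, dtype in df.dtypes.items():
            lines.append(f"  {col}: {dtype}")
        lines.append(df.head(3).to_string(index=False))
    return "\n".join(lines)
```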
Write a Python file that operates on pre-loaded DataFrames. The script provides:
- `tables` — dict mapping table names to pandas DataFrames (all input tables pre-loaded)
- `pd` — pandas
- `np` — numpy
- `result` — empty dict; assign `{name: DataFrame}` pairs for each output table

Example computation file (`monthly_risk.py`):
```python
import pandas as pd
import numpy as np

users = tables['users']
threats = tables['threat_events']
sims = tables['phishing_sims']
training = tables['training']
dlp = tables['dlp_events']
abuse = tables['abuse_mailbox']

# Extract month from event dates
threats['month'] = pd.to_datetime(threats['event_date']).dt.to_period('M')
sims['month'] = pd.to_datetime(sims['sim_date']).dt.to_period('M') if 'sim_date' in sims else ...

# Group and aggregate per user per month
# ... (Claude writes the full logic based on the user's requirements)

result['monthly_risk'] = monthly_risk_df
```
Save the computation to a .py file, then execute it.
```
python3 scripts/compute.py --input data.xlsx --code monthly_risk.py --output data_with_scores.xlsx
```
CLI flags:
| Flag | Description |
|---|---|
| `--input` | Source dataset (xlsx / csv-dir / json) |
| `--output` | Output path (default: `<input>_computed.<ext>`) |
| `--inspect` | Print table schemas and exit |
| `--code` | Path to a Python file with computation logic |
| `--expr` | Inline Python expression (for simple one-liners) |
| `--append` | Add computed tables to a copy of the input file (xlsx only) |
Inline mode for simple computations:
```
python3 scripts/compute.py --input data.xlsx --output scored.xlsx \
  --expr "result['summary'] = tables['events'].groupby('category').size().reset_index(name='count')"
```
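Both `--code` and `--expr` share the same contract: user code runs with `tables`, `pd`, `np`, and an empty `result` in scope, and anything assigned into `result` becomes an output table. A minimal model of that contract (an assumption about how `scripts/compute.py` behaves, not its actual source; `run_computation` is a hypothetical name):

```python
import pandas as pd
import numpy as np

def run_computation(code: str, tables: dict) -> dict:
    # Execute user code in a namespace pre-loaded with the documented globals.
    namespace = {'tables': tables, 'pd': pd, 'np': np, 'result': {}}
    exec(code, namespace)
    return namespace['result']

events = pd.DataFrame({'category': ['phish', 'phish', 'malware']})
out = run_computation(
    "result['summary'] = tables['events'].groupby('category')"
    ".size().reset_index(name='count')",
    {'events': events},
)
```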
After execution, the script prints:
Review the output with the user before proceeding.
Use these patterns as starting points when writing computation logic.
```python
# Monthly rollup per user
events = tables['threat_events']
events['month'] = pd.to_datetime(events['event_date']).dt.to_period('M').astype(str)
monthly = events.groupby(['user_id', 'month']).agg(
    threat_events=('event_id', 'count'),
    real_clicks=('clicked', 'sum'),
    credentials_entered=('credentials_entered', 'sum'),
).reset_index()
result['monthly_threats'] = monthly
```
```python
# Weighted composite score with tiers and percentile ranks
# Weights must sum to 1.0
WEIGHTS = {'threat': 0.35, 'sim': 0.25, 'dlp': 0.20, 'training': 0.10, 'reporting': 0.10}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9

df['composite_risk_score'] = (
    df['threat_score'] * WEIGHTS['threat']
    + df['sim_score'] * WEIGHTS['sim']
    + df['dlp_score'] * WEIGHTS['dlp']
    + df['training_score'] * WEIGHTS['training']
    + df['reporting_score'] * WEIGHTS['reporting']
)

def assign_tier(score):
    if score >= 75: return 'Critical'
    if score >= 50: return 'High'
    if score >= 25: return 'Medium'
    return 'Low'

df['risk_tier'] = df['composite_risk_score'].apply(assign_tier)
df['percentile'] = df['composite_risk_score'].rank(pct=True) * 100
```
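A quick worked check of the weighting and tier cutoffs above, assuming component scores on a 0–100 scale (the scale and the sample values are assumptions for illustration, not stated elsewhere in this skill):

```python
WEIGHTS = {'threat': 0.35, 'sim': 0.25, 'dlp': 0.20, 'training': 0.10, 'reporting': 0.10}

def assign_tier(score):
    if score >= 75: return 'Critical'
    if score >= 50: return 'High'
    if score >= 25: return 'Medium'
    return 'Low'

# Hypothetical per-component scores for one user
row = {'threat': 90, 'sim': 80, 'dlp': 70, 'training': 60, 'reporting': 50}
composite = sum(row[k] * w for k, w in WEIGHTS.items())
# 90*0.35 + 80*0.25 + 70*0.20 + 60*0.10 + 50*0.10 = 76.5, which lands in 'Critical'
```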
```python
# Segment benchmark summaries
peers = tables['peers_detail']
metrics = ['sim_click_rate', 'training_completion', 'composite_risk_score']
rows = []
for segment_col in ['segment_healthcare', 'segment_geo', 'segment_size']:
    cohort = peers[peers[segment_col] == 1]
    for m in metrics:
        rows.append({
            'segment': segment_col, 'metric': m,
            'mean': cohort[m].mean(), 'median': cohort[m].median(),
            'p25': cohort[m].quantile(0.25), 'p75': cohort[m].quantile(0.75),
        })
result['segment_summaries'] = pd.DataFrame(rows)
```
```python
# Department-level rollup via join
users = tables['users']
risk = tables['monthly_risk']
merged = risk.merge(users[['user_id', 'department']], on='user_id')
dept = merged.groupby('department').agg(
    avg_risk=('composite_risk_score', 'mean'),
    critical_count=('risk_tier', lambda x: (x == 'Critical').sum()),
    user_count=('user_id', 'nunique'),
).sort_values('avg_risk', ascending=False).reset_index()
result['department_risk'] = dept
```
| Need | Skill |
|---|---|
| Generate new rows from a schema/template | synthdata-generate |
| Extract Excel → JSON | synthdata-extract |
| Add rows/columns to existing data | synthdata-extend |
| Replace PII with fake values | synthdata-anonymize |
| Aggregate, score, rank, or derive new tables from existing data | synthdata-compute |