Applies normalization, filtering, and deduplication transformations to a dataset, outputting the processed dataset with full transformation statistics. Use after cleaning, or when user says 'normalize data', 'filter dataset', 'deduplicate', 'apply transformations'.
Apply normalization, filtering, and deduplication transformations to a dataset. Produces a processed dataset with comprehensive statistics documenting every transformation applied.
```yaml
transform_request:
  dataset:
    path: "<path_to_dataset>"
    format: "jsonl" | "csv" | "json"
  transformations:                  # applied in order
    normalize:
      enabled: true
      rules:
        - field: "input"
          operations:
            - "strip_whitespace"
            - "normalize_unicode"   # NFC normalization
            - "lowercase"           # optional, task-dependent
            - "collapse_newlines"   # multiple \n -> single \n
        - field: "output"
          operations:
            - "strip_whitespace"
            - "normalize_unicode"
    filter:
      enabled: true
      rules:
        - name: "min_input_length"
          field: "input"
          condition: "min_length"
          value: 10
        - name: "max_output_length"
          field: "output"
          condition: "max_length"
          value: 2048
        - name: "language"
          field: "input"
          condition: "language_is"
          value: "en"               # requires language detection
        - name: "custom"
          field: "label"
          condition: "in_set"
          value: ["positive", "negative", "neutral"]
    deduplicate:
      enabled: true
      method: "exact_hash"          # or "minhash_lsh"
      fields: ["input"]
      keep: "first"
  output:
    path: "/research_memory/30-data/processed/<name>/"
```
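The filter rules in the config above can be evaluated per record. A minimal sketch, assuming string field values; `check_rule` and its return convention (True = record passes) are illustrative names, not part of the spec, and `language_is` / token-count conditions are omitted for brevity:

```python
import re

def check_rule(record: dict, rule: dict) -> bool:
    """Return True if the record passes the rule, False if it should be filtered out."""
    value = record.get(rule["field"], "")
    cond, target = rule["condition"], rule["value"]
    if cond == "min_length":
        return len(value) >= target
    if cond == "max_length":
        return len(value) <= target
    if cond == "in_set":
        return value in target
    if cond == "not_in_set":
        return value not in target
    if cond == "regex_match":
        return re.search(target, value) is not None
    if cond == "regex_exclude":
        return re.search(target, value) is None
    raise ValueError(f"unknown condition: {cond}")

rule = {"name": "min_input_length", "field": "input",
        "condition": "min_length", "value": 10}
print(check_rule({"input": "short"}, rule))           # False: 5 chars < 10
print(check_rule({"input": "a longer input"}, rule))  # True: 14 chars >= 10
```

A record is kept only if every enabled rule returns True; recording which rule failed first gives the per-record reason written to `filtered_out.jsonl`.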
If dataset path is missing: STOP. Report the error.
Load dataset: Read all records. Record initial count and baseline statistics.
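The load step can be sketched as follows for JSONL input; `load_jsonl` and `baseline_stats` are illustrative helper names, and the `input`/`output` field names follow the config example above:

```python
import json

def load_jsonl(path: str) -> list[dict]:
    """Read one JSON record per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def baseline_stats(records: list[dict]) -> dict:
    """Initial count and average field lengths, recorded before any transformation."""
    n = len(records)
    return {
        "count": n,
        "avg_input_chars": sum(len(r.get("input", "")) for r in records) / n if n else 0,
        "avg_output_chars": sum(len(r.get("output", "")) for r in records) / n if n else 0,
    }
```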
Apply normalization (if enabled):
- strip_whitespace: Remove leading/trailing whitespace
- normalize_unicode: Apply NFC normalization
- lowercase: Convert to lowercase
- collapse_newlines: Replace multiple consecutive newlines with a single newline
- remove_control_chars: Remove non-printable characters except newline/tab

Apply filters (if enabled):
- min_length, max_length: character count
- min_token_count, max_token_count: whitespace-split token count
- language_is: language detection match
- in_set: field value in allowed set
- not_in_set: field value not in blocked set
- regex_match: field matches regex
- regex_exclude: field does NOT match regex

Apply deduplication (if enabled):
- exact_hash: SHA-256 hash of concatenated fields
- minhash_lsh: MinHash with LSH for approximate dedup (similarity > 0.9)
- keep strategy: which occurrence of each duplicate group to retain (e.g. "first")

Document all transformations: Create a transformation manifest listing every operation applied, in order, with parameters.
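The normalize and exact-hash deduplicate steps above can be sketched in a few lines of standard-library Python. Operation names follow the config; the implementations, the `\x1f` field separator, and the function names are illustrative assumptions (lowercase, remove_control_chars, and minhash_lsh are omitted):

```python
import hashlib
import re
import unicodedata

def normalize_text(text: str) -> str:
    text = text.strip()                        # strip_whitespace
    text = unicodedata.normalize("NFC", text)  # normalize_unicode
    text = re.sub(r"\n{2,}", "\n", text)       # collapse_newlines
    return text

def dedup_exact(records, fields=("input",), keep="first"):
    """Exact dedup: SHA-256 over the concatenated key fields. Returns (kept, removed)."""
    seen, kept, removed = set(), [], []
    ordered = records if keep == "first" else list(reversed(records))
    for rec in ordered:
        key = hashlib.sha256(
            "\x1f".join(rec.get(f, "") for f in fields).encode("utf-8")
        ).hexdigest()
        (removed if key in seen else kept).append(rec)
        seen.add(key)
    if keep == "last":
        kept.reverse()  # restore original order
    return kept, removed
```

Normalization should run before deduplication so that records differing only in whitespace or Unicode form hash to the same key.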
Write outputs:
- <output_path>/data_processed.jsonl - Final processed dataset
- <output_path>/filtered_out.jsonl - Records removed by filters (with reasons)
- <output_path>/dedup_removed.jsonl - Records removed by deduplication
- <output_path>/transform_manifest.yaml - Complete transformation documentation
- <output_path>/processing_report.yaml - Statistics

Report results.
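The write step might look like the following sketch. The helper name and argument layout are assumptions; `json.dump` stands in for a YAML writer here (JSON is a subset of YAML, so the manifest files remain loadable, but with PyYAML available `yaml.safe_dump` would be the natural choice):

```python
import json
from pathlib import Path

def write_outputs(out_dir, processed, filtered_out, dedup_removed, manifest, report):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # One JSON record per line for the three JSONL artifacts.
    for name, records in [("data_processed.jsonl", processed),
                          ("filtered_out.jsonl", filtered_out),
                          ("dedup_removed.jsonl", dedup_removed)]:
        with open(out / name, "w", encoding="utf-8") as f:
            for rec in records:
                f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    # Manifest and report: JSON-formatted, which any YAML parser can read.
    (out / "transform_manifest.yaml").write_text(json.dumps(manifest, indent=2))
    (out / "processing_report.yaml").write_text(json.dumps(report, indent=2))
```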
## Data Processing Complete
- **Input Records**: [N]
- **Output Records**: [N] ([X.X%] retained)
- **Normalized**: [N] records modified
- **Filtered Out**: [N]
- **Deduplicated**: [N] removed
### Transformation Pipeline (applied in order)
1. normalize: strip_whitespace on [input, output] -> [N] records modified
2. normalize: normalize_unicode on [input, output] -> [N] records modified
3. filter: min_input_length >= 10 -> [N] removed
4. filter: max_output_length <= 2048 -> [N] removed
5. deduplicate: exact_hash on [input] -> [N] removed
### Filter Breakdown
| Filter | Records Removed | Percentage |
|--------|----------------|------------|
| min_input_length | [N] | [X.X%] |
| max_output_length | [N] | [X.X%] |
| language | [N] | [X.X%] |
| ... | ... | ... |
### Dataset Statistics (After Processing)
| Metric | Value |
|--------|-------|
| Total records | [N] |
| Avg input length | [N] chars |
| Avg output length | [N] chars |
| Unique inputs | [N] ([X.X%]) |
### Artifacts Written
- [output_path]/data_processed.jsonl
- [output_path]/filtered_out.jsonl
- [output_path]/dedup_removed.jsonl
- [output_path]/transform_manifest.yaml
- [output_path]/processing_report.yaml