Applies normalization, filtering, and deduplication transformations to a dataset, outputting the processed dataset with full transformation statistics. Use after cleaning, or when user says 'normalize data', 'filter dataset', 'deduplicate', 'apply transformations'.
Apply normalization, filtering, and deduplication transformations to a dataset. Produces a processed dataset with comprehensive statistics documenting every transformation applied.
```yaml
transform_request:
  dataset:
    path: "<path_to_dataset>"
    format: "jsonl" | "csv" | "json"
  transformations:                  # applied in order
    normalize:
      enabled: true
      rules:
        - field: "input"
          operations:
            - "strip_whitespace"
            - "normalize_unicode"   # NFC normalization
            - "lowercase"           # optional, task-dependent
            - "collapse_newlines"   # multiple \n -> single \n
        - field: "output"
          operations:
            - "strip_whitespace"
            - "normalize_unicode"
    filter:
      enabled: true
      rules:
        - name: "min_input_length"
          field: "input"
          condition: "min_length"
          value: 10
        - name: "max_output_length"
          field: "output"
          condition: "max_length"
          value: 2048
        - name: "language"
          field: "input"
          condition: "language_is"
          value: "en"               # requires language detection
        - name: "custom"
          field: "label"
          condition: "in_set"
          value: ["positive", "negative", "neutral"]
    deduplicate:
      enabled: true
      method: "exact_hash"          # or "minhash_lsh"
      fields: ["input"]
      keep: "first"
  output:
    path: "/research_memory/30-data/processed/<name>/"
```
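The filter rules in the config above can be evaluated per record. A minimal sketch, assuming string field values; `check_rule` and its return convention (True = record passes) are illustrative names, not part of the spec, and `language_is` / token-count conditions are omitted for brevity:

```python
import re

def check_rule(record: dict, rule: dict) -> bool:
    """Return True if the record passes the rule, False if it should be filtered out."""
    value = record.get(rule["field"], "")
    cond, target = rule["condition"], rule["value"]
    if cond == "min_length":
        return len(value) >= target
    if cond == "max_length":
        return len(value) <= target
    if cond == "in_set":
        return value in target
    if cond == "not_in_set":
        return value not in target
    if cond == "regex_match":
        return re.search(target, value) is not None
    if cond == "regex_exclude":
        return re.search(target, value) is None
    raise ValueError(f"unknown condition: {cond}")

rule = {"name": "min_input_length", "field": "input",
        "condition": "min_length", "value": 10}
print(check_rule({"input": "short"}, rule))           # False: 5 chars < 10
print(check_rule({"input": "a longer input"}, rule))  # True: 14 chars >= 10
```

A record is kept only if every enabled rule returns True; recording which rule failed first gives the per-record reason written to `filtered_out.jsonl`.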
If dataset path is missing: STOP. Report the error.
Load dataset: Read all records. Record initial count and baseline statistics.
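The load step can be sketched as follows for JSONL input; `load_jsonl` and `baseline_stats` are illustrative helper names, and the `input`/`output` field names follow the config example above:

```python
import json

def load_jsonl(path: str) -> list[dict]:
    """Read one JSON record per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def baseline_stats(records: list[dict]) -> dict:
    """Initial count and average field lengths, recorded before any transformation."""
    n = len(records)
    return {
        "count": n,
        "avg_input_chars": sum(len(r.get("input", "")) for r in records) / n if n else 0,
        "avg_output_chars": sum(len(r.get("output", "")) for r in records) / n if n else 0,
    }
```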
Apply normalization (if enabled):
- strip_whitespace: Remove leading/trailing whitespace
- normalize_unicode: Apply NFC normalization
- lowercase: Convert to lowercase
- collapse_newlines: Replace multiple consecutive newlines with a single newline
- remove_control_chars: Remove non-printable characters except newline/tab

Apply filters (if enabled):
- min_length, max_length: character count
- min_token_count, max_token_count: whitespace-split token count
- language_is: language detection match
- in_set: field value in allowed set
- not_in_set: field value not in blocked set
- regex_match: field matches regex
- regex_exclude: field does NOT match regex

Apply deduplication (if enabled):
- exact_hash: SHA-256 hash of concatenated fields
- minhash_lsh: MinHash with LSH for approximate dedup (similarity > 0.9)
- keep strategy: which occurrence of each duplicate group to retain (e.g. "first")

Document all transformations: Create a transformation manifest listing every operation applied, in order, with parameters.
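The normalize and exact-hash deduplicate steps above can be sketched in a few lines of standard-library Python. Operation names follow the config; the implementations, the `\x1f` field separator, and the function names are illustrative assumptions (lowercase, remove_control_chars, and minhash_lsh are omitted):

```python
import hashlib
import re
import unicodedata

def normalize_text(text: str) -> str:
    text = text.strip()                        # strip_whitespace
    text = unicodedata.normalize("NFC", text)  # normalize_unicode
    text = re.sub(r"\n{2,}", "\n", text)       # collapse_newlines
    return text

def dedup_exact(records, fields=("input",), keep="first"):
    """Exact dedup: SHA-256 over the concatenated key fields. Returns (kept, removed)."""
    seen, kept, removed = set(), [], []
    ordered = records if keep == "first" else list(reversed(records))
    for rec in ordered:
        key = hashlib.sha256(
            "\x1f".join(rec.get(f, "") for f in fields).encode("utf-8")
        ).hexdigest()
        (removed if key in seen else kept).append(rec)
        seen.add(key)
    if keep == "last":
        kept.reverse()  # restore original order
    return kept, removed
```

Normalization should run before deduplication so that records differing only in whitespace or Unicode form hash to the same key.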
Write outputs:
- <output_path>/data_processed.jsonl - Final processed dataset
- <output_path>/filtered_out.jsonl - Records removed by filters (with reasons)
- <output_path>/dedup_removed.jsonl - Records removed by deduplication
- <output_path>/transform_manifest.yaml - Complete transformation documentation
- <output_path>/processing_report.yaml - Statistics

Report results.
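The write step might look like the following sketch. The helper name and argument layout are assumptions; `json.dump` stands in for a YAML writer here (JSON is a subset of YAML, so the manifest files remain loadable, but with PyYAML available `yaml.safe_dump` would be the natural choice):

```python
import json
from pathlib import Path

def write_outputs(out_dir, processed, filtered_out, dedup_removed, manifest, report):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # One JSON record per line for the three JSONL artifacts.
    for name, records in [("data_processed.jsonl", processed),
                          ("filtered_out.jsonl", filtered_out),
                          ("dedup_removed.jsonl", dedup_removed)]:
        with open(out / name, "w", encoding="utf-8") as f:
            for rec in records:
                f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    # Manifest and report: JSON-formatted, which any YAML parser can read.
    (out / "transform_manifest.yaml").write_text(json.dumps(manifest, indent=2))
    (out / "processing_report.yaml").write_text(json.dumps(report, indent=2))
```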
## Data Processing Complete
- **Input Records**: [N]
- **Output Records**: [N] ([X.X%] retained)
- **Normalized**: [N] records modified
- **Filtered Out**: [N]
- **Deduplicated**: [N] removed
### Transformation Pipeline (applied in order)
1. normalize: strip_whitespace on [input, output] -> [N] records modified
2. normalize: normalize_unicode on [input, output] -> [N] records modified
3. filter: min_input_length >= 10 -> [N] removed
4. filter: max_output_length <= 2048 -> [N] removed
5. deduplicate: exact_hash on [input] -> [N] removed
### Filter Breakdown
| Filter | Records Removed | Percentage |
|--------|----------------|------------|
| min_input_length | [N] | [X.X%] |
| max_output_length | [N] | [X.X%] |
| language | [N] | [X.X%] |
| ... | ... | ... |
### Dataset Statistics (After Processing)
| Metric | Value |
|--------|-------|
| Total records | [N] |
| Avg input length | [N] chars |
| Avg output length | [N] chars |
| Unique inputs | [N] ([X.X%]) |
### Artifacts Written
- [output_path]/data_processed.jsonl
- [output_path]/filtered_out.jsonl
- [output_path]/dedup_removed.jsonl
- [output_path]/transform_manifest.yaml
- [output_path]/processing_report.yaml