Skill: ml-data-curation-and-refinement | Skills Pool
Skill Profile
Skill: ml-data-curation-and-refinement
Use this skill when the user wants to prepare, clean, or label datasets for machine learning using specialized code and automated models. Trigger it for requests like 'fill in the missing gaps in my data', 'automatically label this dataset based on our rules', 'impute the null values in my training set', or 'fix the data quality before training'. It is specifically designed for complex ML engineering tasks involving model-based imputation (like recovering sensor readings) and weak supervision pipelines (like heuristic-based labeling) rather than simple SQL data transformations.
Dingxingdi0 · Starred April 10, 2026
Occupation
Category
Machine Learning
Skill Content
1. Capability Definition & Real Case
Professional Definition: The ability to architect and implement data-centric engineering workflows for machine learning, focusing on the automation of data imputation and weak supervision pipelines to enhance dataset quality and model readiness. This involves diagnosing a dataset's missingness distribution, implementing iterative quality-refinement loops with specialized algorithms (e.g., diffusion-based imputation), and using programmatic labeling functions to transform raw or corrupted datasets into high-fidelity training assets while preserving statistical and semantic integrity.
Dimension Hierarchy: Data and ML Workflow Engineering->Machine Learning Engineering->ml-data-curation-and-refinement
Real Case
[Case 1]
Initial Environment: A machine learning workspace contains a weather sensor dataset (PM2.5) with 40% missing entries and a repository for the CSDI (Conditional Score-based Diffusion) imputation model. The environment includes a configuration file defining the temporal and spatial coordinates of the sensor network.
Real Question: Implement and execute the CSDI imputation pipeline to recover the missing PM2.5 values in the weather dataset, ensuring the resulting Mean Absolute Error (MAE) is minimized according to the project's baseline.
Real Trajectory: The agent first performs a diagnostic scan of the CSV files to identify the distribution of null values across time-steps and sensor locations. It then navigates to the CSDI repository, configures the model's noise schedule and batch size to match the provided sensor dimensions, and initializes the training process on the available non-missing data. After the diffusion model converges, the agent executes the reverse sampling process to impute the gaps and calculates the MAE by comparing a held-out subset of the ground-truth data against the generated values.
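The diagnostic scan and MAE verification steps in this trajectory can be sketched as below. This is a minimal illustration, not the actual CSDI pipeline: the file layout, column names, and the synthetic arrays standing in for the PM2.5 readings and the model's output are all assumptions.

```python
import numpy as np
import pandas as pd

def missingness_report(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing entries per sensor column, worst first."""
    return df.isna().mean().sort_values(ascending=False)

def masked_mae(truth: np.ndarray, imputed: np.ndarray,
               eval_mask: np.ndarray) -> float:
    """MAE computed only on the held-out positions (eval_mask == True),
    i.e. ground-truth values hidden from the model during imputation."""
    return float(np.abs(truth[eval_mask] - imputed[eval_mask]).mean())

# Synthetic stand-in: 100 time-steps x 36 sensors, ~40% held out
rng = np.random.default_rng(0)
truth = rng.normal(50.0, 10.0, size=(100, 36))
eval_mask = rng.random(truth.shape) < 0.4
# Stand-in for the diffusion model's reverse-sampling output
imputed = truth + rng.normal(0.0, 1.0, size=truth.shape)
print(missingness_report(pd.DataFrame(np.where(eval_mask, np.nan, truth))))
print(masked_mae(truth, imputed, eval_mask))
```

The key detail is that the MAE is computed against a held-out subset of ground truth, not against values the model was allowed to see; otherwise the metric overstates imputation quality.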
Real Answer: A completed dataset where all null values are replaced by model-generated estimates, achieving a Mean Absolute Error of 13.5.
Why this demonstrates the capability: This demonstrates the capability because the agent had to perform a sophisticated engineering loop: diagnosing missingness patterns, configuring a complex generative model (CSDI), and verifying numerical fidelity (MAE) of the imputed data rather than performing simple mean-filling.
[Case 2]
Initial Environment: A healthcare repository contains 20,000 unlabeled ECG signal recordings and a documentation manual providing expert rules for detecting heart arrhythmias (e.g., 'R-peak distance < 0.2s implies Class A'). A skeleton script for weak supervision labeling is available.
Real Question: Develop a weak supervision pipeline to label the ECG dataset. Use the heuristic rules found in the manual to assign initial labels and then refine them into a unified training set.
Real Answer: A labeled dataset where each signal is assigned an arrhythmia category based on a consensus of the implemented heuristic rules, validated against an expert-labeled 1% sample set with 85% accuracy.
Why this demonstrates the capability: The agent must bridge the gap between abstract domain rules and executable labeling functions. Success requires transforming unstructured heuristic descriptions into a programmatic pipeline that labels large-scale data automatically, satisfying the 'Weak Supervision' engineering requirement.
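The rule-to-labeling-function translation described above can be sketched as follows. The thresholds, the `r_peak_distance` feature, and the class names are hypothetical stand-ins for the manual's rules; a real pipeline would extract the feature from the raw ECG signal and typically use a learned label model rather than a plain majority vote.

```python
from collections import Counter

ABSTAIN = -1
CLASS_A, NORMAL = 0, 1

# Hypothetical labeling functions derived from rules like the manual's
# "R-peak distance < 0.2s implies Class A". Each function either votes
# for a class or abstains.
def lf_short_rr(r_peak_distance: float) -> int:
    return CLASS_A if r_peak_distance < 0.2 else ABSTAIN

def lf_regular_rhythm(r_peak_distance: float) -> int:
    return NORMAL if 0.6 <= r_peak_distance <= 1.0 else ABSTAIN

def majority_vote(votes: list[int]) -> int:
    """Combine labeling-function outputs; ties or all-abstain -> ABSTAIN."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    top = counts.most_common(2)
    if len(top) > 1 and top[0][1] == top[1][1]:
        return ABSTAIN
    return top[0][0]

def label_signal(rr: float) -> int:
    return majority_vote([lf_short_rr(rr), lf_regular_rhythm(rr)])

print(label_signal(0.15))  # short R-R interval -> CLASS_A
```

Abstention is what makes this "weak" supervision: each heuristic labels only the signals it is confident about, and the consensus step resolves conflicts across the 20,000 recordings.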
Pipeline Execution Instructions
To synthesize data for this capability, you must strictly follow a 3-phase pipeline. Do not hallucinate steps. Read the corresponding reference file for each phase sequentially:
Phase 1: Environment Exploration
Read the exploration guidelines to discover raw knowledge seeds:
references/EXPLORATION.md
Phase 2: Trajectory Selection
Once Phase 1 is complete, read the selection criteria to evaluate the trajectory:
references/SELECTION.md
Phase 3: Data Synthesis
Once a trajectory passes Phase 2, read the synthesis instructions to generate the final data:
references/SYNTHESIS.md