Name: Datacheck
Author: letitbk

Buscar habilidades.../

file <filename>                          # encoding detection
head -c 500 <filename> | cat -v          # hidden chars: ^M (CR), BOM, non-UTF8
wc -l <filename>                         # row count sanity check
head -3 <filename>                       # peek at delimiter and header

Symptom	Meaning
`^M` at line ends	Classic Mac `\r` or Windows `\r\n` line endings
`\xEF\xBB\xBF` at start	UTF-8 BOM marker
Only 1 line from `wc -l`	Entire file on one line (wrong line endings)
Mixed delimiters	Inconsistent separator characters

import pandas as pd
df = pd.read_csv("file.csv")  # or sep='\t' for TSV

import pandas as pd
xls = pd.ExcelFile("file.xlsx")
print(f"Sheets: {xls.sheet_names}")
df = pd.read_excel("file.xlsx", sheet_name=0)  # or specific sheet name

df = pd.read_json("file.json")

Column          Type        Missing   Unique   Example Values
─────────────────────────────────────────────────────────────
revenue         float64     12 (2%)   845      19.99, 49.50, 129.00
category        object      0 (0%)    8        Electronics, Clothing, Home
order_date      datetime    3 (0.5%)  365      2024-01-15, 2024-02-20

Check	What to Look For
Coded missings	97/98/99, -1/-9/-99 in numeric columns with lower range
Placeholder values	"N/A", "n/a", "#N/A", "NULL", "-", "" in string columns
Date parsing	Dates stored as strings, mixed formats (MM/DD vs DD/MM)
Currency strings	"$1,234.56" stored as text instead of numeric
Duplicate rows	Exact duplicates and near-duplicates (same key columns)
ID uniqueness	If a column looks like an ID, check it's actually unique

DATA QUALITY ISSUES:
  revenue: 15 values are "$X,XXX.XX" strings. Needs: df['revenue'] = df['revenue'].str.replace('[$,]', '', regex=True).astype(float)
  status: contains {'N/A', '-', ''} (n=23). Likely missing values.
  order_date: mixed formats detected (2024-01-15 and 01/15/2024). Needs standardization.

# Quick profiling
print(df.describe(include='all'))
print(df.nunique())

# Correlation matrix for numeric columns
numeric_cols = df.select_dtypes(include='number')
if len(numeric_cols.columns) > 1:
    corr = numeric_cols.corr()
    # Show top correlated pairs

# SUGGESTED FIX for revenue currency strings:
df['revenue'] = df['revenue'].str.replace('[$,]', '', regex=True).astype(float)

# SUGGESTED FIX for placeholder missings:
df['status'] = df['status'].replace({'N/A': pd.NA, '-': pd.NA, '': pd.NA})

File type	Load with	Extra checks
`.csv`	`pd.read_csv()`	encoding, delimiter, line endings
`.tsv`	`pd.read_csv(sep='\t')`	encoding, line endings
`.xlsx` / `.xls`	`pd.read_excel()`	sheets, merged cells, header row
`.json`	`pd.read_json()`	nested structure, normalize if needed
`.parquet`	`pd.read_parquet()`	just structure check

Mistake	Fix
Loading Excel without checking sheets	Always check `pd.ExcelFile(f).sheet_names` first
Assuming column types are correct	Check with `df.dtypes` — revenue as object means string
Ignoring encoding issues	ALWAYS run `file` and `cat -v` before loading
Auto-fixing without asking	Report + suggest. Never auto-recode values
Not checking for duplicate rows	Always run `df.duplicated().sum()`

Datacheck | Skills Pool

Item	Value
Dimensions	rows x cols
Column names	list all
Column types	int, float, object, datetime, bool
Total missing	count and %

Datacheck

Datacheck

Data Check

When to Use

When NOT to Use

Workflow

Step 1: Raw File Inspection

Step 2: Load and Summarize

Step 3: Excel-Specific Checks

Step 4: Data Quality Checks

Step 5: Light Profiling

Step 6: Report Summary

Quick Reference

Common Mistakes

Feishu Doc

Summarize

Nano Pdf

Diffs

Customs Trade Compliance

Nutrient Document Processing