Name: Vietnamese-Khmer Cultural MT Benchmark
Author: toan-huynh-ai

スキルを検索.../

Vietnamese-Khmer Cultural MT Benchmark | Skills Pool

MT/
├── .env                               # Azure OpenAI credentials (NEVER commit)
├── all_1.jsonl, all_2.jsonl           # Parallel data (1,856 samples total)
├── config.py                          # Azure/project config
├── core/                              # Shared clients
│   ├── auth.py
│   ├── azure_client.py
│   └── embeddings.py
│
├── ── CKB & EVALUATION ──
├── cultural_kb_expanded.py            # ★ CKB v2: 132 entries, A/B/C taxonomy
├── cultural_knowledge_base_v2.json    # ★ Exported CKB v2 (JSON)
├── evaluation_framework.py            # ★ CuEA + Script Purity implementation
├── cultural_kb.py                     # CKB v1 (53 entries, superseded by v2)
├── cultural_knowledge_base.json       # CKB v1 export (legacy)
│
├── ── EXPERIMENTS ──
├── experiment_full.py                 # ★ Full exp: 40 cultural samples, KB-RAG
├── experiment_pilot.py                # Pilot: zero-shot, few-shot, context
├── find_weaknesses.py                 # 6 weakness probes (48 samples)
├── test_kb_rag.py                     # KB-RAG v1 ablation (6 samples)
├── analyze_results.py                 # Pilot results analysis
├── analyze_weaknesses.py              # Weakness probe analysis
│
├── ── RESULTS ──
├── experiment_results/
│   ├── full_experiment_20260409_164322.json   # ★ Main results (40 samples)
│   ├── pilot_results_20260408_154436.json
│   ├── weakness_probe_20260408_163922.json
│   └── kb_rag_results.json
│
├── ── REPORTS & DOCS ──
├── PATH_A_FINAL_REPORT.md             # ★ Latest comprehensive report
├── FINAL_RESEARCH_REPORT.md           # Previous full report
├── GPT4o_WEAKNESS_REPORT.md           # Weakness catalog
├── contributions.md                   # Paper contribution plan (C1-C10)
├── khmer_diff.md                      # Khmer Cambodia vs Krom analysis
├── critique_and_revised_directions.md # Self-critique of 5 directions
└── research_directions.md             # Original 5 proposed directions

{
  "id": 71645,
  "text": "Vietnamese source text",
  "question": "Optional Vietnamese question (QA format only)",
  "label": ["Khmer translation 1", "Khmer translation 2 with *** annotations"],
  "Comments": [],
  "topic": "Topic name (dialogue format only)",
  "order": 1.0
}

pip install openai azure-identity httpx sacrebleu python-dotenv

import httpx
from azure.identity import ClientSecretCredential, get_bearer_token_provider
from openai import AzureOpenAI

http_client = httpx.Client(verify=False, proxy=os.getenv("HTTPS_PROXY"))
credential = ClientSecretCredential(
    tenant_id=os.getenv("AZURE_TENANT_ID"),
    client_id=os.getenv("APPLICATION_AI_VOS_USERS_ID"),
    client_secret=os.getenv("APPLICATION_AI_VOS_USERS_SECRET"),
    connection_verify=False,
)
token_provider = get_bearer_token_provider(
    credential, "https://cognitiveservices.azure.com/.default"
)
client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_version=os.getenv("AZURE_API_VERSION"),
    azure_ad_token_provider=token_provider,
    http_client=http_client,
)

Script	What it does	Time	Key results
`experiment_full.py`	40 cultural samples, plain vs KB-RAG	~45 min	Main paper results
`experiment_pilot.py`	Zero-shot, few-shot, dialogue context	~30 min	Pilot baselines
`find_weaknesses.py`	6 targeted probes + back-translation	~20 min	Weakness evidence
`test_kb_rag.py`	CKB-RAG v1 on 6 known-fail samples	~5 min	Entity fix proof
`cultural_kb_expanded.py`	Build/export CKB v2, lookup, RAG context	instant	Run standalone
`evaluation_framework.py`	CuEA + Script Purity demo	instant	Metric validation

cd C:\Users\HOY9HC\Desktop\Code\Learning\MT
python experiment_full.py        # ★ Main experiment
python experiment_pilot.py       # Pilot (context, few-shot)
python find_weaknesses.py        # Weakness probes
python cultural_kb_expanded.py   # Build/verify CKB v2

from evaluation_framework import compute_cuea
result = compute_cuea(source_vi, hypothesis_km, reference_km)
# Returns: {"cuea": 0.937, "n_entities": 8, "n_correct": 7, "details": [...]}

from evaluation_framework import compute_script_purity
result = compute_script_purity(hypothesis_km)
# Returns: {"purity": 0.913, "is_pure": False, "n_chinese_chars": 2, ...}

from evaluation_framework import classify_errors
result = classify_errors(source_vi, hypothesis_km, reference_km)
# Returns: standard_metrics + cuea + script_purity + error_taxonomy

┌───────────────────────────────────────┬───────┬─────────┬────────┐
│ Condition                             │ BLEU  │ chrF++  │ CuEA   │
├───────────────────────────────────────┼───────┼─────────┼────────┤
│ Zero-shot GPT-4o (general)            │  0.79 │  37.98  │   —    │
│ Zero-shot GPT-4o (cultural samples)   │  2.67 │  38.64  │  0.419 │
│ Random 3-shot                         │  1.39 │  44.36  │   —    │
│ Topic-matched 3-shot                  │  2.33 │  44.16  │   —    │
│ Full dialogue context                 │  1.85 │  45.11  │   —    │
│ CKB-RAG v2 (40 cultural samples)      │  1.76 │  41.02  │  0.937 │
├───────────────────────────────────────┼───────┼─────────┼────────┤
│ CKB-RAG delta                         │       │  +2.38  │ +0.518 │
│ CKB error reduction                   │       │         │  86%   │
│ chrF++ win rate (context)             │       │ 8/10    │        │
│ CuEA win rate (CKB)                   │       │ 32/40   │        │
└───────────────────────────────────────┴───────┴─────────┴────────┘

Weakness probes (chrF++, weakest first):
  Complex sentences   36.36
  Kinship terms       37.43
  Colloquial speech   38.76
  Food/cuisine        39.46
  Religious/ritual    43.13
  Khmer Krom regional 44.27

Group A (Loanwords):  67 entries  — need etymological mapping
Group B (Romanized):  46 entries  — need back-transliteration
Group C (Toponyms):   19 entries  — do NOT translate literally

from cultural_kb_expanded import lookup, build_rag_context

# Find entities in Vietnamese text
entities = lookup("Tôi làm cốm dẹp cho lễ Ok Om Bok tại Tri Tôn")
# Returns: [{vi: "cốm dẹp", km: "អំបុក", group: "A", category: "food"}, ...]

# Generate RAG context for translation prompt
rag_text = build_rag_context("Tôi làm cốm dẹp cho lễ Ok Om Bok tại Tri Tôn")
# Returns: "Cultural terminology reference (Khmer Krom dialect):\n  'cốm dẹp' → 'អំបុក' ..."

Category	Count	Key examples
food	20	cốm dẹp→អំបុក, bánh tét→នំអន្សម, mắm bò hóc→ម៉ាំប្រហុក
religious	18	chùa→វត្ត, Sư→ព្រះសង្ឃ, tắm Phật→ស្រង់ព្រះ
toponyms	18	Sóc Trăng→ខេត្តឃ្លាំង, Trà Vinh→ព្រះត្រពាំង, Tri Tôn→ស្រុកបាយ៉ង់
romanized	24	Chol Chnam Thmay→ចូលឆ្នាំថ្មី, Ok Om Bok→អកអំបុក
kinship	11	bác→ធំ, cô→មីង, bà ngoại→យាយ
cultural_practices	11	phum sóc→ភូមិសង្គម, rong vong→រាំវង់
agriculture	6	lúa mùa nổi→ស្រូវវស្សាអណ្ដែត, tre→ឫស្សី
festivals	7	Kathina→កឋិនទាន, Sene Dolta→សែនដូនតា
music_arts	4	Dù Kê→យីកេ, Ngũ Âm→ពិណពាទ្យ

#	Contribution	Evidence	Status
C1	CulturalMT-ViKm benchmark (1,856 samples, 56 topics)	Dataset	Done
C2	A/B/C linguistic taxonomy of Khmer Krom MT challenges	CKB v2, 132 entries	Done
C3	6-category GPT-4o weakness taxonomy	48 probe samples	Done
C4	CKB v2 + RAG → CuEA 0.419→0.937, 86% error reduction	40-sample exp	Done
C5	Dialogue context → +9.0 chrF++, 80% win rate	10 conversations	Done
C6	CuEA + Script Purity metrics (CulturalEval framework)	Implemented	Done
C7	BLEU≈0 for Vi-Km; CuEA catches what chrF++ misses	All experiments	Done
C8	Multi-model comparison (NLLB, Google Translate, Claude)	TODO	MUST
C9	Human evaluation (2 Krom annotators, 100 samples)	TODO	MUST
C10	Public release: dataset + CKB on HuggingFace	TODO	SHOULD

# Cultural entity (Group A or B)
{"vi": "cốm dẹp", "km": "អំបុក", "km_romanized": "ambok",
 "context": "Flattened rice, used in Ok Om Bok festival",
 "group": "A",  # A=loanword, B=romanized, C=toponym
 "km_cambodia": "optional_cambodia_variant"}  # if differs from Krom

# Toponym (Group C)
{"vi": "Sóc Trăng", "km": "ខេត្តឃ្លាំង", "km_original": "Srok Khleang",
 "meaning": "Land of depositories/silver storage",
 "type": "province",
 "note": "Sốc Kha Lang → Sóc Trăng (Vietnamese phonetic)"}

Vietnamese-Khmer Cultural MT Benchmark

Project Overview

Critical Domain Knowledge

Khmer Krom ≠ Cambodian Khmer

Vietnamese-Khmer Cultural MT Benchmark

Project Overview

Critical Domain Knowledge

Khmer Krom ≠ Cambodian Khmer

A/B/C Linguistic Taxonomy of Khmer Krom MT Challenges

Six GPT-4o Weakness Categories (Proven Experimentally)

Project Structure

Data Schema

Running Experiments

Prerequisites

Azure Client Pattern (Corporate proxy workaround — use everywhere)

Experiment Catalog

Evaluation Metrics

Standard Metrics

Cultural Metrics (our contribution, implemented in `evaluation_framework.py`)

Error Taxonomy (from `classify_errors`)

All Experimental Results Snapshot

Cultural Knowledge Base (CKB v2)

Usage

CKB Categories Quick Reference

Paper Contributions (C1–C10)

Workflows

Adding a New Model to the Benchmark

Extending the CKB

KB Entry Schema

Key Design Decisions

Additional Resources

Continuous Learning V2

Continuous Learning V2

Continuous Learning V2

Continuous Learning

Continuous Learning

Pytorch Patterns

Vietnamese-Khmer Cultural MT Benchmark

Project Overview

Critical Domain Knowledge

Khmer Krom ≠ Cambodian Khmer

Vietnamese-Khmer Cultural MT Benchmark

Project Overview

Critical Domain Knowledge

Khmer Krom ≠ Cambodian Khmer

A/B/C Linguistic Taxonomy of Khmer Krom MT Challenges

Six GPT-4o Weakness Categories (Proven Experimentally)

Project Structure

Data Schema

Running Experiments

Prerequisites

Azure Client Pattern (Corporate proxy workaround — use everywhere)

Experiment Catalog

Evaluation Metrics

Standard Metrics

Cultural Metrics (our contribution, implemented in evaluation_framework.py)

Error Taxonomy (from classify_errors)

All Experimental Results Snapshot

Cultural Knowledge Base (CKB v2)

Usage

CKB Categories Quick Reference

Paper Contributions (C1–C10)

Workflows

Adding a New Model to the Benchmark

Extending the CKB

KB Entry Schema

Key Design Decisions

Additional Resources

Continuous Learning V2

Continuous Learning V2

Continuous Learning V2

Continuous Learning

Continuous Learning

Pytorch Patterns

Cultural Metrics (our contribution, implemented in `evaluation_framework.py`)

Error Taxonomy (from `classify_errors`)