Recovering Central Document From Embeddings

Skill Profile

Recover central-document similarity outputs from precomputed embeddings, including mislabeled or corrupted-looking embedding files, and produce validated JSON results. Use when a task asks for the most central document by average cosine similarity.

fuyu123450 · March 29, 2026

Occupation
Category: Documentation

Skill Content

When to Use

Use this skill when:

- A task provides precomputed embedding vectors and document IDs.
- The goal is to find the single document with the highest average cosine similarity to all others.
- The expected output is a JSON file with fields like central_doc_id and average_similarity.
- The embedding file appears unreadable despite a .npz extension (e.g., mislabeled text/script content).
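
The selection criterion above can be sketched as follows. This is a minimal illustration, not the skill's confirmed implementation: the function name `find_central_document` is hypothetical, and it assumes that "all others" means self-similarity is excluded from each document's average. The output keys follow the field names mentioned above.

```python
import json

import numpy as np


def find_central_document(embeddings, doc_ids):
    """Return the document with the highest mean cosine similarity to the others.

    Hypothetical helper: assumes one embedding row per entry in doc_ids and
    excludes each document's similarity to itself from its average.
    """
    emb = np.asarray(embeddings, dtype=np.float64)
    if emb.ndim != 2 or emb.shape[0] != len(doc_ids):
        raise ValueError("expected one embedding row per document ID")
    if emb.shape[0] < 2:
        raise ValueError("need at least two documents to compare")

    # Normalize rows to unit length so the dot product equals cosine similarity.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    # Average over the other n - 1 documents (drop the diagonal).
    n = sim.shape[0]
    avg = (sim.sum(axis=1) - np.diag(sim)) / (n - 1)

    best = int(np.argmax(avg))
    return {
        "central_doc_id": str(doc_ids[best]),
        "average_similarity": float(avg[best]),
    }


if __name__ == "__main__":
    demo = find_central_document([[1, 0], [1, 1], [0, 1]], ["a", "b", "c"])
    print(json.dumps(demo, indent=2))
```

If the task instead defines the average over all documents including the document itself, the diagonal should be kept and the divisor changed to `n`; check the task statement before choosing.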

Minimal Reliable Workflow

Inspect the input file before assuming its format.
- Run quick checks:
  - ls -lah <path>
  - python -c "print(open('<path>','rb').read(64))"
- If the first bytes look like text/code (e.g., import numpy as np), treat the file as mislabeled content, not a real NPZ.
If the file is a generator script, execute it to materialize the real archive.
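
The inspection step can also be done programmatically. The sketch below is an assumption-laden illustration: `classify_embedding_file` is a hypothetical helper, and the labels it returns are invented for this example. It relies on the facts that an NPZ archive is a ZIP file and that a standalone NPY file begins with the bytes `\x93NUMPY`. It only classifies the file; it does not execute anything.

```python
import zipfile


def classify_embedding_file(path):
    """Guess what a file claiming to be .npz really contains.

    Hypothetical helper. Returns one of "npz", "npy", "text", or "unknown".
    """
    with open(path, "rb") as fh:
        head = fh.read(64)

    # A real NPZ archive is a ZIP container.
    if zipfile.is_zipfile(path):
        return "npz"

    # A bare NPY array starts with this magic string.
    if head.startswith(b"\x93NUMPY"):
        return "npy"

    # Printable ASCII plus tab/newline/carriage return suggests text or a script.
    if head and all(32 <= b < 127 or b in (9, 10, 13) for b in head):
        return "text"

    return "unknown"
```

A result of "text" suggests the file may be a generator script or other mislabeled content; read it before deciding whether to run it.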