**WORKFLOW SKILL** — AI technical watch: arxiv analysis, paper methodology, benchmark tracking, conference monitoring, reproducibility assessment. USE FOR: analyzing papers, tracking SOTA, evaluating new methods, summarizing research trends, comparing approaches. USE WHEN: evaluating a new technique, staying current on AI research, assessing paper claims.
Track state-of-the-art AI research, analyze papers, and evaluate new methods for practical applicability.
| Concept | Description |
|---|---|
| SOTA Tracking | Monitoring best results on standard benchmarks |
| Paper Analysis | Structured reading: claims, methods, results, limitations |
| Reproducibility | Can results be replicated? Code, data, compute requirements |
| Ablation Study | Which components matter? Sensitivity analysis |
| Transfer Potential | Can methods apply to our tasks and constraints? |
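
An ablation study asks how much the metric degrades when each component is removed. The sketch below ranks components by that degradation. The component names and scores in the usage example are hypothetical, and the function name `ablation_deltas` is illustrative rather than taken from any tool.

```python
def ablation_deltas(full_score: float, ablated: dict[str, float],
                    higher_is_better: bool = True) -> list[tuple[str, float]]:
    """Return (component, degradation) pairs, largest degradation first.

    Degradation is positive when removing the component hurts the metric.
    Set higher_is_better=False for error-style metrics such as WER or FID.
    """
    deltas = []
    for component, score in ablated.items():
        drop = full_score - score if higher_is_better else score - full_score
        deltas.append((component, drop))
    # Components whose removal hurts most come first.
    return sorted(deltas, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    full = 72.0  # hypothetical score of the full model
    # Hypothetical scores with one component removed at a time.
    results = {"no_pretraining": 61.5, "no_data_aug": 70.8, "no_aux_loss": 71.9}
    for name, drop in ablation_deltas(full, results):
        print(f"{name}: -{drop:.1f}")
```

A component with a near-zero delta is a candidate to drop when porting a method to our constraints.
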
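
For each paper, extract its claims, methods, results, and limitations, plus the code, data, and compute needed to reproduce it.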
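
One way to keep these notes consistent across papers is a small structured record. This is a minimal sketch: the class name `PaperNotes`, its field names, and the reproducibility check are assumptions for illustration, not an existing schema.

```python
from dataclasses import dataclass, field


@dataclass
class PaperNotes:
    """Structured reading notes for one paper (hypothetical schema)."""
    arxiv_id: str
    title: str
    claims: list[str] = field(default_factory=list)
    methods: list[str] = field(default_factory=list)
    results: list[str] = field(default_factory=list)
    limitations: list[str] = field(default_factory=list)
    code_released: bool = False
    data_available: bool = False
    compute: str = "unknown"  # compute as reported by the authors, free text

    def missing_sections(self) -> list[str]:
        """List the analysis sections that are still empty."""
        sections = {
            "claims": self.claims,
            "methods": self.methods,
            "results": self.results,
            "limitations": self.limitations,
        }
        return [name for name, items in sections.items() if not items]

    def reproducible(self) -> bool:
        """Crude check: code and data released and compute cost reported."""
        return self.code_released and self.data_available and self.compute != "unknown"


if __name__ == "__main__":
    notes = PaperNotes("2106.09685", "LoRA", claims=["<claim from the abstract>"])
    print(notes.missing_sections())  # sections still to fill in
```
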

Maintain awareness of key benchmarks:

Language models:

| Benchmark | Measures | Key Models |
|---|---|---|
| MMLU | Knowledge, reasoning | GPT-4, Claude, Llama |
| HumanEval | Code generation | Codex, DeepSeek-Coder |
| HellaSwag | Common sense | Most LLMs |
| ARC | Science reasoning | Most LLMs |
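
HumanEval results are usually reported as pass@k. The Codex paper that introduced HumanEval (Chen et al., 2021) uses an unbiased estimator: generate n samples per problem, count the c that pass the unit tests, and compute pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. A minimal sketch (function names are our own):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass the tests, k: sample budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(per_problem: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)
```

Comparing a pass@1 number against a pass@10 number, or against one computed with a different n or sampling temperature, is not like-for-like; check those settings before recording a new SOTA.
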

Audio and speech:

| Benchmark / metric | Measures | Key Models |
|---|---|---|
| LibriSpeech | ASR (WER) | Whisper, Conformer |
| PESQ/STOI | Objective speech quality / intelligibility | Mimi, EnCodec, DAC |
| MOS | Subjective quality | TTS systems |
| VoiceBox benchmark | Speech generation | VoiceBox, Moshi |
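
LibriSpeech results are reported as word error rate: WER = (substitutions + deletions + insertions) / number of reference words, obtained from a word-level edit distance. A minimal sketch follows. Its normalization (lowercasing and whitespace splitting) is an assumption; published numbers usually apply a stricter text normalizer, so scores from this function are not directly comparable to paper tables.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER via word-level Levenshtein distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words processed so far
    # and the first j hypothesis words (row 0 = empty reference).
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)  # i deletions against an empty hypothesis
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[len(hyp)] / len(ref)
```

Because insertions count, WER can exceed 1.0 (for example, a one-word reference against a three-word hypothesis).
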

Vision:

| Benchmark / metric | Measures | Key Models |
|---|---|---|
| ImageNet | Classification | ViT, EfficientNet |
| COCO | Detection/Segmentation | DETR, SAM |
| FID/IS | Image generation quality | Diffusion models |
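
SOTA tracking has to respect metric direction: accuracy-style scores (MMLU) improve upward, while error-style metrics (WER, FID) improve downward. A minimal sketch of a best-result tracker; the `LOWER_IS_BETTER` set, the benchmark keys, and all class names are assumptions, and the model names and scores in the usage example are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    benchmark: str
    score: float
    source: str  # e.g. an arXiv ID or leaderboard URL


# Assumed list of benchmark keys where lower is better; extend as needed.
LOWER_IS_BETTER = {"LibriSpeech WER", "FID"}


class SotaTracker:
    """Keep the best reported result per benchmark key."""

    def __init__(self) -> None:
        self.best: dict[str, Result] = {}

    def _improves(self, new: Result, old: Result) -> bool:
        if new.benchmark in LOWER_IS_BETTER:
            return new.score < old.score
        return new.score > old.score

    def add(self, result: Result) -> bool:
        """Record a result; return True if it sets a new best."""
        current = self.best.get(result.benchmark)
        if current is None or self._improves(result, current):
            self.best[result.benchmark] = result
            return True
        return False


if __name__ == "__main__":
    tracker = SotaTracker()
    tracker.add(Result("model-a", "LibriSpeech WER", 3.0, "hypothetical"))
    print(tracker.add(Result("model-b", "LibriSpeech WER", 2.5, "hypothetical")))
```

Encode the evaluation setting in the key (split, few-shot count, normalization) so that only like-for-like numbers compete.
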
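
Track these active research areas:

Score each method for our context from 1 to 5 on every criterion, where 5 is always the most favorable (low implementation complexity and low compute requirements score high), then take the weighted sum: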
| Criterion | Weight | Score 1-5 |
|---|---|---|
| Relevance to audio/speech | 30% | |
| Implementation complexity | 20% | |
| Compute requirements | 20% | |
| Quality improvement expected | 20% | |
| Maturity / reproducibility | 10% | |
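
The weighted total from this rubric can be computed as below. This is a sketch: the dictionary keys are shorthand names we chose for the table's criteria, and it assumes every criterion is scored so that 5 is the most favorable value.

```python
# Weights from the rubric table; they sum to 1.0, so the total stays on a 1-5 scale.
WEIGHTS = {
    "relevance_audio_speech": 0.30,
    "implementation_complexity": 0.20,
    "compute_requirements": 0.20,
    "quality_improvement": 0.20,
    "maturity_reproducibility": 0.10,
}


def weighted_score(scores: dict[str, int]) -> float:
    """Combine per-criterion 1-5 scores into a weighted total (5 = best)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError(f"expected scores for {sorted(WEIGHTS)}")
    for name, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be in 1..5, got {value}")
    return sum(WEIGHTS[name] * value for name, value in scores.items())
```

For example, scores of 5, 3, 4, 4, 2 (in table order) give 1.5 + 0.6 + 0.8 + 0.8 + 0.2 = 3.9.
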

Key papers from the Kyutai open-source research team, plus two widely used techniques (LoRA, RoPE) that come from other groups:
| Paper | Topic | Relevance |
|---|---|---|
| arxiv:2410.00037 | Moshi: full-duplex speech-text | Core architecture |
| arxiv:2502.03382 | Hibiki: speech translation | Cross-lingual |
| arxiv:2509.06926 | Pocket-TTS: lightweight TTS | Edge deployment |
| arxiv:2505.18825 | LSD: latent speech diffusion | Generative method |
| arxiv:2106.09685 | LoRA | Fine-tuning method |
| arxiv:2104.09864 | RoPE | Position encoding |
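
The references in this table are new-style arXiv identifiers (YYMM.NNNNN, where YYMM is the year and month the ID was assigned). A small helper can normalize them into abstract links and a batch request to the public arXiv API's `id_list` query. This is a sketch: the helper names are our own, and it deliberately rejects old-style IDs such as `hep-th/9901001`.

```python
import re
from urllib.parse import urlencode

# New-style IDs: YYMM.NNNN (2007-2014) or YYMM.NNNNN (2015 on), optional "vN".
ARXIV_ID = re.compile(r"^(?:arxiv:)?(\d{2})(\d{2})\.(\d{4,5})(v\d+)?$", re.IGNORECASE)


def parse_arxiv_id(ref: str) -> tuple[str, int, int]:
    """Return (bare_id, year, month) for a reference like 'arxiv:2410.00037'."""
    m = ARXIV_ID.match(ref.strip())
    if m is None:
        raise ValueError(f"not a new-style arXiv ID: {ref!r}")
    yy, mm, number, _version = m.groups()
    month = int(mm)
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in arXiv ID: {ref!r}")
    return f"{yy}{mm}.{number}", 2000 + int(yy), month


def abs_url(ref: str) -> str:
    """Link to the arXiv abstract page (version suffix dropped)."""
    return f"https://arxiv.org/abs/{parse_arxiv_id(ref)[0]}"


def api_query_url(refs: list[str]) -> str:
    """Build one arXiv API request that fetches metadata for several papers."""
    ids = ",".join(parse_arxiv_id(r)[0] for r in refs)
    return "http://export.arxiv.org/api/query?" + urlencode({"id_list": ids})
```

The API returns an Atom feed; fetching and parsing it is left out here to keep the sketch offline.
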

Conferences to monitor:

| Conference | When | Focus |
|---|---|---|
| NeurIPS | December | General ML |
| ICML | July | General ML |
| ICLR | May | Representation learning |
| ACL/EMNLP | July/December | NLP |
| Interspeech | September | Speech |
| ICASSP | April/June | Signal processing |
| CVPR/ICCV | June/October (ICCV in odd years only) | Computer vision |
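
To plan reading around deadlines and proceedings, the typical months in this table can drive a simple "what is coming up" check. This is a sketch under loose assumptions: the month mapping copies the table (splitting ACL/EMNLP and CVPR/ICCV), real dates shift every year, and it ignores that ICCV runs only in odd years.

```python
import datetime

# Typical months from the conference table; confirm each edition's site.
CONFERENCE_MONTHS: dict[str, tuple[int, ...]] = {
    "NeurIPS": (12,),
    "ICML": (7,),
    "ICLR": (5,),
    "ACL": (7,),
    "EMNLP": (12,),
    "Interspeech": (9,),
    "ICASSP": (4, 6),
    "CVPR": (6,),
    "ICCV": (10,),  # odd years only; not modeled here
}


def upcoming(today: datetime.date, window_months: int = 3) -> list[tuple[int, str]]:
    """Return (months_ahead, conference) pairs within the window, soonest first.

    months_ahead is 0 for a conference expected in the current month.
    """
    hits = []
    for name, months in CONFERENCE_MONTHS.items():
        # Month distance, wrapping across the year boundary.
        ahead = min((m - today.month) % 12 for m in months)
        if ahead <= window_months:
            hits.append((ahead, name))
    return sorted(hits)
```

For instance, calling `upcoming(datetime.date(2024, 11, 1), window_months=1)` lists EMNLP and NeurIPS, one month ahead.
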