Run and evaluate LLM science benchmarks across free Argonne endpoints. Covers GPQA Diamond (198 graduate QA), LitQA2 (199 literature QA), BixBench (296 bioinformatics MCQ), BioPro PQA (1200 biology protocol QA), AAAR (1049 equation inference), DiscoveryBench (239 agentic discovery), GeoBenchX (202 geospatial tool-calling), LLM-SRBench (239 equation discovery), and CORE-Bench (270 computational reproducibility). Use when evaluating models on science tasks, comparing model performance, or adding new models to benchmark tables.
Evaluate LLMs on scientific reasoning across 8+ benchmarks using free Argonne compute endpoints (Argo proxy, ALCF Sophia, CELS vLLM). Produces LaTeX tables and comparison dashboards.
| Endpoint | URL | Key | Models |
|---|---|---|---|
| Argo Proxy | http://127.0.0.1:44497/v1 (tunneled) | stevens | 55 models (GPT-5.x, Claude 4.x, Gemini 2.5, O3/O4) |
| ALCF Sophia | http://127.0.0.1:8890/v1 (tunneled) | stevens | 12 models (Llama, Qwen, Gemma, Mixtral) |
| CELS vLLM | Various public IPs :80/v1 | CELS | OSS-120B, Llama70, Trinity, Gemma4, K2.5 |
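All three endpoints speak the OpenAI-compatible API, so a quick stdlib-only connectivity check can list the models an endpoint serves. This is a sketch (helper names are hypothetical; URLs and keys as in the table above, SSH tunnels assumed up):

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Normalize a base URL like http://127.0.0.1:44497/v1 into its /models route."""
    return base_url.rstrip("/") + "/models"


def list_models(base_url: str, api_key: str) -> list[str]:
    """Query an OpenAI-compatible endpoint's /models route and return model ids."""
    req = urllib.request.Request(
        models_url(base_url),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload.get("data", [])]
```

Useful as a smoke test before launching a multi-hour benchmark run: if `list_models("http://127.0.0.1:44497/v1", "stevens")` fails, the tunnel is down.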
**GPQA Diamond.** Graduate-level science QA. Uses `gpqa_eval_v2.py`.
```bash
python3 gpqa_eval_v2.py --model gpt-5.4 --base-url http://127.0.0.1:44497/v1 \
  --api-key stevens --output GPQA-gpt54-cot.json --mode cot --max-tokens 100000 --timeout 300
```
Key lessons:
- Set `max_tokens >= 100000` — reasoning tokens count against the limit
- Strip `<think>` tags before answer extraction for thinking models
- Use `temperature=0`

**LitQA2.** Literature-based biology QA. Uses `bench_runner.py`.
**BixBench.** Bioinformatics multiple choice. Some models (Maverick, Mixtral) score 0% due to answer-format mismatch.
**BioPro PQA.** Biology protocol QA. Use `score_bioprob_robust.py` for extraction — the official scorer only accepts the `[ANSWER_START]...[ANSWER_END]` format.
**AAAR.** Equation inference. Use `score_aaar_robust.py` — the original harness stored the full reasoning text as the `"prediction"` instead of extracting the A/B/C/D letter.
**DiscoveryBench.** Agentic discovery tasks, scored with GPT-4.1 as judge. Use `score_discoverybench.py` via Argo.
**GeoBenchX.** Multi-step agentic benchmark with 23 geospatial tools. Requires env vars:
```bash
export STATDATAPATH=data/Data/StatData
export GEODATAPATH=data/Data/GeoData
export SCRATCHPATH=scratch
export OPENAI_API_BASE=http://127.0.0.1:44497/v1
export OPENAI_API_KEY=stevens
```
Evaluation uses LLM-as-judge (GPT-4.1) with 3-point scoring (0=no match, 1=partial, 2=full).
Key findings: Frontier models (GPT-5.4: 64%, Claude Opus: 61%) dominate. Open models score ~2%.
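The 3-point judge scheme maps naturally onto a headline percentage (sum of scores over the maximum attainable). A sketch, with hypothetical helper names and assumed prompt wording (the benchmark's own judge prompt and aggregation may differ):

```python
def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Assemble an LLM-as-judge prompt (wording is an assumption, not the benchmark's)."""
    return (
        "Grade the candidate answer against the reference.\n"
        f"Question: {question}\n"
        f"Reference: {reference}\n"
        f"Candidate: {candidate}\n"
        "Reply with a single digit: 0 = no match, 1 = partial, 2 = full."
    )


def aggregate(scores: list[int]) -> float:
    """Percentage of the maximum attainable judge score (each item is worth 2)."""
    assert all(s in (0, 1, 2) for s in scores)
    return 100.0 * sum(scores) / (2 * len(scores)) if scores else 0.0
```

Under this aggregation, a run of all-partial matches scores 50%, which is worth keeping in mind when comparing against binary-accuracy benchmarks.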
**LLM-SRBench.** Iterative LLM-guided symbolic regression. Very slow (~1 problem/hr).
- Gated dataset (`nnheui/llm-srbench`) — need to accept access terms
- Add `multiprocessing.set_start_method("fork", force=True)` at the top of `eval.py`
- Model config needs `api_type: "openai"`, `api_url`, `api_model`

**CORE-Bench.** Computational reproducibility. Requires Docker; run on machines with Docker (e.g., DGX Spark).
- Install `openai` in the Dockerfile
- Use `network_mode="host"` for Tailscale endpoint access from inside Docker

| Script | Benchmark | Notes |
|---|---|---|
| `gpqa_eval_v2.py` | GPQA | CoT + direct modes, random guess fallback |
| `score_aaar_robust.py` | AAAR | Letter extraction from verbose reasoning |
| `score_bioprob_robust.py` | BioPro | Official + fallback extraction |
| `score_discoverybench.py` | DiscBench | LLM-as-judge via Argo |
| `rescore_labbench_v2.py` | LitQA2 | Accuracy computation |
Results table: `benchmark_table.tex` (landscape, booktabs, sorted by GPQA).
Compile: `/Library/TeX/texbin/pdflatex benchmark_table.tex`
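For reference, the general shape of such a table (a hand-written sketch with scores omitted; the real `benchmark_table.tex` is generated by the eval scripts and its exact columns may differ):

```latex
\documentclass{article}
\usepackage{booktabs}
\usepackage{pdflscape}
\begin{document}
\begin{landscape}
\begin{tabular}{lrrrr}
\toprule
Model & GPQA & LitQA2 & BixBench & GeoBenchX \\
\midrule
gpt-5.4     & -- & -- & -- & -- \\
claude-opus & -- & -- & -- & -- \\
\bottomrule
\end{tabular}
\end{landscape}
\end{document}
```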
ALCF token refresh: `refresh_alcf_token.sh`