Files
SpacetimeDB/docs/scripts/generate-llms.mjs
bradleyshep be86a512f2 LLM Benchmark Improvements + More Evals (#4740)
# Description of Changes

LLM benchmark infrastructure improvements and new benchmark tasks.

**Runner & scoring:**
- Add retry logic with backoff for LLM API calls (rate limits,
502/503/504, timeouts)
- Fix `generation_duration_ms` to only time the successful attempt, not
retries+sleep delays
- Add `--dry-run` flag to run benchmarks without saving results
- Add OpenRouter client as unified fallback when direct vendor keys
aren't set
- Add web search mode via OpenRouter `:online` suffix
- Extract shared OpenAI-compatible response types into `oa_compat.rs`
- Add `ReducerCallBothScorer` for calling reducers on both golden and
LLM databases
- Set `max_tokens` on OpenRouter and Meta clients to prevent silent
truncation

**Model routing:**
- Add `ModelRoute` with display name, vendor, API model, and OpenRouter
model ID
- Support ad-hoc model IDs via `--models vendor:model` without static
registration
- Add model name normalization (OpenRouter IDs, case variants →
canonical display names)

**Context modes:**
- Add `guidelines`, `cursor_rules`, `search`, `no_context` modes with
`is_empty_context_mode()` helper
- Add mode-specific prompt preambles
- Consolidate mode alias normalization (`none`/`no_guidelines` →
`no_context`)

**CI workflows:**
- Add `llm-benchmark-periodic.yml` for scheduled nightly runs with
per-language failure tracking
- **Note**: The periodic workflow requires `OPENROUTER_API_KEY`,
`LLM_BENCHMARK_UPLOAD_URL`, and `LLM_BENCHMARK_API_KEY` as GitHub
secrets.
- Add `llm-benchmark-validate-goldens.yml` for validating golden answers
still compile

**Results & summary:**
- Add `cmd_status` to show incomplete benchmark combinations with rerun
commands
- Add `cmd_analyze` for LLM-powered failure analysis
- Split `normalize_details_file` from `write_summary_from_details_file`
- Derive task categories from filesystem for summary generation
- Add timestamp tracking (`started_at`/`finished_at`) and token usage

**New benchmark tasks:**
- 30 new tasks across auth, data_modeling, queries, basics, and schema
categories
- Updated/fixed existing task prompts and golden answers

# API and ABI breaking changes

None. Internal tooling only.

# Expected complexity level and risk

2 — Changes are scoped to the LLM benchmark CLI tool
(`xtask-llm-benchmark`) and CI workflows. No impact on SpacetimeDB core.

# Testing

- [x] `cargo check -p xtask-llm-benchmark` — zero errors, zero warnings
- [x] Dry run: `llm_benchmark run --lang typescript --modes no_context
--tasks t_001 --models openai:gpt-5-mini --dry-run` — ran end-to-end,
confirmed no results saved to disk
- [ ] Verify periodic workflow runs successfully on next scheduled
trigger

---------

Co-authored-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
2026-05-11 22:53:24 +00:00

49 lines
1.3 KiB
JavaScript

#!/usr/bin/env node
/**
* Post-build script: copies the plugin-generated llms.txt to static/llms.md
* so it can be committed to the repo.
*
* Usage: pnpm build && node scripts/generate-llms.mjs
* or: pnpm generate-llms
*/
import { promises as fs } from 'node:fs';
import path from 'node:path';
import { fileURLToPath } from 'node:url';
const __dirname = path.dirname(fileURLToPath(import.meta.url));
const BUILD_DIR = path.resolve(__dirname, '../build');
const STATIC_DIR = path.resolve(__dirname, '../static');
async function findInBuild(filename) {
for (const candidate of [filename, `docs/${filename}`]) {
const p = path.join(BUILD_DIR, candidate);
try {
await fs.access(p);
return p;
} catch {}
}
return null;
}
async function main() {
const src = await findInBuild('llms.txt');
if (!src) {
console.error('Error: llms.txt not found in build output.');
console.error('Run "pnpm build" first to generate it via the plugin.');
process.exit(1);
}
const content = await fs.readFile(src, 'utf8');
const dest = path.join(STATIC_DIR, 'llms.md');
await fs.writeFile(dest, content, 'utf8');
const lines = content.split('\n').length;
console.log(`${src} -> ${dest}`);
console.log(` ${lines} lines`);
}
main().catch((err) => {
console.error(err);
process.exit(1);
});