bradleyshep c5317e1052 LLM benchmarks: run weekly LLM benchmarks from website-managed models (#5324)
### Note 1: this requires a website PR to merge
### Note 2: 
I was able to run all workflow smoke tests successfully, including
golden validation and dry-run benchmarks, with one exception: the C#
dry-run benchmark path. C# golden validation passes, but the C#
benchmark dry run still fails on the runner, sometimes intermittently
and sometimes on every attempt, despite several attempts to align its
build/publish setup with the known-good smoke-test path.

```powershell
gh workflow run llm-benchmark-periodic.yml `
  --repo ClockworkLabs/SpacetimeDB `
  --ref bradley/fix-validate-goldens-ci `
  -f model_set=explicit `
  -f models="openrouter:openai/gpt-5.4-mini" `
  -f languages=rust,csharp,typescript `
  -f modes=guidelines `
  -f tasks=t_000_empty_reducers `
  -f dry_run=true
```

# Description of Changes

This updates the LLM benchmark automation and runner plumbing.

- Move periodic LLM benchmark and golden validation workflows from
daily/nightly to weekly Monday UTC runs.
- Add manual workflow inputs for benchmark smoke runs:
  - model set: website-managed, local defaults, or explicit models
  - languages, modes, categories, tasks
  - dry-run mode
- Build the local TypeScript SDK before TypeScript benchmark/golden
validation runs.
- Add support for fetching active/available benchmark models from the
website API via `--model-source remote`.
- Keep explicit `--models ...` working for manual/local overrides.
- Add OpenRouter preflight checks before benchmark execution:
  - checks key/account credits when available
  - probes the selected model when credit balance cannot be checked
  - supports `OPENROUTER_ALLOW_UNCHECKED_CREDITS=1` escape hatch
  - supports `OPENROUTER_MIN_CREDITS` / `LLM_MIN_CREDITS`
- Force scheduled benchmark workflow runs through OpenRouter with
`LLM_VENDOR=openrouter`, while preserving direct OpenAI support for
local/manual use.
- Improve benchmark publishing isolation:
  - isolated SpacetimeDB CLI root per publish
  - serialized C# benchmark publish concurrency
  - local NuGet package references for generated C# benchmark projects
  - Windows/PATH handling for TypeScript `pnpm`
- Update default benchmark model routes to current model names/ids.
- Update TypeScript golden answers for current SDK shape.
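
The OpenRouter preflight described above can be sketched roughly as follows. This is a minimal illustration and not the runner's actual code: the `Preflight` enum, the function names, the precedence of `OPENROUTER_MIN_CREDITS` over `LLM_MIN_CREDITS`, the default threshold of `0.0`, and the way the escape hatch skips the probe are all assumptions made for the example. Only the environment variable names come from this PR.

```rust
/// Outcome of the preflight check (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum Preflight {
    /// Credits were verified, or the escape hatch allows an unchecked run.
    Proceed,
    /// Balance could not be read, so probe the selected model instead.
    ProbeModel,
    /// Balance is known and below the threshold.
    Abort(String),
}

/// Resolve the minimum-credit threshold from the two variables.
/// Assumption: OPENROUTER_MIN_CREDITS wins when both are set, and a
/// missing or unparsable value falls back to 0.0.
fn min_credits(openrouter_min: Option<&str>, llm_min: Option<&str>) -> f64 {
    openrouter_min
        .or(llm_min)
        .and_then(|v| v.trim().parse::<f64>().ok())
        .unwrap_or(0.0)
}

/// Read the threshold from the process environment.
fn threshold_from_env() -> f64 {
    let openrouter = std::env::var("OPENROUTER_MIN_CREDITS").ok();
    let generic = std::env::var("LLM_MIN_CREDITS").ok();
    min_credits(openrouter.as_deref(), generic.as_deref())
}

/// True when OPENROUTER_ALLOW_UNCHECKED_CREDITS is set to "1".
fn allow_unchecked_from_env() -> bool {
    std::env::var("OPENROUTER_ALLOW_UNCHECKED_CREDITS")
        .map(|v| v.trim() == "1")
        .unwrap_or(false)
}

/// Decide what to do given the balance (None when it cannot be checked).
fn preflight(balance: Option<f64>, min: f64, allow_unchecked: bool) -> Preflight {
    match balance {
        Some(b) if b >= min => Preflight::Proceed,
        Some(b) => Preflight::Abort(format!(
            "OpenRouter balance {b:.2} is below the required minimum {min:.2}"
        )),
        None if allow_unchecked => Preflight::Proceed,
        None => Preflight::ProbeModel,
    }
}
```

A caller would pass `threshold_from_env()` and `allow_unchecked_from_env()` into `preflight` once the balance lookup has returned. Keeping the decision in a pure function makes each branch easy to unit test without network access.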

# API and ABI breaking changes

None.

This adds benchmark-runner/workflow behavior and CLI options, but does
not change SpacetimeDB runtime API or ABI.

# Expected complexity level and risk

3/5

The changes are mostly isolated to the LLM benchmark runner and GitHub
workflows, but the risk is moderate because they touch CI execution
paths, local SDK build assumptions, website-managed model resolution,
OpenRouter routing, and generated module publish behavior across Rust,
C#, and TypeScript.

The most sensitive pieces are:
- GitHub Actions workflow dispatch/manual input behavior.
- Remote model registry parsing from the website.
- C# benchmark publish behavior on the self-hosted runner.

# Testing

- [x] `cargo check -p xtask-llm-benchmark --bin llm_benchmark`
- [x] `cargo test -p xtask-llm-benchmark --bin llm_benchmark`
- [x] `cargo test -p xtask-llm-benchmark parses_active_available_model_routes`
- [x] Manual GitHub Actions golden validation smoke runs for Rust, C#,
and TypeScript.
- [ ] Run a dry-run periodic benchmark workflow from this branch with
one explicit OpenRouter model, one task, and all languages.
- [ ] Run a website-dispatched dry-run benchmark and verify it sends
`model_set=explicit` plus selected model/task inputs.

---------

Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@clockworklabs.io>
Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
2026-06-22 18:11:11 +00:00