mirror of
https://github.com/clockworklabs/SpacetimeDB.git
synced 2026-06-27 00:13:33 -04:00
c5317e1052
### Note 1: this requires a website PR to merge

### Note 2: I was able to run all workflow smoke tests successfully, including golden validation and dry-run benchmarks, except for the C# dry-run benchmark path. C# golden validation passes, but the C# benchmark dry run still fails intermittently/consistently on the runner despite several attempts to align its build/publish setup with the known-good smoketest path.

```
gh workflow run llm-benchmark-periodic.yml `
  --repo ClockworkLabs/SpacetimeDB `
  --ref bradley/fix-validate-goldens-ci `
  -f model_set=explicit `
  -f models="openrouter:openai/gpt-5.4-mini" `
  -f languages=rust,csharp,typescript `
  -f modes=guidelines `
  -f tasks=t_000_empty_reducers `
  -f dry_run=true
```

# Description of Changes

This updates the LLM benchmark automation and runner plumbing.

- Move periodic LLM benchmark and golden validation workflows from daily/nightly to weekly Monday UTC runs.
- Add manual workflow inputs for benchmark smoke runs:
  - model set: website-managed, local defaults, or explicit models
  - languages, modes, categories, tasks
  - dry-run mode
- Build the local TypeScript SDK before TypeScript benchmark/golden validation runs.
- Add support for fetching active/available benchmark models from the website API via `--model-source remote`.
- Keep explicit `--models ...` working for manual/local overrides.
- Add OpenRouter preflight checks before benchmark execution:
  - checks key/account credits when available
  - probes the selected model when credit balance cannot be checked
  - supports `OPENROUTER_ALLOW_UNCHECKED_CREDITS=1` escape hatch
  - supports `OPENROUTER_MIN_CREDITS` / `LLM_MIN_CREDITS`
- Force scheduled benchmark workflow runs through OpenRouter with `LLM_VENDOR=openrouter`, while preserving direct OpenAI support for local/manual use.
- Improve benchmark publishing isolation:
  - isolated SpacetimeDB CLI root per publish
  - serialized C# benchmark publish concurrency
  - local NuGet package references for generated C# benchmark projects
  - Windows/PATH handling for TypeScript `pnpm`
- Update default benchmark model routes to current model names/ids.
- Update TypeScript golden answers for current SDK shape.

# API and ABI breaking changes

None. This adds benchmark-runner/workflow behavior and CLI options, but does not change SpacetimeDB runtime API or ABI.

# Expected complexity level and risk

3/5

The changes are mostly isolated to the LLM benchmark runner and GitHub workflows, but the risk is moderate because they touch CI execution paths, local SDK build assumptions, website-managed model resolution, OpenRouter routing, and generated module publish behavior across Rust, C#, and TypeScript.

The most sensitive pieces are:

- GitHub Actions workflow dispatch/manual input behavior.
- Remote model registry parsing from the website.
- C# benchmark publish behavior on the self-hosted runner.

# Testing

- [x] `cargo check -p xtask-llm-benchmark --bin llm_benchmark`
- [x] `cargo test -p xtask-llm-benchmark --bin llm_benchmark`
- [x] `cargo test -p xtask-llm-benchmark parses_active_available_model_routes`
- [x] Manual GitHub Actions golden validation smoke runs for Rust, C#, and TypeScript.
- [ ] Run a dry-run periodic benchmark workflow from this branch with one explicit OpenRouter model, one task, and all languages.
- [ ] Run a website-dispatched dry-run benchmark and verify it sends `model_set=explicit` plus selected model/task inputs.

---------

Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@clockworklabs.io>
Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
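
# Illustrative sketch: OpenRouter preflight decision

For reviewers, the OpenRouter preflight behavior listed under "Description of Changes" can be pictured roughly as below. This is a minimal sketch, not the code in this PR. The names `Preflight`, `min_credits_from_env`, `allow_unchecked_from_env`, and `decide` are hypothetical. The precedence of `OPENROUTER_MIN_CREDITS` over `LLM_MIN_CREDITS`, the default threshold of `0.0`, and the choice to skip the model probe when the escape hatch is set are all assumptions. The network calls for the balance lookup and the model probe are left out and represented by an `Option<f64>` and a closure.

```rust
use std::env;

/// Outcome of the hypothetical preflight decision.
#[derive(Debug, PartialEq)]
enum Preflight {
    Proceed,
    Abort(String),
}

/// Reads the minimum credit threshold from the environment.
/// Assumption: OPENROUTER_MIN_CREDITS is consulted before LLM_MIN_CREDITS,
/// unparsable values are skipped, and the default is 0.0.
fn min_credits_from_env() -> f64 {
    ["OPENROUTER_MIN_CREDITS", "LLM_MIN_CREDITS"]
        .iter()
        .filter_map(|name| env::var(name).ok())
        .filter_map(|raw| raw.trim().parse::<f64>().ok())
        .next()
        .unwrap_or(0.0)
}

/// True only when OPENROUTER_ALLOW_UNCHECKED_CREDITS is set to exactly "1".
fn allow_unchecked_from_env() -> bool {
    env::var("OPENROUTER_ALLOW_UNCHECKED_CREDITS")
        .map(|value| value.trim() == "1")
        .unwrap_or(false)
}

/// Decides whether a benchmark run may start.
/// `balance` is `Some` when the credit balance could be checked, `None` otherwise.
/// `probe` stands in for a cheap request against the selected model.
fn decide(
    balance: Option<f64>,
    min_credits: f64,
    allow_unchecked: bool,
    probe: impl Fn() -> bool,
) -> Preflight {
    // A known balance is compared directly against the threshold.
    if let Some(credits) = balance {
        return if credits >= min_credits {
            Preflight::Proceed
        } else {
            Preflight::Abort(format!(
                "credits {credits:.2} are below the minimum {min_credits:.2}"
            ))
        };
    }
    // Balance unknown: the escape hatch skips further checks (assumed ordering).
    if allow_unchecked {
        return Preflight::Proceed;
    }
    // Otherwise fall back to probing the selected model.
    if probe() {
        Preflight::Proceed
    } else {
        Preflight::Abort("credit balance unavailable and model probe failed".to_string())
    }
}
```

A caller would pass `min_credits_from_env()` and `allow_unchecked_from_env()` into `decide`, together with the result of the balance lookup and a closure that performs the probe. Keeping the decision separate from the environment and network access makes each branch easy to unit test.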