# Description of Changes

Introduce a new **LLM benchmarking app** and supporting code.

* **CLI:** `llm` with subcommands `run`, `routes list`, `diff`, `ci-check`.
* **Runner:** executes globally numbered tasks; filters by `--lang`, `--categories`, `--tasks`, `--providers`, `--models`.
* **Providers/clients:** route layer (`provider:model`) with HTTP LLM vendor clients; env-driven keys/base URLs.
* **Evaluation:** deterministic scorers (hash/equality, JSON shape/count, light schema/reducer parity) with clear failure messages.
* **Results:** stable JSON schema; single-file HTML viewer to inspect, filter, and export CSV.
* **Build & guards:** build script for compile-time setup.
* **Docs:** `DEVELOP.md` includes `cargo llm …` usage.

This PR is the initial addition of the app and its modules (runner, config, routes, prompt/segmentation, scorers, schema/types, defaults/constants/paths/hashing/combine, publishers, spacetime guard, HTML stats viewer); illustrative sketches of several of these pieces appear in the appendix at the end of this description.

### How it works

1. **Pick what to run**
   * Choose tasks (`--tasks 0,7,12`), a language (`--lang rust|csharp`), or categories (`--categories basics,schema`).
   * Optionally limit vendors/models (`--providers …`, `--models …`).
2. **Resolve routes**
   * Read env (API keys + base URLs) and build the active set (e.g., `openai:gpt-5`).
3. **Build context**
   * Start Spacetime.
   * Publish golden-answer modules.
   * Prepare prompts and send them to the LLM.
   * Attempt to publish the LLM-generated module.
4. **Execute calls**
   * Run the selected tasks against each selected model and language.
5. **Score outputs**
   * Apply deterministic scorers (hash/equality, JSON shape/count, simple schema/reducer checks).
   * Record the score and any short failure reason.
6. **Update results file**
   * Write/update the single results JSON with task/route outcomes, timings, and summaries.

# API and ABI breaking changes

None. New application and modules; no existing public APIs/ABIs altered.

# Expected complexity level and risk

**4/5.** New CLI, routing, evaluation, and artifact format.

* External model APIs may rate-limit or time out; concurrency is tunable via `LLM_BENCH_CONCURRENCY` / `LLM_BENCH_ROUTE_CONCURRENCY`.

# Testing

I ran the full test matrix and generated results for every task against every vendor, model, and language (Rust + C#). I also tested the CI check locally using [act](https://github.com/nektos/act).

**Please verify**

* [ ] `llm run --tasks 0,1,2` (explicit `run`)
* [ ] `llm run --lang rust --categories basics` (filters)
* [ ] `llm run --categories basics,schema` (multiple categories)
* [ ] `llm run --lang csharp` (language switch)
* [ ] `llm run --providers openai,anthropic --models "openai:gpt-5 anthropic:claude-sonnet-4-5"` (provider/model limits)
* [ ] `llm run --hash-only` (dry integrity)
* [ ] `llm run --goldens-only` (test goldens only)
* [ ] `llm run --force` (skip hash check)
* [ ] `llm ci-check`
* [ ] Stats viewer loads the JSON; filtering and CSV export work
* [ ] CI works as intended

---------

Signed-off-by: bradleyshep <148254416+bradleyshep@users.noreply.github.com>
Signed-off-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Tyler Cloutier <cloutiertyler@aol.com>
Co-authored-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: spacetimedb-bot <spacetimedb-bot@users.noreply.github.com>
Co-authored-by: John Detter <4099508+jdetter@users.noreply.github.com>
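# Appendix: illustrative sketches

The sketches below are reviewer aids, not excerpts from the PR; any name, field, or layout that does not appear verbatim above is an assumption.

The subcommand surface (`run`, `routes list`, `diff`, `ci-check`) maps naturally onto a `clap` derive. This sketch covers only the commands and flags named in this PR; the actual definition in the app may differ:

```rust
// Hypothetical clap layout for the `llm` CLI; only flags named in the
// PR are shown. clap = { version = "4", features = ["derive"] }
use clap::{Parser, Subcommand};

#[derive(Parser)]
#[command(name = "llm")]
struct Cli {
    #[command(subcommand)]
    cmd: Cmd,
}

#[derive(Subcommand)]
enum Cmd {
    /// Run the selected tasks against the active routes.
    Run {
        #[arg(long, value_delimiter = ',')]
        tasks: Vec<u32>,
        #[arg(long)]
        lang: Option<String>,
        #[arg(long, value_delimiter = ',')]
        categories: Vec<String>,
        #[arg(long, value_delimiter = ',')]
        providers: Vec<String>,
        /// Space-separated `provider:model` pairs, per the checklist
        /// example; the real parsing rule is an assumption.
        #[arg(long)]
        models: Option<String>,
        #[arg(long)]
        hash_only: bool,
        #[arg(long)]
        goldens_only: bool,
        #[arg(long)]
        force: bool,
    },
    /// Inspect routes.
    Routes {
        #[command(subcommand)]
        cmd: RoutesCmd,
    },
    /// Compare two results files.
    Diff,
    /// Non-interactive pass/fail check for CI (rendered as `ci-check`).
    CiCheck,
}

#[derive(Subcommand)]
enum RoutesCmd {
    /// List the routes resolvable from the current environment.
    List,
}

fn main() {
    let _cli = Cli::parse();
    // Dispatch on `_cli.cmd` would go here.
}
```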
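The runner's task selection composes three filters (`--tasks`, `--lang`, `--categories`). A minimal sketch of that predicate, with the `Task` fields guessed for illustration:

```rust
// Hypothetical task shape; the real module's types are not shown here.
struct Task {
    id: u32,
    lang: &'static str,     // "rust" | "csharp"
    category: &'static str, // e.g. "basics", "schema"
}

/// An empty filter means "no restriction" for that dimension.
fn selected(task: &Task, ids: &[u32], lang: Option<&str>, categories: &[&str]) -> bool {
    (ids.is_empty() || ids.contains(&task.id))
        && lang.map_or(true, |l| task.lang == l)
        && (categories.is_empty() || categories.contains(&task.category))
}

fn main() {
    let t = Task { id: 7, lang: "rust", category: "basics" };
    assert!(selected(&t, &[0, 7, 12], None, &[]));         // --tasks 0,7,12
    assert!(selected(&t, &[], Some("rust"), &["basics"])); // --lang rust --categories basics
    assert!(!selected(&t, &[], Some("csharp"), &[]));      // --lang csharp
}
```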
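Route resolution ("read env and build the active set") can be pictured as: split `provider:model`, look up the provider's key and base URL in the environment, and drop routes without credentials. The env-var naming and default URL below are assumptions:

```rust
// Sketch of env-driven route resolution; variable names like
// OPENAI_API_KEY / OPENAI_BASE_URL are illustrative guesses.
use std::env;

#[derive(Debug)]
struct Route {
    provider: String,
    model: String,
    api_key: String,
    base_url: String,
}

fn resolve_route(spec: &str) -> Option<Route> {
    // "openai:gpt-5" -> ("openai", "gpt-5")
    let (provider, model) = spec.split_once(':')?;
    let prefix = provider.to_uppercase();
    // A route without an API key is simply not part of the active set.
    let api_key = env::var(format!("{prefix}_API_KEY")).ok()?;
    let base_url = env::var(format!("{prefix}_BASE_URL"))
        .unwrap_or_else(|_| format!("https://api.{provider}.com/v1")); // made-up default
    Some(Route {
        provider: provider.to_string(),
        model: model.to_string(),
        api_key,
        base_url,
    })
}

fn main() {
    for spec in ["openai:gpt-5", "anthropic:claude-sonnet-4-5"] {
        match resolve_route(spec) {
            Some(r) => println!("active: {}:{} -> {}", r.provider, r.model, r.base_url),
            None => println!("skipped: {spec} (no API key in env)"),
        }
    }
}
```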
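The scorers are described as deterministic functions that yield a pass/fail result plus a short failure reason. A sketch of two of the named kinds, equality and JSON shape/count (function names are hypothetical):

```rust
// serde_json = "1"
use serde_json::Value;

/// Exact-equality check over trimmed output.
fn score_equality(expected: &str, actual: &str) -> Result<(), String> {
    if expected.trim() == actual.trim() {
        Ok(())
    } else {
        Err(format!("expected {expected:?}, got {actual:?}"))
    }
}

/// JSON shape/count check: same top-level type, and for arrays the
/// same number of elements.
fn score_json_shape(expected: &str, actual: &str) -> Result<(), String> {
    let exp: Value =
        serde_json::from_str(expected).map_err(|e| format!("bad golden JSON: {e}"))?;
    let act: Value =
        serde_json::from_str(actual).map_err(|e| format!("output is not valid JSON: {e}"))?;
    match (&exp, &act) {
        (Value::Array(e), Value::Array(a)) if e.len() != a.len() => Err(format!(
            "count mismatch: expected {}, got {}",
            e.len(),
            a.len()
        )),
        (e, a) if std::mem::discriminant(e) != std::mem::discriminant(a) => {
            Err("top-level JSON type mismatch".to_string())
        }
        _ => Ok(()),
    }
}

fn main() {
    assert!(score_equality("42", " 42 ").is_ok());
    assert!(score_json_shape("[1,2,3]", "[9,8,7]").is_ok()); // same shape and count
    assert!(score_json_shape("[1,2,3]", "[1]").is_err());    // count mismatch
}
```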
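The results file is a single JSON document with a stable schema covering task/route outcomes, timings, and summaries. One plausible shape, with every field name invented for illustration:

```rust
// serde = { version = "1", features = ["derive"] }, serde_json = "1"
use serde::Serialize;

#[derive(Serialize)]
struct TaskResult {
    task: u32,                      // global task number, e.g. 7
    route: String,                  // "provider:model", e.g. "openai:gpt-5"
    lang: String,                   // "rust" | "csharp"
    passed: bool,
    failure_reason: Option<String>, // short scorer message on failure
    duration_ms: u64,
}

#[derive(Serialize)]
struct Results {
    schema_version: u32, // bumped only on breaking format changes
    results: Vec<TaskResult>,
}

fn main() {
    let doc = Results {
        schema_version: 1,
        results: vec![TaskResult {
            task: 0,
            route: "openai:gpt-5".into(),
            lang: "rust".into(),
            passed: true,
            failure_reason: None,
            duration_ms: 1234,
        }],
    };
    println!("{}", serde_json::to_string_pretty(&doc).unwrap());
}
```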
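Finally, the rate-limit relief valves: `LLM_BENCH_CONCURRENCY` and `LLM_BENCH_ROUTE_CONCURRENCY` suggest an overall in-flight limit plus a per-route limit. A sketch of reading them with defaults (the default values here are made up):

```rust
use std::env;

/// Read a positive integer from the environment, falling back to a default.
fn env_usize(name: &str, default: usize) -> usize {
    env::var(name)
        .ok()
        .and_then(|v| v.parse().ok())
        .filter(|&n| n > 0)
        .unwrap_or(default)
}

fn main() {
    let tasks = env_usize("LLM_BENCH_CONCURRENCY", 4);       // overall limit (assumed default)
    let per_route = env_usize("LLM_BENCH_ROUTE_CONCURRENCY", 2); // per-route limit (assumed default)
    println!("running up to {tasks} tasks, {per_route} per route");
}
```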