# Description of Changes
Introduce a new **LLM benchmarking app** and supporting code.
* **CLI:** `llm` with subcommands `run`, `routes list`, `diff`, and
`ci-check` (see the command-surface sketch after this list).
* **Runner:** executes globally numbered tasks; filters by `--lang`,
`--categories`, `--tasks`, `--providers`, `--models`.
* **Providers/clients:** route layer (`provider:model`) with HTTP
clients for each LLM vendor; env-driven keys/base URLs.
* **Evaluation:** deterministic scorers (hash/equality, JSON
shape/count, light schema/reducer parity) with clear failure messages.
* **Results:** stable JSON schema; single-file HTML viewer to
inspect/filter/export CSV.
* **Build & guards:** build script for compile-time setup; Spacetime
guard module.
* **Docs:** `DEVELOP.md` includes `cargo llm …` usage.
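
As a quick orientation, here is a minimal sketch of what that command
surface could look like, assuming a clap-derive CLI; the struct, field,
and flag wiring below are illustrative, not the PR's actual code:

```rust
use clap::{Parser, Subcommand};

/// Hypothetical mirror of the `llm` command surface described above.
#[derive(Parser)]
#[command(name = "llm")]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Execute benchmark tasks.
    Run {
        /// Global task numbers, e.g. `--tasks 0,7,12`.
        #[arg(long, value_delimiter = ',')]
        tasks: Vec<u32>,
        /// Restrict to one language: `rust` or `csharp`.
        #[arg(long)]
        lang: Option<String>,
        /// Task categories, e.g. `--categories basics,schema`.
        #[arg(long, value_delimiter = ',')]
        categories: Vec<String>,
        /// Limit vendors, e.g. `--providers openai,anthropic`.
        #[arg(long, value_delimiter = ',')]
        providers: Vec<String>,
        /// Limit `provider:model` routes.
        #[arg(long)]
        models: Option<String>,
    },
    /// Inspect routes (`llm routes list`).
    Routes {
        #[command(subcommand)]
        action: RoutesAction,
    },
    /// Diff two results files.
    Diff,
    /// Gate used by CI (`llm ci-check`).
    CiCheck,
}

#[derive(Subcommand)]
enum RoutesAction {
    List,
}

fn main() {
    let _cli = Cli::parse();
}
```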
This PR is the initial addition of the app and its modules (runner,
config, routes, prompt/segmentation, scorers, schema/types,
defaults/constants/paths/hashing/combine, publishers, spacetime guard,
HTML stats viewer).
### How it works
1. **Pick what to run**
* Choose tasks (`--tasks 0,7,12`), or a language (`--lang rust|csharp`),
or categories (`--categories basics,schema`).
* Optionally limit vendors/models (`--providers …`, `--models …`).
2. **Resolve routes**
* Read env (API keys + base URLs) and build the active route set
(e.g., `openai:gpt-5`); see the route-resolution sketch after this
list.
3. **Build context**
* Start Spacetime.
* Publish the golden-answer modules.
* Prepare prompts and send them to each LLM.
* Attempt to publish the LLM-generated module.
4. **Execute calls**
* Run the selected tasks within each test against the selected models
and languages.
5. **Score outputs**
* Apply deterministic scorers (hash/equality, JSON shape/count, simple
schema/reducer checks); a scorer sketch follows this list.
* Record the score and a short failure reason when a check fails.
6. **Update results file**
* Write/update the single results JSON with task/route outcomes,
timings, and summaries (see the record sketch after this list).
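
To make step 2 concrete, here is a rough sketch of resolving a
`provider:model` spec against the environment; the
`PROVIDER_API_KEY` / `PROVIDER_BASE_URL` naming convention and all
identifiers are assumptions, not the PR's actual code:

```rust
/// Hypothetical resolved route; the real route layer lives in the routes module.
struct Route {
    provider: String,
    model: String,
    api_key: String,
    base_url: Option<String>,
}

/// Parse `provider:model` (e.g. `openai:gpt-5`) and activate it only if
/// the provider's API key is present in the environment (assumed convention).
fn resolve_route(spec: &str) -> Option<Route> {
    let (provider, model) = spec.split_once(':')?;
    let prefix = provider.to_uppercase();
    let api_key = std::env::var(format!("{prefix}_API_KEY")).ok()?;
    let base_url = std::env::var(format!("{prefix}_BASE_URL")).ok();
    Some(Route {
        provider: provider.to_string(),
        model: model.to_string(),
        api_key,
        base_url,
    })
}
```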
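For step 5, a minimal sketch of two deterministic scorers, exact-match
equality and JSON array count, assuming `serde_json`; the names and
failure strings are illustrative:

```rust
use serde_json::Value;

enum Score {
    Pass,
    Fail(String), // short, human-readable failure reason
}

/// Exact-match scorer: trims whitespace, then compares to the golden answer.
fn score_equality(expected: &str, actual: &str) -> Score {
    if expected.trim() == actual.trim() {
        Score::Pass
    } else {
        Score::Fail("output does not match golden answer".into())
    }
}

/// JSON shape/count scorer: output must be a JSON array of the expected length.
fn score_json_count(actual: &str, expected_len: usize) -> Score {
    match serde_json::from_str::<Value>(actual) {
        Ok(Value::Array(items)) if items.len() == expected_len => Score::Pass,
        Ok(Value::Array(items)) => {
            Score::Fail(format!("expected {expected_len} rows, got {}", items.len()))
        }
        Ok(_) => Score::Fail("expected a JSON array".into()),
        Err(e) => Score::Fail(format!("invalid JSON: {e}")),
    }
}
```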
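And for step 6, one entry in the results JSON might deserialize into
something like the struct below (assuming serde); the field names are
placeholders, the actual stable schema is defined in the schema/types
module:

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical shape of a single task/route outcome in the results file.
#[derive(Serialize, Deserialize)]
struct TaskResult {
    task: u32,               // global task number
    route: String,           // e.g. "openai:gpt-5"
    lang: String,            // "rust" or "csharp"
    score: f64,              // 0.0..=1.0
    failure: Option<String>, // short reason when a scorer fails
    duration_ms: u64,        // wall-clock timing for the call
}
```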
# API and ABI breaking changes
None. New application and modules; no existing public APIs/ABIs altered.
# Expected complexity level and risk
**4/5.** New CLI, routing, evaluation, and artifact format.
* External model APIs may rate-limit or time out; concurrency is
tunable via `LLM_BENCH_CONCURRENCY` / `LLM_BENCH_ROUTE_CONCURRENCY`
(sketched below).
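
For reference, the knobs could be read like this; the env var names
come from this PR, while the helper and the default values are
illustrative:

```rust
/// Read a usize from the environment, falling back to a default.
fn env_usize(name: &str, default: usize) -> usize {
    std::env::var(name)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

fn main() {
    // Upper bound on concurrent task executions (default is illustrative).
    let task_concurrency = env_usize("LLM_BENCH_CONCURRENCY", 4);
    // Upper bound on concurrent calls per provider:model route.
    let route_concurrency = env_usize("LLM_BENCH_ROUTE_CONCURRENCY", 2);
    println!("tasks={task_concurrency} routes={route_concurrency}");
}
```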
# Testing
I ran the full test matrix and generated results for every task against
every vendor, model, and language (Rust and C#). I also tested the CI
check locally using [act](https://github.com/nektos/act).
**Please verify**
* [ ] `llm run --tasks 0,1,2` (explicit `run`)
* [ ] `llm run --lang rust --categories basics` (filters)
* [ ] `llm run --categories basics,schema` (multiple categories)
* [ ] `llm run --lang csharp` (language switch)
* [ ] `llm run --providers openai,anthropic --models "openai:gpt-5 anthropic:claude-sonnet-4-5"` (provider/model limits)
* [ ] `llm run --hash-only` (hash integrity check only)
* [ ] `llm run --goldens-only` (test goldens only)
* [ ] `llm run --force` (skip hash check)
* [ ] `llm ci-check`
* [ ] Stats viewer loads the JSON; filtering and CSV export work
* [ ] CI works as intended
---------
Signed-off-by: bradleyshep <148254416+bradleyshep@users.noreply.github.com>
Signed-off-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Tyler Cloutier <cloutiertyler@aol.com>
Co-authored-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: spacetimedb-bot <spacetimedb-bot@users.noreply.github.com>
Co-authored-by: John Detter <4099508+jdetter@users.noreply.github.com>