bradleyshep b75bf6decf LLM Benchmarking (#3486)
# Description of Changes

Introduce a new **LLM benchmarking app** and supporting code.

* **CLI:** `llm` with subcommands `run`, `routes list`, `diff`,
`ci-check`.
* **Runner:** executes globally numbered tasks; filters by `--lang`,
`--categories`, `--tasks`, `--providers`, `--models`.
* **Providers/clients:** route layer (`provider:model`) with HTTP LLM
vendor clients; env-driven keys/base URLs.
* **Evaluation:** deterministic scorers (hash/equality, JSON
shape/count, light schema/reducer parity) with clear failure messages.
* **Results:** stable JSON schema; single-file HTML viewer to
inspect/filter/export CSV.
* **Build & guards:** build script for compile-time setup.
* **Docs:** `DEVELOP.md` includes `cargo llm …` usage.
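
The runner's filter flags can be pictured with a small sketch. This is not the app's code: the `Task` record, its field names, and the rule that all supplied filters must match (AND semantics) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Hypothetical task record; the field names are illustrative only.
    id: int
    lang: str
    category: str


def split_csv(value):
    """Split a comma-separated flag value such as "0,7,12" into trimmed parts."""
    return [part.strip() for part in value.split(",") if part.strip()]


def select_tasks(tasks, task_ids=None, lang=None, categories=None):
    """Keep tasks that satisfy every supplied filter (assumed AND semantics)."""
    wanted_ids = {int(p) for p in split_csv(task_ids)} if task_ids else None
    wanted_cats = set(split_csv(categories)) if categories else None
    selected = []
    for task in tasks:
        if wanted_ids is not None and task.id not in wanted_ids:
            continue
        if lang is not None and task.lang != lang:
            continue
        if wanted_cats is not None and task.category not in wanted_cats:
            continue
        selected.append(task)
    return selected


# Made-up task table used only to exercise the sketch.
TASKS = [
    Task(0, "rust", "basics"),
    Task(1, "csharp", "basics"),
    Task(2, "rust", "schema"),
    Task(7, "csharp", "schema"),
]
```

With this table, `select_tasks(TASKS, lang="rust", categories="basics")` keeps only task 0, and an unknown id in `task_ids` is silently ignored.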

This PR is the initial addition of the app and its modules (runner,
config, routes, prompt/segmentation, scorers, schema/types,
defaults/constants/paths/hashing/combine, publishers, spacetime guard,
HTML stats viewer).
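
To make "deterministic scorers with clear failure messages" concrete, here is a minimal sketch of two such checks. The function names, the `(passed, reason)` return shape, and the choice of SHA-256 are assumptions; the source does not specify how the app's scorers are implemented.

```python
import hashlib
import json


def score_hash_equal(expected, actual):
    """Compare SHA-256 digests of two texts; return (passed, reason)."""
    exp = hashlib.sha256(expected.encode("utf-8")).hexdigest()
    act = hashlib.sha256(actual.encode("utf-8")).hexdigest()
    if exp == act:
        return True, ""
    return False, f"hash mismatch: expected {exp[:8]}, got {act[:8]}"


def score_json_shape(expected_json, actual_json):
    """Check top-level type, dict keys, and list length; return (passed, reason)."""
    try:
        expected = json.loads(expected_json)
        actual = json.loads(actual_json)
    except json.JSONDecodeError as err:
        return False, f"invalid JSON: {err.msg}"
    if type(expected) is not type(actual):
        return False, (
            f"type mismatch: expected {type(expected).__name__}, "
            f"got {type(actual).__name__}"
        )
    if isinstance(expected, dict) and set(expected) != set(actual):
        missing = sorted(set(expected) - set(actual))
        extra = sorted(set(actual) - set(expected))
        return False, f"key mismatch: missing {missing}, unexpected {extra}"
    if isinstance(expected, list) and len(expected) != len(actual):
        return False, f"count mismatch: expected {len(expected)}, got {len(actual)}"
    return True, ""
```

Both functions depend only on their inputs, so the same golden and LLM outputs always yield the same score and the same short failure reason.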

### How it works
1. **Pick what to run**

   * Choose tasks (`--tasks 0,7,12`), or a language (`--lang rust|csharp`),
     or categories (`--categories basics,schema`).
   * Optionally limit vendors/models (`--providers …`, `--models …`).

2. **Resolve routes**

   * Read env (API keys + base URLs) and build the active set (e.g.,
     `openai:gpt-5`).

3. **Build context**

   * Start Spacetime
   * Publish golden answer modules
   * Prepare prompts and send them to the LLM
   * Attempt to publish LLM module

4. **Execute calls**

   * Run the selected tasks within each test against selected models and
     languages.

5. **Score outputs**

   * Apply deterministic scorers (hash/equality, JSON shape/count, simple
     schema/reducer checks).
   * Record the score and any short failure reason.

6. **Update results file**

   * Write/update the single results JSON with task/route outcomes,
     timings, and summaries.
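
Step 2 (resolving routes) can be sketched as follows. The `provider:model` format comes from the description above; the environment variable names in `KEY_ENV` and the rule "a route is active when its provider's key is set" are assumptions, not the app's documented behavior.

```python
import os


def parse_route(route):
    """Split a "provider:model" string into its two parts."""
    provider, sep, model = route.partition(":")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider:model', got {route!r}")
    return provider, model


# Hypothetical mapping from provider to the env var holding its API key.
KEY_ENV = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def active_routes(routes, env=None):
    """Return the routes whose provider has a non-empty API key in env."""
    env = os.environ if env is None else env
    active = []
    for route in routes:
        provider, _model = parse_route(route)
        key_var = KEY_ENV.get(provider)
        if key_var and env.get(key_var):
            active.append(route)
    return active
```

Passing `env` explicitly keeps the function testable; calling `active_routes(routes)` with no second argument reads the real process environment.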


# API and ABI breaking changes

None. New application and modules; no existing public APIs/ABIs altered.

# Expected complexity level and risk

**4/5.** New CLI, routing, evaluation, and artifact format.

* External model APIs may rate-limit/timeout; concurrency tunable via
`LLM_BENCH_CONCURRENCY` / `LLM_BENCH_ROUTE_CONCURRENCY`.
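
As an illustration of an env-tunable concurrency limit, the sketch below reads a positive integer from the environment and falls back to a default. Only the two variable names come from this description; the helper name, the fallback behavior, and the default values shown are hypothetical.

```python
import os


def read_concurrency(name, default, env=None):
    """Read a positive integer from env[name]; fall back to default otherwise."""
    env = os.environ if env is None else env
    raw = env.get(name, "").strip()
    if not raw:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value >= 1 else default


# Hypothetical defaults; the real values are not stated in the source.
overall_limit = read_concurrency("LLM_BENCH_CONCURRENCY", 4)
per_route_limit = read_concurrency("LLM_BENCH_ROUTE_CONCURRENCY", 2)
```

Lowering either value is the obvious first thing to try when a provider starts rate-limiting or timing out.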

# Testing

I ran the full test matrix and generated results for every task against
every vendor, model, and language (Rust + C#). I also tested the CI
check locally using [act](https://github.com/nektos/act).

**Please verify**

* [ ] `llm run --tasks 0,1,2` (explicit `run`)
* [ ] `llm run --lang rust --categories basics` (filters)
* [ ] `llm run --categories basics,schema` (multiple categories)
* [ ] `llm run --lang csharp` (language switch)
* [ ] `llm run --providers openai,anthropic --models "openai:gpt-5 anthropic:claude-sonnet-4-5"` (provider/model limits)
* [ ] `llm run --hash-only` (dry integrity)
* [ ] `llm run --goldens-only` (test goldens only)
* [ ] `llm run --force` (skip hash check)
* [ ] `llm ci-check`
* [ ] Stats viewer loads the JSON; filtering and CSV export work
* [ ] CI works as intended

---------

Signed-off-by: bradleyshep <148254416+bradleyshep@users.noreply.github.com>
Signed-off-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Tyler Cloutier <cloutiertyler@aol.com>
Co-authored-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: spacetimedb-bot <spacetimedb-bot@users.noreply.github.com>
Co-authored-by: John Detter <4099508+jdetter@users.noreply.github.com>
2026-01-06 22:22:57 +00:00

SpacetimeDB Documentation

This repository contains the Markdown files used to display documentation on our website. The documentation is built using Docusaurus.

Making Edits

To make changes to our docs, you can open a pull request in this repository. You can typically edit the files directly using the GitHub web interface, but you can also clone our repository and make your edits locally.

Instructions

  1. Fork our repository
  2. Clone your fork:
       git clone ssh://git@github.com/<username>/SpacetimeDB
       cd SpacetimeDB/docs
  3. Make your edits to the docs and test them locally (see Testing Locally)
  4. Commit your changes:
       git add .
       git commit -m "A specific description of the changes I made and why"
  5. Push your changes to your fork as a branch:
       git checkout -b a-branch-name-that-describes-my-change
       git push -u origin a-branch-name-that-describes-my-change
  6. Go to our GitHub repository and open a PR from the branch in your fork

CLI Reference Section

To regenerate the CLI reference section, run `pnpm generate-cli-docs`.

Docusaurus Documentation

For more information on how to use Docusaurus, see the Docusaurus documentation.

Testing Locally

Installation

  1. Make sure you have Node.js installed (version 22 or higher is recommended).
  2. Clone the repository and navigate to the docs directory.
  3. Install the dependencies: `pnpm install`
  4. Run the development server: `pnpm dev`. This starts a local server and opens a browser window. All changes you make to the Markdown files will be reflected live in the browser.

Adding new pages

All of our directory and file names are prefixed with a five-digit number which determines how they're sorted. We started with the hundreds place as the smallest significant digit, to allow using the tens and ones places to add new pages between. When adding a new page in between two existing pages, choose a number which:

  • Doesn't use any more significant figures than it needs to.
  • Is approximately halfway between the previous and next page.

For example, if you want to add a new page between 00300-foo and 00400-bar, name it 00350-baz. To add a new page between 00350-baz and 00400-bar, prefer 00370-quux or 00380-quux, rather than 00375-quux, to avoid populating the ones place.

To add a new page after all previous pages, use the smallest multiple of 100 larger than all other pages. For example, if the highest-numbered existing page is 01350-abc, create 01400-def.
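
The numbering rules above can be expressed as a short sketch. The function names are hypothetical, and breaking a tie toward the lower candidate (choosing 00370 over 00380) is an arbitrary choice; the rules allow either.

```python
def prefix_between(lower, upper):
    """Pick a prefix strictly between two pages, using the coarsest digit possible."""
    midpoint = (lower + upper) / 2
    # Try coarse steps first so the result has as few significant figures as possible.
    for step in (10000, 1000, 100, 10, 1):
        first = (lower // step + 1) * step  # smallest multiple of step above lower
        candidates = range(first, upper, step)
        if candidates:
            # Closest to the midpoint; ties resolve to the lower candidate.
            return min(candidates, key=lambda n: abs(n - midpoint))
    raise ValueError("no free prefix between these pages")


def prefix_after(existing):
    """Smallest multiple of 100 larger than every existing prefix."""
    return (max(existing) // 100 + 1) * 100


def page_name(prefix, slug):
    """Format a five-digit, zero-padded prefix followed by the page slug."""
    return f"{prefix:05d}-{slug}"
```

For the examples in the text, `page_name(prefix_between(300, 400), "baz")` gives `00350-baz`, `prefix_between(350, 400)` gives 370, and `prefix_after([300, 1350])` gives 1400.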

License

This documentation repository is licensed under Apache 2.0. See LICENSE.txt for more details.