SpacetimeDB/docs/DEVELOP.md

LLM Benchmarking (#3486)
bradleyshep · commit b75bf6decf · 2026-01-06 22:22:57 +00:00
# Description of Changes

Introduce a new **LLM benchmarking app** and supporting code.

* **CLI:** `llm` with subcommands `run`, `routes list`, `diff`,
`ci-check`.
* **Runner:** executes globally numbered tasks; filters by `--lang`,
`--categories`, `--tasks`, `--providers`, `--models`.
* **Providers/clients:** route layer (`provider:model`) with HTTP LLM
vendor clients; env-driven keys/base URLs.
* **Evaluation:** deterministic scorers (hash/equality, JSON
shape/count, light schema/reducer parity) with clear failure messages.
* **Results:** stable JSON schema; single-file HTML viewer to
inspect/filter/export CSV.
* **Build & guards:** build script for compile-time setup.
* **Docs:** `DEVELOP.md` includes `cargo llm …` usage.

This PR is the initial addition of the app and its modules (runner,
config, routes, prompt/segmentation, scorers, schema/types,
defaults/constants/paths/hashing/combine, publishers, spacetime guard,
HTML stats viewer).

### How it works
1. **Pick what to run**

   * Choose tasks (`--tasks 0,7,12`), a language (`--lang rust|csharp`), or categories (`--categories basics,schema`).
   * Optionally limit vendors/models (`--providers …`, `--models …`).

2. **Resolve routes**

* Read env (API keys + base URLs) and build the active set (e.g.,
`openai:gpt-5`).

3. **Build context**

   * Start Spacetime
   * Publish golden answer modules
   * Prepare prompts and send them to the LLM
   * Attempt to publish the LLM-generated module

4. **Execute calls**

   * Run each selected task against the selected models and languages.

5. **Score outputs**

* Apply deterministic scorers (hash/equality, JSON shape/count, simple
schema/reducer checks).
   * Record the score and any short failure reason.

6. **Update results file**

* Write/update the single results JSON with task/route outcomes,
timings, and summaries.
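
Concretely, the run reduces to a loop over (task, route) pairs with a deterministic scorer at the end. A minimal, self-contained sketch of that shape, with hypothetical `Task`/`Route` types standing in for the runner's real ones:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical stand-ins for the runner's real types.
struct Task { id: u32, golden: String }
struct Route { provider: String, model: String }

/// Deterministic "hash/equality" score: 1.0 on an exact match, 0.0 otherwise,
/// with a short failure reason suitable for the results file.
fn score_equality(golden: &str, candidate: &str) -> (f64, Option<String>) {
    if golden == candidate {
        (1.0, None)
    } else {
        fn h(s: &str) -> u64 {
            let mut d = DefaultHasher::new();
            s.hash(&mut d);
            d.finish()
        }
        (0.0, Some(format!("hash mismatch: golden={:016x} got={:016x}",
                           h(golden), h(candidate))))
    }
}

fn main() {
    let tasks = vec![Task { id: 0, golden: "fn foo() {}".to_string() }];
    let routes = vec![Route { provider: "openai".into(), model: "gpt-5".into() }];
    for task in &tasks {
        for route in &routes {
            // The real runner sends the task prompt to the provider here.
            let candidate = "fn foo() {}".to_string(); // placeholder model output
            let (score, reason) = score_equality(&task.golden, &candidate);
            println!("task {} via {}:{} -> score {score} ({reason:?})",
                     task.id, route.provider, route.model);
        }
    }
}
```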


# API and ABI breaking changes

None. New application and modules; no existing public APIs/ABIs altered.

# Expected complexity level and risk

**4/5.** New CLI, routing, evaluation, and artifact format.

* External model APIs may rate-limit/timeout; concurrency tunable via
`LLM_BENCH_CONCURRENCY` / `LLM_BENCH_ROUTE_CONCURRENCY`.

# Testing

I ran the full test matrix and generated results for every task against
every vendor, model, and language (Rust + C#). I also tested the CI
check locally using [act](https://github.com/nektos/act).

**Please verify**

* [ ] `llm run --tasks 0,1,2` (explicit `run`)
* [ ] `llm run --lang rust --categories basics` (filters)
* [ ] `llm run --categories basics,schema` (multiple categories)
* [ ] `llm run --lang csharp` (language switch)
* [ ] `llm run --providers openai,anthropic --models "openai:gpt-5
anthropic:claude-sonnet-4-5"` (provider/model limits)
* [ ] `llm run --hash-only` (dry integrity)
* [ ] `llm run --goldens-only` (test goldens only)
* [ ] `llm run --force` (skip hash check)
* [ ] `llm ci-check`
* [ ] Stats viewer loads the JSON; filtering and CSV export work
* [ ] CI works as intended

---------

Signed-off-by: bradleyshep <148254416+bradleyshep@users.noreply.github.com>
Signed-off-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Tyler Cloutier <cloutiertyler@aol.com>
Co-authored-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: spacetimedb-bot <spacetimedb-bot@users.noreply.github.com>
Co-authored-by: John Detter <4099508+jdetter@users.noreply.github.com>


DEVELOP.md

This document explains how to configure the environment, run the LLM benchmark tool, and work with the benchmark suite.


Table of Contents

  1. Quick Checks & Fixes
  2. Environment Variables
  3. Benchmark Suite
  4. Troubleshooting

Quick Checks & Fixes

Use this single command to quickly unblock CI by regenerating hashes and running only GPT-5 for the minimal Rust + C# passes. This is not the full benchmark suite.

cargo llm ci-quickfix

What this does:

  1. Runs Rust rustdoc_json pass for GPT-5 only.
  2. Runs C# docs pass for GPT-5 only.
  3. Writes updated results & summary.

Model IDs passed to --models must match configured routes (see model_routes.rs), e.g. "openai:gpt-5".
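
For illustration, splitting a route string on its first colon is enough to recover the (provider, model) pair; a sketch only, not the actual model_routes.rs code:

```rust
/// Split a route string like "openai:gpt-5" into (provider, model).
/// Sketch only; the real parsing/validation lives in model_routes.rs.
fn parse_route(route: &str) -> Option<(&str, &str)> {
    let (provider, model) = route.split_once(':')?;
    if provider.is_empty() || model.is_empty() {
        return None;
    }
    Some((provider, model))
}

fn main() {
    assert_eq!(parse_route("openai:gpt-5"), Some(("openai", "gpt-5")));
    assert_eq!(parse_route("gpt-5"), None); // missing provider prefix
}
```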

Spacetime CLI

Publishing is performed via the spacetime CLI (spacetime publish -c -y --server <name> <db>). Ensure:

  • spacetime is on PATH
  • The target server is reachable/running

Environment Variables

These are the defaults and/or recommended dev values.

| Name | Purpose | Values / Example | Required |
| --- | --- | --- | --- |
| SPACETIME_SERVER | Target SpacetimeDB environment | local | |
| LLM_DEBUG | Print short debug info while generating | true / false (default true in dev) | |
| LLM_DEBUG_VERBOSE | Extra-verbose logs (payloads, scoring detail) | false | |
| LLM_BENCH_CONCURRENCY | Parallel task concurrency across the whole bench run | 20 | |
| LLM_BENCH_ROUTE_CONCURRENCY | Per-route concurrency (throttle per vendor/model) | 4 | |
| OPENAI_API_KEY | OpenAI credential | sk-... | optional* |
| OPENAI_BASE_URL | OpenAI-compatible base URL override | https://api.openai.com/ | optional |
| ANTHROPIC_API_KEY | Anthropic credential | ... | optional* |
| ANTHROPIC_BASE_URL | Anthropic base URL override | https://api.anthropic.com | optional |
| GOOGLE_API_KEY | Gemini credential | ... | optional* |
| GOOGLE_BASE_URL | Gemini base URL override | https://generativelanguage.googleapis.com | optional |
| XAI_API_KEY | xAI Grok credential | ... | optional |
| DEEPSEEK_API_KEY | DeepSeek credential | ... | optional |
| META_API_KEY | Meta Llama credential | ... | optional* |

*Required only if you plan to run that provider locally.
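
The two LLM_BENCH_* knobs above are read straight from the environment; a sketch of the fallback-to-default pattern (illustrative, not the repo's actual loader):

```rust
use std::env;

/// Read a numeric env var, falling back to a default when unset or unparsable.
fn env_usize(name: &str, default: usize) -> usize {
    env::var(name).ok().and_then(|v| v.parse().ok()).unwrap_or(default)
}

fn main() {
    let bench = env_usize("LLM_BENCH_CONCURRENCY", 20);          // whole-run parallelism
    let per_route = env_usize("LLM_BENCH_ROUTE_CONCURRENCY", 4); // per vendor/model throttle
    println!("bench={bench} per_route={per_route}");
}
```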

Canonical dev block (copy/paste into your shell profile):

OPENAI_API_KEY=
OPENAI_BASE_URL=https://api.openai.com/

ANTHROPIC_API_KEY=
ANTHROPIC_BASE_URL=https://api.anthropic.com

GOOGLE_API_KEY=
GOOGLE_BASE_URL=https://generativelanguage.googleapis.com

XAI_API_KEY=
XAI_BASE_URL=https://api.x.ai

DEEPSEEK_API_KEY=
DEEPSEEK_BASE_URL=https://api.deepseek.com

META_API_KEY=
META_BASE_URL=https://openrouter.ai/api/v1

SPACETIME_SERVER="local"
LLM_DEBUG=true
LLM_DEBUG_VERBOSE=false
LLM_BENCH_CONCURRENCY=20
LLM_BENCH_ROUTE_CONCURRENCY=4

Windows PowerShell:

$env:SPACETIME_SERVER="local"
$env:LLM_DEBUG="true"
$env:LLM_DEBUG_VERBOSE="false"
$env:LLM_BENCH_CONCURRENCY="20"
$env:LLM_BENCH_ROUTE_CONCURRENCY="4"

LLM Providers — Keys & Base URLs

Notes

  • These match the providers wired in this repo (OpenAiClient, AnthropicClient, GoogleGeminiClient, XaiGrokClient, DeepSeekClient, MetaLlamaClient).
| Provider | API Key Env | Base URL Env (optional) | Default Base URL |
| --- | --- | --- | --- |
| OpenAI | OPENAI_API_KEY | OPENAI_BASE_URL | https://api.openai.com |
| Anthropic | ANTHROPIC_API_KEY | ANTHROPIC_BASE_URL | https://api.anthropic.com |
| Google Gemini | GOOGLE_API_KEY | GOOGLE_BASE_URL | https://generativelanguage.googleapis.com |
| xAI Grok | XAI_API_KEY | XAI_BASE_URL | https://api.x.ai |
| DeepSeek | DEEPSEEK_API_KEY | DEEPSEEK_BASE_URL | https://api.deepseek.com |
| Meta Llama | META_API_KEY | META_BASE_URL | https://openrouter.ai/api/v1 |

Benchmark Suite

Results directory: docs/llms

Result Files

There are two sets of result files, each serving a different purpose:

| Files | Purpose | Updated By |
| --- | --- | --- |
| docs-benchmark-details.json, docs-benchmark-summary.json | Test documentation quality with a single reference model (GPT-5) | cargo llm ci-quickfix |
| llm-comparison-details.json, llm-comparison-summary.json | Compare all LLMs against the same documentation | cargo llm run |

  • docs-benchmark: Used by CI to ensure documentation quality. Contains only GPT-5 results.
  • llm-comparison: Used for manual benchmark runs to compare LLM performance. Contains results from all configured models.

Result writes are lock-safe and atomic: the tool takes an exclusive lock, writes to a temp file, then renames it into place, so concurrent runs won't corrupt results.
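
A sketch of that write path (exclusive lock, temp file, atomic rename); this version uses the fs2 crate for the advisory lock and illustrative names, which may differ from the tool's actual implementation:

```rust
use fs2::FileExt; // advisory file locks (external crate)
use std::fs::{self, File, OpenOptions};
use std::io::Write;

/// Write `json` to `path` without ever exposing a half-written file:
/// hold an exclusive lock on a sidecar lock file, write a temp file,
/// then atomically rename it over the destination.
fn write_results(path: &str, json: &str) -> std::io::Result<()> {
    let lock = OpenOptions::new()
        .create(true)
        .write(true)
        .open(format!("{path}.lock"))?;
    lock.lock_exclusive()?; // blocks until concurrent runs release it

    let tmp = format!("{path}.tmp");
    let mut f = File::create(&tmp)?;
    f.write_all(json.as_bytes())?;
    f.sync_all()?;           // flush before the rename makes it visible
    fs::rename(&tmp, path)?; // atomic on the same filesystem

    lock.unlock()
}

fn main() -> std::io::Result<()> {
    write_results("llm-comparison-summary.json", "{\"ok\":true}")
}
```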

Open llm_benchmark_stats_viewer.html in a browser to inspect merged results locally.

Current Benchmarks

basics

  0. empty-reducers — tests whether it can create basic reducers with various arguments
  1. basic-tables — can it create tables with basic columns
  2. scheduled-table — can it create a scheduled table and reducer
  3. struct-in-table — can it put a struct in a table
  4. insert — can it insert a row
  5. update — can it update a row
  6. delete — can it delete a row
  7. crud — can it insert, update, and delete a row in the same reducer
  8. index-lookup — can it look up something from an index
  9. init — can it write the init reducer
  10. connect — can it write the client_connected/client_disconnected reducers
  11. helper-function — can it create a non-reducer helper function

schema

  12. spacetime-product-type — can it define a new spacetime product type
  13. spacetime-sum-type — can it define a new sum type
  14. elementary-columns — can it create columns with basic types
  15. product-type-columns — can it create columns with product types
  16. sum-type-columns — can it create columns with sum types
  17. scheduled — can it create scheduled columns
  18. constraints — can it add primary keys, unique constraints, and indexes
  19. many-to-many — can it create a many-to-many relationship
  20. ecs — can it create a basic ECS
  21. multi-column-index — can it create a multi-column index
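
To give a feel for what a golden answer covers, here is a rough Rust module in the spirit of basic-tables and insert (tasks 1 and 4). Treat it as an illustration of the shape, not the repo's actual answer files; the module API shown is the current public SpacetimeDB Rust API, but check answers/ for the authoritative style:

```rust
use spacetimedb::{reducer, table, ReducerContext, Table};

// A table with basic columns (in the spirit of task 1, basic-tables).
#[table(name = person)]
pub struct Person {
    #[primary_key]
    #[auto_inc]
    id: u64,
    name: String,
    age: u32,
}

// Insert a row (in the spirit of task 4, insert).
#[reducer]
pub fn add_person(ctx: &ReducerContext, name: String, age: u32) {
    // id = 0 lets #[auto_inc] assign the real value on insert.
    ctx.db.person().insert(Person { id: 0, name, age });
}
```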

Benchmarks live under benchmarks/ with structure like:

benchmarks/
  category/
    t_001_foo/
      tasks/
        rust.txt
        csharp.txt
      answers/
        rust.rs
        csharp.cs
      spec.rs          # scoring config, reducer/schema checks, etc.

Creating a new benchmark

  1. Copy an existing benchmark
  • Duplicate any existing benchmark folder.
  • Bump the numeric prefix to a new, unused ID: t_123_my_task.
  2. Rename for the new task
  • Rename the folder to your ID + short slug: t_123_my_task.
  3. Write the task prompt
  • Create/update tasks/rust.txt and/or tasks/csharp.txt.
  • Be explicit (tables, reducers, helpers, constraints). Avoid ambiguity.
  4. Add golden answers
  • Implement the canonical solution in answers/rust.rs and/or answers/csharp.cs.
  5. Define scoring
  • Edit spec.rs to add scorers (e.g., schema/table/field checks, reducer/func exists); see the sketch after this list.
  6. Quick validation
  • Build goldens only:
    cargo llm run --goldens-only --tasks t_123_my_task
  7. Categorize
  • Ensure the folder sits under the right category path.
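
As a purely hypothetical illustration of the kind of checks a spec declares (the `Check`/`Spec` names below are made up; the real API is whatever spec.rs defines):

```rust
// Hypothetical shapes only; the real spec.rs types live in this repo.
enum Check {
    TableExists(&'static str),
    FieldExists { table: &'static str, field: &'static str },
    ReducerExists(&'static str),
}

struct Spec {
    checks: Vec<Check>,
}

fn describe(c: &Check) -> String {
    match c {
        Check::TableExists(t) => format!("table `{t}` exists"),
        Check::FieldExists { table, field } => format!("field `{table}.{field}` exists"),
        Check::ReducerExists(r) => format!("reducer `{r}` exists"),
    }
}

fn main() {
    let spec = Spec {
        checks: vec![
            Check::TableExists("person"),
            Check::FieldExists { table: "person", field: "age" },
            Check::ReducerExists("add_person"),
        ],
    };
    for c in &spec.checks {
        println!("check: {}", describe(c));
    }
}
```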

Typical Commands

# Run everything with current env (providers/models from your .env)
cargo llm run

# Only Rust (or C#)
cargo llm run --lang rust
cargo llm run --lang csharp

# Only certain categories (use your actual category names)
cargo llm run --categories basics,schema

# Only certain tasks by number (globally numbered)
cargo llm run --tasks 0,7,12

# Limit providers/models explicitly
cargo llm run \
  --providers openai,anthropic \
  --models "openai:gpt-5 anthropic:claude-sonnet-4-5"

# Dry runs
cargo llm run --hash-only         # build context only (no provider calls)
cargo llm run --goldens-only      # build/check goldens only

# Be aggressive (skip some safety checks)
cargo llm run --force

# CI sanity check per language
cargo llm ci-check --lang rust
cargo llm ci-check --lang csharp

Outputs:

  • Logs to stdout/stderr (respecting LLM_DEBUG/LLM_DEBUG_VERBOSE).
  • JSON results in a per-run folder (timestamped), merged into aggregate reports.

Troubleshooting

HTTP 400/404 from providers

  • Check the model ID spelling and whether it's available for your account/region.
  • Verify the correct base URL for non-default gateways.

Timeouts / Rate-limits

  • Lower LLM_BENCH_CONCURRENCY or LLM_BENCH_ROUTE_CONCURRENCY.
  • Some providers aggressively throttle bursts; use backoff/retry when supported.
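
Where retries are an option, a capped exponential backoff usually smooths out burst throttling; a generic sketch, not tied to this repo's clients:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry `op` up to `max_attempts` times, doubling the delay after each
/// failure (capped at 30s). Generic sketch; real clients should also
/// honor Retry-After headers where the provider sends them.
fn with_backoff<T, E>(max_attempts: u32, mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut delay = Duration::from_millis(500);
    let mut attempt = 1;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt >= max_attempts => return Err(e),
            Err(_) => {
                sleep(delay);
                delay = (delay * 2).min(Duration::from_secs(30));
                attempt += 1;
            }
        }
    }
}

fn main() {
    let mut tries = 0;
    // Simulated flaky call: fails twice, then succeeds.
    let res: Result<u32, &str> = with_backoff(5, || {
        tries += 1;
        if tries < 3 { Err("rate limited") } else { Ok(tries) }
    });
    println!("{res:?}");
}
```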