# Description of Changes

Introduce a new **LLM benchmarking app** and supporting code.

* **CLI:** `llm` with subcommands `run`, `routes list`, `diff`, `ci-check`.
* **Runner:** executes globally numbered tasks; filters by `--lang`, `--categories`, `--tasks`, `--providers`, `--models`.
* **Providers/clients:** route layer (`provider:model`) with HTTP LLM vendor clients; env-driven keys/base URLs.
* **Evaluation:** deterministic scorers (hash/equality, JSON shape/count, light schema/reducer parity) with clear failure messages.
* **Results:** stable JSON schema; single-file HTML viewer to inspect/filter/export CSV.
* **Build & guards:** build script for compile-time setup.
* **Docs:** `DEVELOP.md` includes `cargo llm …` usage.

This PR is the initial addition of the app and its modules (runner, config, routes, prompt/segmentation, scorers, schema/types, defaults/constants/paths/hashing/combine, publishers, spacetime guard, HTML stats viewer).

### How it works

1. **Pick what to run**
   * Choose tasks (`--tasks 0,7,12`), or a language (`--lang rust|csharp`), or categories (`--categories basics,schema`).
   * Optionally limit vendors/models (`--providers …`, `--models …`).
2. **Resolve routes**
   * Read env (API keys + base URLs) and build the active set (e.g., `openai:gpt-5`).
3. **Build context**
   * Start Spacetime.
   * Publish golden answer modules.
   * Prepare prompts and send them to the LLM model.
   * Attempt to publish the LLM-generated module.
4. **Execute calls**
   * Run the selected tasks within each test against the selected models and languages.
5. **Score outputs**
   * Apply deterministic scorers (hash/equality, JSON shape/count, simple schema/reducer checks).
   * Record the score and any short failure reason.
6. **Update results file**
   * Write/update the single results JSON with task/route outcomes, timings, and summaries.

# API and ABI breaking changes

None. New application and modules; no existing public APIs/ABIs altered.

# Expected complexity level and risk

**4/5.** New CLI, routing, evaluation, and artifact format.

* External model APIs may rate-limit/timeout; concurrency is tunable via `LLM_BENCH_CONCURRENCY` / `LLM_BENCH_ROUTE_CONCURRENCY`.

# Testing

I ran the full test matrix and generated results for every task against every vendor, model, and language (Rust + C#). I also tested the CI check locally using [act](https://github.com/nektos/act).

**Please verify**

* [ ] `llm run --tasks 0,1,2` (explicit `run`)
* [ ] `llm run --lang rust --categories basics` (filters)
* [ ] `llm run --categories basics,schema` (multiple categories)
* [ ] `llm run --lang csharp` (language switch)
* [ ] `llm run --providers openai,anthropic --models "openai:gpt-5 anthropic:claude-sonnet-4-5"` (provider/model limits)
* [ ] `llm run --hash-only` (dry integrity check)
* [ ] `llm run --goldens-only` (test goldens only)
* [ ] `llm run --force` (skip hash check)
* [ ] `llm ci-check`
* [ ] Stats viewer loads the JSON; filtering and CSV export work
* [ ] CI works as intended

---------

Signed-off-by: bradleyshep <148254416+bradleyshep@users.noreply.github.com>
Signed-off-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Tyler Cloutier <cloutiertyler@aol.com>
Co-authored-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: spacetimedb-bot <spacetimedb-bot@users.noreply.github.com>
Co-authored-by: John Detter <4099508+jdetter@users.noreply.github.com>
# DEVELOP.md

This document explains how to configure the environment, run the LLM benchmark tool, and work with the benchmark suite.

## Table of Contents

- Quick Checks & Fixes
- Spacetime CLI
- Environment Variables
- LLM Providers — Keys & Base URLs
- Benchmark Suite
- Typical Commands
- Troubleshooting
## Quick Checks & Fixes

Use this single command to quickly unblock CI by regenerating hashes and running only GPT-5 for the minimal Rust + C# passes. This is not the full benchmark suite.

```bash
cargo llm ci-quickfix
```
What this does:

- Runs the Rust `rustdoc_json` pass for GPT-5 only.
- Runs the C# docs pass for GPT-5 only.
- Writes updated results & summary.
> Model IDs passed to `--models` must match configured routes (see `model_routes.rs`), e.g. `"openai:gpt-5"`.
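For intuition, a route string splits at the first colon into a provider and a model. A minimal sketch (hypothetical helper, not the actual code in `model_routes.rs`):

```rust
/// Hypothetical helper: split "openai:gpt-5" into ("openai", "gpt-5").
fn parse_route(route: &str) -> Option<(&str, &str)> {
    route.split_once(':')
}

fn main() {
    assert_eq!(parse_route("openai:gpt-5"), Some(("openai", "gpt-5")));
    assert_eq!(parse_route("gpt-5"), None); // missing provider prefix
}
```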
## Spacetime CLI

Publishing is performed via the spacetime CLI (`spacetime publish -c -y --server <name> <db>`). Ensure:

- `spacetime` is on `PATH`
- The target server is reachable/running
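As an illustration (not the app's actual publisher code), shelling out to the CLI from Rust might look like:

```rust
use std::process::Command;

/// Illustrative sketch: publish a module via the spacetime CLI.
/// `server` and `db` are placeholders for your server name and database.
fn spacetime_publish(server: &str, db: &str) -> std::io::Result<bool> {
    let status = Command::new("spacetime")
        .args(["publish", "-c", "-y", "--server", server, db])
        .status()?; // errors here if `spacetime` is not on PATH
    Ok(status.success())
}
```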
## Environment Variables

These are the defaults and/or recommended dev values.

| Name | Purpose | Values / Example | Required |
|---|---|---|---|
| `SPACETIME_SERVER` | Target SpacetimeDB environment | `local` | ✅ |
| `LLM_DEBUG` | Print short debug info while generating | `true` / `false` (default `true` in dev) | ✅ |
| `LLM_DEBUG_VERBOSE` | Extra-verbose logs (payloads, scoring detail) | `false` | ✅ |
| `LLM_BENCH_CONCURRENCY` | Parallel task concurrency across the whole bench run | `20` | ✅ |
| `LLM_BENCH_ROUTE_CONCURRENCY` | Per-route concurrency (throttle per vendor/model) | `4` | ✅ |
| `OPENAI_API_KEY` | OpenAI credential | `sk-...` | optional* |
| `OPENAI_BASE_URL` | OpenAI-compatible base URL override | `https://api.openai.com/` | optional |
| `ANTHROPIC_API_KEY` | Anthropic credential | `...` | optional* |
| `ANTHROPIC_BASE_URL` | Anthropic base URL override | `https://api.anthropic.com` | optional |
| `GOOGLE_API_KEY` | Gemini credential | `...` | optional* |
| `GOOGLE_BASE_URL` | Gemini base URL override | `https://generativelanguage.googleapis.com` | optional |
| `XAI_API_KEY` | xAI Grok credential | `...` | optional* |
| `DEEPSEEK_API_KEY` | DeepSeek credential | `...` | optional* |
| `META_API_KEY` | Meta Llama credential | `...` | optional* |

\*Required only if you plan to run that provider locally.
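A minimal sketch of reading the concurrency knobs with defaults (a hypothetical helper, not the tool's actual config code; defaults mirror the recommended dev values above):

```rust
use std::env;

/// Hypothetical helper: parse a numeric env var, falling back to a default.
fn env_usize(name: &str, default: usize) -> usize {
    env::var(name)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

fn main() {
    let bench = env_usize("LLM_BENCH_CONCURRENCY", 20);
    let per_route = env_usize("LLM_BENCH_ROUTE_CONCURRENCY", 4);
    println!("concurrency: bench={bench}, per-route={per_route}");
}
```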
Canonical dev block (copy/paste into your shell profile):

```bash
OPENAI_API_KEY=
OPENAI_BASE_URL=https://api.openai.com/
ANTHROPIC_API_KEY=
ANTHROPIC_BASE_URL=https://api.anthropic.com
GOOGLE_API_KEY=
GOOGLE_BASE_URL=https://generativelanguage.googleapis.com
XAI_API_KEY=
XAI_BASE_URL=https://api.x.ai
DEEPSEEK_API_KEY=
DEEPSEEK_BASE_URL=https://api.deepseek.com
META_API_KEY=
META_BASE_URL=https://openrouter.ai/api/v1
SPACETIME_SERVER="local"
LLM_DEBUG=true
LLM_DEBUG_VERBOSE=false
LLM_BENCH_CONCURRENCY=20
LLM_BENCH_ROUTE_CONCURRENCY=4
```
Windows PowerShell:

```powershell
$env:SPACETIME_SERVER="local"
$env:LLM_DEBUG="true"
$env:LLM_DEBUG_VERBOSE="false"
$env:LLM_BENCH_CONCURRENCY="20"
$env:LLM_BENCH_ROUTE_CONCURRENCY="4"
```
## LLM Providers — Keys & Base URLs

Notes:

- These match the providers wired in this repo (`OpenAiClient`, `AnthropicClient`, `GoogleGeminiClient`, `XaiGrokClient`, `DeepSeekClient`, `MetaLlamaClient`).

| Provider | API Key Env | Base URL Env (optional) | Default Base URL |
|---|---|---|---|
| OpenAI | `OPENAI_API_KEY` | `OPENAI_BASE_URL` | `https://api.openai.com` |
| Anthropic | `ANTHROPIC_API_KEY` | `ANTHROPIC_BASE_URL` | `https://api.anthropic.com` |
| Google Gemini | `GOOGLE_API_KEY` | `GOOGLE_BASE_URL` | `https://generativelanguage.googleapis.com` |
| xAI Grok | `XAI_API_KEY` | `XAI_BASE_URL` | `https://api.x.ai` |
| DeepSeek | `DEEPSEEK_API_KEY` | `DEEPSEEK_BASE_URL` | `https://api.deepseek.com` |
| Meta Llama | `META_API_KEY` | `META_BASE_URL` | `https://openrouter.ai/api/v1` |
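For illustration, resolving a base URL with the env override might look like the following (hypothetical helper, not the clients' actual code):

```rust
use std::env;

/// Hypothetical helper: prefer the env override, else the provider's default base URL.
fn base_url(override_var: &str, default: &str) -> String {
    env::var(override_var).unwrap_or_else(|_| default.to_string())
}

fn main() {
    let openai = base_url("OPENAI_BASE_URL", "https://api.openai.com");
    println!("OpenAI endpoint: {openai}");
}
```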
## Benchmark Suite

Results directory: `docs/llms`

### Result Files

There are two sets of result files, each serving a different purpose:

| Files | Purpose | Updated By |
|---|---|---|
| `docs-benchmark-details.json`, `docs-benchmark-summary.json` | Test documentation quality with a single reference model (GPT-5) | `cargo llm ci-quickfix` |
| `llm-comparison-details.json`, `llm-comparison-summary.json` | Compare all LLMs against the same documentation | `cargo llm run` |

- **docs-benchmark**: Used by CI to ensure documentation quality. Contains only GPT-5 results.
- **llm-comparison**: Used for manual benchmark runs to compare LLM performance. Contains results from all configured models.
Result writes are lock-safe and atomic: the tool takes an exclusive lock, writes via a temp file, then renames it, so concurrent runs won't corrupt results.
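The pattern is roughly the following (a simplified sketch that omits the locking step; not the tool's actual code):

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

/// Simplified sketch of the write-temp-then-rename pattern.
fn write_atomic(path: &Path, contents: &str) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut file = fs::File::create(&tmp)?;
    file.write_all(contents.as_bytes())?;
    file.sync_all()?;          // flush data to disk before renaming
    fs::rename(&tmp, path)?;   // rename is atomic on the same filesystem
    Ok(())
}
```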
Open `llm_benchmark_stats_viewer.html` in a browser to inspect merged results locally.
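Conceptually, each detail entry in the results JSON carries the task, route, language, score, timing, and any failure reason. A hypothetical shape (field names are invented for illustration; the real schema is defined by the app's types module):

```rust
/// Hypothetical shape of one result entry, for orientation only.
struct ResultEntry {
    task: u32,                // global task number, e.g. 0
    route: String,            // "provider:model", e.g. "openai:gpt-5"
    lang: String,             // "rust" or "csharp"
    score: f64,               // deterministic scorer outcome
    duration_ms: u64,         // wall-clock timing for the call
    failure: Option<String>,  // short failure reason, if any
}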
### Current Benchmarks

**basics**

- 000. empty-reducers — tests whether it can create basic reducers with various arguments
- 001. basic-tables — can it create tables with basic columns
- 002. scheduled-table — can it create a scheduled table and reducer (see the sketch after this list)
- 003. struct-in-table — can it put a struct in a table
- 004. insert — can it insert a row
- 005. update — can it update a row
- 006. delete — can it delete a row
- 007. crud — can it insert, update, and delete a row in the same reducer
- 008. index-lookup — can it look up something from an index
- 009. init — can it write the init reducer
- 010. connect — can it write the client_connected/client_disconnected reducers
- 011. helper-function — can it create a non-reducer helper function

**schema**

- 012. spacetime-product-type — can it define a new spacetime product type
- 013. spacetime-sum-type — can it define a new sum type
- 014. elementary-columns — can it create columns with basic types
- 015. product-type-columns — can it create columns with product types
- 016. sum-type-columns — can it create columns with sum types
- 017. scheduled — can it create scheduled columns
- 018. constraints — can it add primary keys, unique constraints, and indexes
- 019. many-to-many — can it create a many-to-many relationship
- 020. ecs — can it create a basic ecs
- 021. multi-column-index — can it create a multi-column index
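For example, the scheduled-table task asks for something like the following (a minimal sketch using the SpacetimeDB Rust module API; table and reducer names are made up):

```rust
use spacetimedb::{reducer, table, ReducerContext, ScheduleAt};

// Hypothetical scheduled table: rows here tell SpacetimeDB when to invoke `tick`.
#[table(name = tick_schedule, scheduled(tick))]
pub struct TickSchedule {
    #[primary_key]
    #[auto_inc]
    scheduled_id: u64,
    scheduled_at: ScheduleAt,
}

// The scheduled reducer receives the schedule row that triggered it.
#[reducer]
pub fn tick(_ctx: &ReducerContext, _row: TickSchedule) {
    // periodic work would go here
}
```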
Benchmarks live under `benchmarks/` with structure like:

```text
benchmarks/
  category/
    t_001_foo/
      tasks/
        rust.txt
        csharp.txt
      answers/
        rust.rs
        csharp.cs
      spec.rs   # scoring config, reducer/schema checks, etc.
```
### Creating a new benchmark

1. **Copy an existing benchmark**
   - Duplicate any existing benchmark folder.
   - Bump the numeric prefix to a new, unused ID: `t_123_my_task`.
2. **Rename for the new task**
   - Rename the folder to your ID + short slug: `t_123_my_task`.
3. **Write the task prompt**
   - Create/update `tasks/rust.txt` and/or `tasks/csharp.txt`.
   - Be explicit (tables, reducers, helpers, constraints). Avoid ambiguity.
4. **Add golden answers**
   - Implement the canonical solution in `answers/rust.rs` and/or `answers/csharp.cs` (see the sketch after these steps).
5. **Define scoring**
   - Edit `spec.rs` to add scorers (e.g., schema/table/field checks, reducer/function existence).
6. **Quick validation**
   - Build goldens only: `cargo llm run --goldens-only --tasks t_123_my_task`
7. **Categorize**
   - Ensure the folder sits under the right category path.
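For step 4, a minimal golden answer in `answers/rust.rs` might look like this (purely illustrative; the table and reducer names are invented, and the real answer depends on the task prompt):

```rust
use spacetimedb::{reducer, table, ReducerContext, Table};

// Hypothetical table for an invented task.
#[table(name = person)]
pub struct Person {
    name: String,
}

// Hypothetical reducer inserting one row.
#[reducer]
pub fn add_person(ctx: &ReducerContext, name: String) {
    ctx.db.person().insert(Person { name });
}
```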
## Typical Commands

```bash
# Run everything with current env (providers/models from your .env)
cargo llm run

# Only Rust (or C#)
cargo llm run --lang rust
cargo llm run --lang csharp

# Only certain categories (use your actual category names)
cargo llm run --categories basics,schema

# Only certain tasks by number (globally numbered)
cargo llm run --tasks 0,7,12

# Limit providers/models explicitly
cargo llm run \
  --providers openai,anthropic \
  --models "openai:gpt-5 anthropic:claude-sonnet-4-5"

# Dry runs
cargo llm run --hash-only     # build context only (no provider calls)
cargo llm run --goldens-only  # build/check goldens only

# Be aggressive (skip some safety checks)
cargo llm run --force

# CI sanity check per language
cargo llm ci-check --lang rust
cargo llm ci-check --lang csharp
```
Outputs:

- Logs to stdout/stderr (respecting `LLM_DEBUG` / `LLM_DEBUG_VERBOSE`).
- JSON results in a per-run folder (timestamped), merged into aggregate reports.
## Troubleshooting

**HTTP 400/404 from providers**

- Check the model ID spelling and whether it's available for your account/region.
- Verify the correct base URL for non-default gateways.

**Timeouts / Rate-limits**

- Lower `LLM_BENCH_CONCURRENCY` or `LLM_BENCH_ROUTE_CONCURRENCY`.
- Some providers aggressively throttle bursts; use backoff/retry when supported (a sketch follows).
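A minimal retry-with-exponential-backoff sketch (illustrative only; whether and how the tool retries is provider-dependent):

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper: retry a fallible call, doubling the delay between attempts.
fn with_backoff<T, E>(mut call: impl FnMut() -> Result<T, E>, max_tries: u32) -> Result<T, E> {
    let mut delay = Duration::from_millis(500);
    for attempt in 1..=max_tries {
        match call() {
            Ok(v) => return Ok(v),
            Err(e) if attempt == max_tries => return Err(e), // out of retries
            Err(_) => {
                sleep(delay);
                delay *= 2; // exponential backoff: 0.5s, 1s, 2s, ...
            }
        }
    }
    unreachable!("max_tries must be at least 1")
}
```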