SpacetimeDB/docs/DEVELOP.md

LLM Benchmarking (#3486)
bradleyshep · commit b75bf6decf · 2026-01-06 22:22:57 +00:00
# Description of Changes

Introduce a new **LLM benchmarking app** and supporting code.

* **CLI:** `llm` with subcommands `run`, `routes list`, `diff`,
`ci-check`.
* **Runner:** executes globally numbered tasks; filters by `--lang`,
`--categories`, `--tasks`, `--providers`, `--models`.
* **Providers/clients:** route layer (`provider:model`) with HTTP LLM
vendor clients; env-driven keys/base URLs.
* **Evaluation:** deterministic scorers (hash/equality, JSON
shape/count, light schema/reducer parity) with clear failure messages.
* **Results:** stable JSON schema; single-file HTML viewer to
inspect/filter/export CSV.
* **Build & guards:** build script for compile-time setup.
* **Docs:** `DEVELOP.md` includes `cargo llm …` usage.

This PR is the initial addition of the app and its modules (runner,
config, routes, prompt/segmentation, scorers, schema/types,
defaults/constants/paths/hashing/combine, publishers, spacetime guard,
HTML stats viewer).

### How it works
1. **Pick what to run**

   * Choose tasks (`--tasks 0,7,12`), a language (`--lang rust|csharp`), or categories (`--categories basics,schema`).
   * Optionally limit vendors/models (`--providers …`, `--models …`).

2. **Resolve routes**

* Read env (API keys + base URLs) and build the active set (e.g.,
`openai:gpt-5`).

3. **Build context**

   * Start Spacetime
   * Publish golden answer modules
   * Prepare prompts and send them to the LLM
   * Attempt to publish the LLM-generated module

4. **Execute calls**

   * Run each selected task against the selected models and languages.

5. **Score outputs**

* Apply deterministic scorers (hash/equality, JSON shape/count, simple
schema/reducer checks).
   * Record the score and any short failure reason.

6. **Update results file**

* Write/update the single results JSON with task/route outcomes,
timings, and summaries.
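
Concretely, the run reduces to a loop over (task, route) pairs with a deterministic scorer at the end. A minimal, self-contained sketch of that shape, with hypothetical `Task`/`Route` types standing in for the runner's real ones:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical stand-ins for the runner's real types.
struct Task { id: u32, golden: String }
struct Route { provider: String, model: String }

/// Deterministic "hash/equality" score: 1.0 on an exact match, 0.0 otherwise,
/// with a short failure reason suitable for the results file.
fn score_equality(golden: &str, candidate: &str) -> (f64, Option<String>) {
    if golden == candidate {
        (1.0, None)
    } else {
        fn h(s: &str) -> u64 {
            let mut d = DefaultHasher::new();
            s.hash(&mut d);
            d.finish()
        }
        (0.0, Some(format!("hash mismatch: golden={:016x} got={:016x}",
                           h(golden), h(candidate))))
    }
}

fn main() {
    let tasks = vec![Task { id: 0, golden: "fn foo() {}".to_string() }];
    let routes = vec![Route { provider: "openai".into(), model: "gpt-5".into() }];
    for task in &tasks {
        for route in &routes {
            // The real runner sends the task prompt to the provider here.
            let candidate = "fn foo() {}".to_string(); // placeholder model output
            let (score, reason) = score_equality(&task.golden, &candidate);
            println!("task {} via {}:{} -> score {score} ({reason:?})",
                     task.id, route.provider, route.model);
        }
    }
}
```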


# API and ABI breaking changes

None. New application and modules; no existing public APIs/ABIs altered.

# Expected complexity level and risk

**4/5.** New CLI, routing, evaluation, and artifact format.

* External model APIs may rate-limit/timeout; concurrency tunable via
`LLM_BENCH_CONCURRENCY` / `LLM_BENCH_ROUTE_CONCURRENCY`.

# Testing

I ran the full test matrix and generated results for every task against
every vendor, model, and language (Rust + C#). I also tested the CI
check locally using [act](https://github.com/nektos/act).

**Please verify**

* [ ] `llm run --tasks 0,1,2` (explicit `run`)
* [ ] `llm run --lang rust --categories basics` (filters)
* [ ] `llm run --categories basics,schema` (multiple categories)
* [ ] `llm run --lang csharp` (language switch)
* [ ] `llm run --providers openai,anthropic --models "openai:gpt-5
anthropic:claude-sonnet-4-5"` (provider/model limits)
* [ ] `llm run --hash-only` (dry integrity)
* [ ] `llm run --goldens-only` (test goldens only)
* [ ] `llm run --force` (skip hash check)
* [ ] `llm ci-check`
* [ ] Stats viewer loads the JSON; filtering and CSV export work
* [ ] CI works as intended

---------

Signed-off-by: bradleyshep <148254416+bradleyshep@users.noreply.github.com>
Signed-off-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Tyler Cloutier <cloutiertyler@aol.com>
Co-authored-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: spacetimedb-bot <spacetimedb-bot@users.noreply.github.com>
Co-authored-by: John Detter <4099508+jdetter@users.noreply.github.com>


DEVELOP.md

This document explains how to configure the environment, run the LLM benchmark tool, and work with the benchmark suite.


Table of Contents

  1. Quick Checks & Fixes
  2. Environment Variables
  3. Benchmark Suite
  4. Troubleshooting

Quick Checks & Fixes

Use this single command to quickly unblock CI by regenerating hashes and running only GPT-5 for the minimal Rust + C# passes. This is not the full benchmark suite.

cargo llm ci-quickfix

What this does:

  1. Runs Rust rustdoc_json pass for GPT-5 only.
  2. Runs C# docs pass for GPT-5 only.
  3. Writes updated results & summary.

Model IDs passed to --models must match configured routes (see model_routes.rs), e.g. "openai:gpt-5".
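
For illustration, splitting a route string on its first colon is enough to recover the (provider, model) pair; a sketch only, not the actual model_routes.rs code:

```rust
/// Split a route string like "openai:gpt-5" into (provider, model).
/// Sketch only; the real parsing/validation lives in model_routes.rs.
fn parse_route(route: &str) -> Option<(&str, &str)> {
    let (provider, model) = route.split_once(':')?;
    if provider.is_empty() || model.is_empty() {
        return None;
    }
    Some((provider, model))
}

fn main() {
    assert_eq!(parse_route("openai:gpt-5"), Some(("openai", "gpt-5")));
    assert_eq!(parse_route("gpt-5"), None); // missing provider prefix
}
```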

Spacetime CLI

Publishing is performed via the spacetime CLI (spacetime publish -c -y --server <name> <db>). Ensure:

  • spacetime is on PATH
  • The target server is reachable/running

Environment Variables

These are the defaults and/or recommended dev values.

| Name | Purpose | Values / Example | Required |
| --- | --- | --- | --- |
| SPACETIME_SERVER | Target SpacetimeDB environment | local | |
| LLM_DEBUG | Print short debug info while generating | true / false (default true in dev) | |
| LLM_DEBUG_VERBOSE | Extra-verbose logs (payloads, scoring detail) | false | |
| LLM_BENCH_CONCURRENCY | Parallel task concurrency across the whole bench run | 20 | |
| LLM_BENCH_ROUTE_CONCURRENCY | Per-route concurrency (throttle per vendor/model) | 4 | |
| OPENAI_API_KEY | OpenAI credential | sk-... | optional* |
| OPENAI_BASE_URL | OpenAI-compatible base URL override | https://api.openai.com/ | optional |
| ANTHROPIC_API_KEY | Anthropic credential | ... | optional* |
| ANTHROPIC_BASE_URL | Anthropic base URL override | https://api.anthropic.com | optional |
| GOOGLE_API_KEY | Gemini credential | ... | optional* |
| GOOGLE_BASE_URL | Gemini base URL override | https://generativelanguage.googleapis.com | optional |
| XAI_API_KEY | xAI Grok credential | ... | optional |
| DEEPSEEK_API_KEY | DeepSeek credential | ... | optional |
| META_API_KEY | Meta Llama credential | ... | optional* |

*Required only if you plan to run that provider locally.
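
The two LLM_BENCH_* knobs above are read straight from the environment; a sketch of the fallback-to-default pattern (illustrative, not the repo's actual loader):

```rust
use std::env;

/// Read a numeric env var, falling back to a default when unset or unparsable.
fn env_usize(name: &str, default: usize) -> usize {
    env::var(name).ok().and_then(|v| v.parse().ok()).unwrap_or(default)
}

fn main() {
    let bench = env_usize("LLM_BENCH_CONCURRENCY", 20);          // whole-run parallelism
    let per_route = env_usize("LLM_BENCH_ROUTE_CONCURRENCY", 4); // per vendor/model throttle
    println!("bench={bench} per_route={per_route}");
}
```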

Canonical dev block (copy/paste into your shell profile):

OPENAI_API_KEY=
OPENAI_BASE_URL=https://api.openai.com/

ANTHROPIC_API_KEY=
ANTHROPIC_BASE_URL=https://api.anthropic.com

GOOGLE_API_KEY=
GOOGLE_BASE_URL=https://generativelanguage.googleapis.com

XAI_API_KEY=
XAI_BASE_URL=https://api.x.ai

DEEPSEEK_API_KEY=
DEEPSEEK_BASE_URL=https://api.deepseek.com

META_API_KEY=
META_BASE_URL=https://openrouter.ai/api/v1

SPACETIME_SERVER="local"
LLM_DEBUG=true
LLM_DEBUG_VERBOSE=false
LLM_BENCH_CONCURRENCY=20
LLM_BENCH_ROUTE_CONCURRENCY=4

Windows PowerShell:

$env:SPACETIME_SERVER="local"
$env:LLM_DEBUG="true"
$env:LLM_DEBUG_VERBOSE="false"
$env:LLM_BENCH_CONCURRENCY="20"
$env:LLM_BENCH_ROUTE_CONCURRENCY="4"

LLM Providers — Keys & Base URLs

Notes

  • These match the providers wired in this repo (OpenAiClient, AnthropicClient, GoogleGeminiClient, XaiGrokClient, DeepSeekClient, MetaLlamaClient).
| Provider | API Key Env | Base URL Env (optional) | Default Base URL |
| --- | --- | --- | --- |
| OpenAI | OPENAI_API_KEY | OPENAI_BASE_URL | https://api.openai.com |
| Anthropic | ANTHROPIC_API_KEY | ANTHROPIC_BASE_URL | https://api.anthropic.com |
| Google Gemini | GOOGLE_API_KEY | GOOGLE_BASE_URL | https://generativelanguage.googleapis.com |
| xAI Grok | XAI_API_KEY | XAI_BASE_URL | https://api.x.ai |
| DeepSeek | DEEPSEEK_API_KEY | DEEPSEEK_BASE_URL | https://api.deepseek.com |
| Meta Llama | META_API_KEY | META_BASE_URL | https://openrouter.ai/api/v1 |

Benchmark Suite

Results directory: docs/llms

Result Files

There are two sets of result files, each serving a different purpose:

| Files | Purpose | Updated By |
| --- | --- | --- |
| docs-benchmark-details.json, docs-benchmark-summary.json | Test documentation quality with a single reference model (GPT-5) | cargo llm ci-quickfix |
| llm-comparison-details.json, llm-comparison-summary.json | Compare all LLMs against the same documentation | cargo llm run |

  • docs-benchmark: Used by CI to ensure documentation quality. Contains only GPT-5 results.
  • llm-comparison: Used for manual benchmark runs to compare LLM performance. Contains results from all configured models.

Result writes are lock-safe and atomic: the tool takes an exclusive lock, writes to a temp file, then renames it into place, so concurrent runs won't corrupt results.
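
A sketch of that write path (exclusive lock, temp file, atomic rename); this version uses the fs2 crate for the advisory lock and illustrative names, which may differ from the tool's actual implementation:

```rust
use fs2::FileExt; // advisory file locks (external crate)
use std::fs::{self, File, OpenOptions};
use std::io::Write;

/// Write `json` to `path` without ever exposing a half-written file:
/// hold an exclusive lock on a sidecar lock file, write a temp file,
/// then atomically rename it over the destination.
fn write_results(path: &str, json: &str) -> std::io::Result<()> {
    let lock = OpenOptions::new()
        .create(true)
        .write(true)
        .open(format!("{path}.lock"))?;
    lock.lock_exclusive()?; // blocks until concurrent runs release it

    let tmp = format!("{path}.tmp");
    let mut f = File::create(&tmp)?;
    f.write_all(json.as_bytes())?;
    f.sync_all()?;           // flush before the rename makes it visible
    fs::rename(&tmp, path)?; // atomic on the same filesystem

    lock.unlock()
}

fn main() -> std::io::Result<()> {
    write_results("llm-comparison-summary.json", "{\"ok\":true}")
}
```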

Open llm_benchmark_stats_viewer.html in a browser to inspect merged results locally.

Current Benchmarks

basics

  0. empty-reducers — tests whether it can create basic reducers with various arguments
  1. basic-tables — can it create tables with basic columns
  2. scheduled-table — can it create a scheduled table and reducer
  3. struct-in-table — can it put a struct in a table
  4. insert — can it insert a row
  5. update — can it update a row
  6. delete — can it delete a row
  7. crud — can it insert, update, and delete a row in the same reducer
  8. index-lookup — can it look up something from an index
  9. init — can it write the init reducer
  10. connect — can it write the client_connected/client_disconnected reducers
  11. helper-function — can it create a non-reducer helper function

schema

  12. spacetime-product-type — can it define a new spacetime product type
  13. spacetime-sum-type — can it define a new sum type
  14. elementary-columns — can it create columns with basic types
  15. product-type-columns — can it create columns with product types
  16. sum-type-columns — can it create columns with sum types
  17. scheduled — can it create scheduled columns
  18. constraints — can it add primary keys, unique constraints, and indexes
  19. many-to-many — can it create a many-to-many relationship
  20. ecs — can it create a basic ECS
  21. multi-column-index — can it create a multi-column index
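
To give a feel for what a golden answer covers, here is a rough Rust module in the spirit of basic-tables and insert (tasks 1 and 4). Treat it as an illustration of the shape, not the repo's actual answer files; the module API shown is the current public SpacetimeDB Rust API, but check answers/ for the authoritative style:

```rust
use spacetimedb::{reducer, table, ReducerContext, Table};

// A table with basic columns (in the spirit of task 1, basic-tables).
#[table(name = person)]
pub struct Person {
    #[primary_key]
    #[auto_inc]
    id: u64,
    name: String,
    age: u32,
}

// Insert a row (in the spirit of task 4, insert).
#[reducer]
pub fn add_person(ctx: &ReducerContext, name: String, age: u32) {
    // id = 0 lets #[auto_inc] assign the real value on insert.
    ctx.db.person().insert(Person { id: 0, name, age });
}
```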

Benchmarks live under benchmarks/ with structure like:

benchmarks/
  category/
    t_001_foo/
      tasks/
        rust.txt
        csharp.txt
      answers/
        rust.rs
        csharp.cs
      spec.rs          # scoring config, reducer/schema checks, etc.

Creating a new benchmark

  1. Copy an existing benchmark
  • Duplicate any existing benchmark folder.
  • Bump the numeric prefix to a new, unused ID: t_123_my_task.
  2. Rename for the new task
  • Rename the folder to your ID + short slug: t_123_my_task.
  3. Write the task prompt
  • Create/update tasks/rust.txt and/or tasks/csharp.txt.
  • Be explicit (tables, reducers, helpers, constraints). Avoid ambiguity.
  4. Add golden answers
  • Implement the canonical solution in answers/rust.rs and/or answers/csharp.cs.
  5. Define scoring
  • Edit spec.rs to add scorers (e.g., schema/table/field checks, reducer/func exists); see the sketch after this list.
  6. Quick validation
  • Build goldens only:
    cargo llm run --goldens-only --tasks t_123_my_task
  7. Categorize
  • Ensure the folder sits under the right category path.
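
As a purely hypothetical illustration of the kind of checks a spec declares (the `Check`/`Spec` names below are made up; the real API is whatever spec.rs defines):

```rust
// Hypothetical shapes only; the real spec.rs types live in this repo.
enum Check {
    TableExists(&'static str),
    FieldExists { table: &'static str, field: &'static str },
    ReducerExists(&'static str),
}

struct Spec {
    checks: Vec<Check>,
}

fn describe(c: &Check) -> String {
    match c {
        Check::TableExists(t) => format!("table `{t}` exists"),
        Check::FieldExists { table, field } => format!("field `{table}.{field}` exists"),
        Check::ReducerExists(r) => format!("reducer `{r}` exists"),
    }
}

fn main() {
    let spec = Spec {
        checks: vec![
            Check::TableExists("person"),
            Check::FieldExists { table: "person", field: "age" },
            Check::ReducerExists("add_person"),
        ],
    };
    for c in &spec.checks {
        println!("check: {}", describe(c));
    }
}
```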

Typical Commands

# Run everything with current env (providers/models from your .env)
cargo llm run

# Only Rust (or C#)
cargo llm run --lang rust
cargo llm run --lang csharp

# Only certain categories (use your actual category names)
cargo llm run --categories basics,schema

# Only certain tasks by number (globally numbered)
cargo llm run --tasks 0,7,12

# Limit providers/models explicitly
cargo llm run \
  --providers openai,anthropic \
  --models "openai:gpt-5 anthropic:claude-sonnet-4-5"

# Dry runs
cargo llm run --hash-only         # build context only (no provider calls)
cargo llm run --goldens-only      # build/check goldens only

# Be aggressive (skip some safety checks)
cargo llm run --force

# CI sanity check per language
cargo llm ci-check --lang rust
cargo llm ci-check --lang csharp

Outputs:

  • Logs to stdout/stderr (respecting LLM_DEBUG/LLM_DEBUG_VERBOSE).
  • JSON results in a per-run folder (timestamped), merged into aggregate reports.

Troubleshooting

HTTP 400/404 from providers

  • Check the model ID spelling and whether it's available for your account/region.
  • Verify the correct base URL for non-default gateways.

Timeouts / Rate-limits

  • Lower LLM_BENCH_CONCURRENCY or LLM_BENCH_ROUTE_CONCURRENCY.
  • Some providers aggressively throttle bursts; use backoff/retry when supported.
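
Where retries are an option, a capped exponential backoff usually smooths out burst throttling; a generic sketch, not tied to this repo's clients:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry `op` up to `max_attempts` times, doubling the delay after each
/// failure (capped at 30s). Generic sketch; real clients should also
/// honor Retry-After headers where the provider sends them.
fn with_backoff<T, E>(max_attempts: u32, mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut delay = Duration::from_millis(500);
    let mut attempt = 1;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt >= max_attempts => return Err(e),
            Err(_) => {
                sleep(delay);
                delay = (delay * 2).min(Duration::from_secs(30));
                attempt += 1;
            }
        }
    }
}

fn main() {
    let mut tries = 0;
    // Simulated flaky call: fails twice, then succeeds.
    let res: Result<u32, &str> = with_backoff(5, || {
        tries += 1;
        if tries < 3 { Err("rate limited") } else { Ok(tries) }
    });
    println!("{res:?}");
}
```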