SpacetimeDB/docs/DEVELOP.md
bradleyshep be86a512f2 LLM Benchmark Improvements + More Evals (#4740)
2026-05-11 22:53:24 +00:00

# DEVELOP.md
This document explains how to configure the environment, run the LLM benchmark tool, and work with the benchmark suite.
---
## Table of Contents
1. [Prerequisites](#prerequisites)
2. [Quick Checks & Fixes](#quick-checks-fixes)
3. [Environment Variables](#environment-variables)
4. [Benchmark Suite](#benchmark-suite)
5. [Context Construction](#context-construction)
6. [Troubleshooting](#troubleshooting)
---
## Prerequisites
- **Run from repo root** — `cargo llm` and related commands must be run from the workspace root (this repo).
- **TypeScript benchmarks** — Run `pnpm build` in `crates/bindings-typescript` first. Rust and C# use local crates that are built as part of the workspace.
- **Windows (nvm4w)** — If `pnpm` is not found when running TypeScript benchmarks, set `NODEJS_DIR` to your Node.js bin directory (e.g. `C:\nvm\v20.10.0`).
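
As a rough preflight for the first two points, the sketch below checks that a directory looks like a Cargo workspace root before you build the TypeScript bindings. The helper name `at_workspace_root` and the `[workspace]` test are assumptions made for illustration; they are not part of `cargo llm`.

```bash
# Hypothetical helper: succeeds only when the given directory (default: the
# current one) contains a Cargo.toml with a [workspace] section.
at_workspace_root() {
  local dir="${1:-.}"
  [ -f "$dir/Cargo.toml" ] && grep -q '^\[workspace\]' "$dir/Cargo.toml"
}
```

One possible use is `at_workspace_root && (cd crates/bindings-typescript && pnpm build)`, which runs the build in a subshell so your working directory stays at the repo root.
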
---
## Quick Checks & Fixes
Use this single command to quickly unblock CI by regenerating hashes and running only GPT-5 for the minimal Rust + C# passes. This is not the full benchmark suite.
**Note: You will need OpenAI API keys to run this locally**. Alternatively, any SpacetimeDB member can comment `/update-llm-benchmark` on a PR to start a CI job to do this.
`cargo llm ci-quickfix`
What this does:
1. Runs the Rust `rustdoc_json` pass for GPT-5 only.
2. Runs the C# `docs` pass for GPT-5 only.
3. Writes updated results & summary.
---
> Model IDs passed to `--models` must match configured routes (see `model_routes.rs`), e.g. `"openai:gpt-5"`.
### Spacetime CLI
Publishing is performed via the `spacetime` CLI (`spacetime publish -c -y --server <name> <db>`). Ensure:
- `spacetime` is on PATH
- The target server is reachable/running
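
The PATH requirement above can be checked before a run with a small guard like the following. This is a minimal sketch: `require_cmd` is a hypothetical helper name, and it only verifies that the tool is on PATH, not that the target server is reachable.

```bash
# Hypothetical helper: fail with a readable message when a tool is not on PATH.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "error: '$1' not found on PATH" >&2
    return 1
  fi
}
```

For example, `require_cmd spacetime && require_cmd cargo` returns non-zero as soon as one of the tools is missing.
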
## Environment Variables
> These are the **defaults** and/or recommended dev values.
| Name | Purpose | Values / Example | Required |
|---|---|---|---|
| `SPACETIME_SERVER` | Target SpacetimeDB environment | `local` | ✅ |
| `LLM_DEBUG` | Print short debug info while generating | `true` / `false` (default `true` in dev) | ✅ |
| `LLM_DEBUG_VERBOSE` | Extra-verbose logs (payloads, scoring detail) | `false` | ✅ |
| `LLM_BENCH_CONCURRENCY` | Parallel task concurrency across the whole bench run | `20` | ✅ |
| `LLM_BENCH_ROUTE_CONCURRENCY` | Per-route concurrency (throttle per vendor/model) | `4` | ✅ |
| `OPENAI_API_KEY` | OpenAI credential | `sk-...` | optional* |
| `OPENAI_BASE_URL` | OpenAI-compatible base URL override | `https://api.openai.com/` | optional |
| `ANTHROPIC_API_KEY` | Anthropic credential | `...` | optional* |
| `ANTHROPIC_BASE_URL` | Anthropic base URL override | `https://api.anthropic.com` | optional |
| `GOOGLE_API_KEY` | Gemini credential | `...` | optional* |
| `GOOGLE_BASE_URL` | Gemini base URL override | `https://generativelanguage.googleapis.com` | optional |
| `XAI_API_KEY` | xAI Grok credential | `...` | optional* |
| `DEEPSEEK_API_KEY` | DeepSeek credential | `...` | optional* |
| `META_API_KEY` | Meta Llama credential | `...` | optional* |
\*Required only if you plan to run that provider locally.
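**Canonical dev block** (copy/paste into your `.env`; if you put it in a shell profile instead, prefix each line with `export` so child processes such as `cargo` see the values):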
```bash
OPENAI_API_KEY=
OPENAI_BASE_URL=https://api.openai.com/
ANTHROPIC_API_KEY=
ANTHROPIC_BASE_URL=https://api.anthropic.com
GOOGLE_API_KEY=
GOOGLE_BASE_URL=https://generativelanguage.googleapis.com
XAI_API_KEY=
XAI_BASE_URL=https://api.x.ai
DEEPSEEK_API_KEY=
DEEPSEEK_BASE_URL=https://api.deepseek.com
META_API_KEY=
META_BASE_URL=https://openrouter.ai/api/v1
SPACETIME_SERVER="local"
LLM_DEBUG=true
LLM_DEBUG_VERBOSE=false
LLM_BENCH_CONCURRENCY=20
LLM_BENCH_ROUTE_CONCURRENCY=4
```
Windows PowerShell:
```powershell
$env:SPACETIME_SERVER="local"
$env:LLM_DEBUG="true"
$env:LLM_DEBUG_VERBOSE="false"
$env:LLM_BENCH_CONCURRENCY="20"
$env:LLM_BENCH_ROUTE_CONCURRENCY="4"
```
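
The bash block above uses bare `NAME=value` assignments, which a shell does not pass on to child processes unless they are exported. A minimal sketch for loading such a file with every variable exported is shown below; the helper name `load_env_file` and the default file name `.env` are assumptions for illustration.

```bash
# Hypothetical helper: source a file of NAME=value lines and export every
# variable it defines, so child processes such as `cargo` can see them.
load_env_file() {
  local file="${1:-.env}"
  [ -f "$file" ] || { echo "error: $file not found" >&2; return 1; }
  # `.` searches PATH for names without a slash, so force a relative path.
  case "$file" in
    */*) ;;
    *) file="./$file" ;;
  esac
  set -a        # auto-export every variable assigned from here on
  . "$file"
  set +a        # stop auto-exporting
}
```

Calling `load_env_file` with no argument reads `./.env`; pass a path to load a different file.
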
### LLM Providers — Keys & Base URLs
> Notes
> - These match the providers wired in this repo (`OpenAiClient`, `AnthropicClient`, `GoogleGeminiClient`, `XaiGrokClient`, `DeepSeekClient`, `MetaLlamaClient`).
| Provider | API Key Env | Base URL Env (optional) | Default Base URL |
|---------------|---------------------|-------------------------|---|
| OpenAI | `OPENAI_API_KEY` | `OPENAI_BASE_URL` | `https://api.openai.com` |
| Anthropic | `ANTHROPIC_API_KEY` | `ANTHROPIC_BASE_URL` | `https://api.anthropic.com` |
| Google Gemini | `GOOGLE_API_KEY` | `GOOGLE_BASE_URL` | `https://generativelanguage.googleapis.com` |
| xAI Grok | `XAI_API_KEY` | `XAI_BASE_URL` | `https://api.x.ai` |
| DeepSeek | `DEEPSEEK_API_KEY` | `DEEPSEEK_BASE_URL` | `https://api.deepseek.com` |
| Meta Llama    | `META_API_KEY`      | `META_BASE_URL`         | `https://openrouter.ai/api/v1` |
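
To see at a glance which provider credentials are still unset, something like the following could be run before a benchmark. It is a sketch only: the function name `missing_provider_keys` is hypothetical, and the variable names are taken from the table above.

```bash
# Print the name of every provider API key variable that is unset or empty.
missing_provider_keys() {
  local name value
  for name in OPENAI_API_KEY ANTHROPIC_API_KEY GOOGLE_API_KEY \
              XAI_API_KEY DEEPSEEK_API_KEY META_API_KEY; do
    eval "value=\${$name:-}"   # indirect lookup that also works in POSIX sh
    [ -n "$value" ] || echo "$name"
  done
}
```

A name in the output matters only if you plan to run that provider locally.
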
---
## Benchmark Suite
Results directory: `docs/llms`
### Results Storage
Benchmark results are stored in a remote PostgreSQL database via the spacetime-web API. Results are uploaded automatically after each benchmark batch when `LLM_BENCHMARK_UPLOAD_URL` and `LLM_BENCHMARK_API_KEY` environment variables are set. Use `--dry-run` to skip uploading.
### Current Benchmarks
**basics**
000. empty-reducers — tests whether it can create basic reducers with various arguments
001. basic-tables — can it create tables with basic columns
002. scheduled-table — can it create a scheduled table and reducer
003. struct-in-table — can it put a struct in a table
004. insert — can it insert a row
005. update — can it update a row
006. delete — can it delete a row
007. crud — can it insert, update, and delete a row in the same reducer
008. index-lookup — can it look up something from an index
009. init — can it write the init reducer
010. connect — can it write the client_connected/client_disconnected reducers
011. helper-function — can it create a non-reducer helper function
**schema**
012. spacetime-product-type — can it define a new spacetime product type
013. spacetime-sum-type — can it define a new sum type
014. elementary-columns — can it create columns with basic types
015. product-type-columns — can it create columns with product types
016. sum-type-columns — can it create columns with sum types
017. scheduled — can it create scheduled columns
018. constraints — can it add primary keys, unique constraints, and indexes
019. many-to-many — can it create a many-to-many relationship
020. ecs — can it create a basic ecs
021. multi-column-index — can it create a multi-column index
Benchmarks live under `benchmarks/` with structure like:
```
benchmarks/
  category/
    t_001_foo/
      tasks/
        rust.txt
        csharp.txt
      answers/
        rust.rs
        csharp.cs
      spec.rs   # scoring config, reducer/schema checks, etc.
```
### Creating a new benchmark
1. **Copy existing benchmark**
   - Duplicate any existing benchmark folder.
   - Bump the numeric prefix to a new, unused ID: `t_123_my_task`.
2. **Rename for the new task**
   - Rename the folder to your ID + short slug: `t_123_my_task`.
3. **Write the task prompt**
   - Create/update `tasks/rust.txt` and/or `tasks/csharp.txt`.
   - Be explicit (tables, reducers, helpers, constraints). Avoid ambiguity.
4. **Add golden answers**
   - Implement the canonical solution in `answers/rust.rs` and/or `answers/csharp.cs`.
5. **Define scoring**
   - Edit `spec.rs` to add scorers (e.g., schema/table/field checks, reducer/func exists).
6. **Quick validation**
   - Build goldens only:
     `cargo llm run --goldens-only --tasks t_123_my_task`
7. **Categorize**
   - Ensure the folder sits under the right category path.
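
Steps 1 and 2 above can be sketched as a single copy into the same category folder. The helper name `scaffold_benchmark` is hypothetical, and the example path in the usage note is illustrative only; adjust it to wherever `benchmarks/` lives in your checkout.

```bash
# Hypothetical helper: copy an existing benchmark folder to a new task ID
# inside the same category, refusing to overwrite an existing folder.
# Usage: scaffold_benchmark <existing_benchmark_dir> <new_folder_name>
scaffold_benchmark() {
  local src="$1" new_name="$2" dest
  dest="$(dirname "$src")/$new_name"
  [ -d "$src" ] || { echo "error: $src is not a directory" >&2; return 1; }
  [ ! -e "$dest" ] || { echo "error: $dest already exists" >&2; return 1; }
  cp -R "$src" "$dest"
  echo "$dest"   # print the new folder so callers can cd into it
}
```

For instance, `scaffold_benchmark benchmarks/basics/t_001_foo t_123_my_task` would create `benchmarks/basics/t_123_my_task`; the prompts, golden answers, and `spec.rs` inside it still need to be rewritten by hand for the new task.
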
### Typical Commands
```bash
# Run everything with current env (providers/models from your .env)
cargo llm run
# Only Rust (or C#)
cargo llm run --lang rust
cargo llm run --lang csharp
# Only certain categories (use your actual category names)
cargo llm run --categories basics,schema
# Only certain tasks by number (globally numbered)
cargo llm run --tasks 0,7,12
# Limit providers/models explicitly
cargo llm run \
  --providers openai,anthropic \
  --models "openai:gpt-5 anthropic:claude-sonnet-4-5"
# Dry runs
cargo llm run --hash-only # build context only (no provider calls)
cargo llm run --goldens-only # build/check goldens only
# Be aggressive (skip some safety checks)
cargo llm run --force
# CI sanity check per language
cargo llm ci-check --lang rust
cargo llm ci-check --lang csharp
# Generate PR comment markdown (compares against master baseline)
cargo llm ci-comment
# With custom baseline ref
cargo llm ci-comment --baseline-ref origin/main
```
Outputs:
- Logs to stdout/stderr (respecting `LLM_DEBUG`/`LLM_DEBUG_VERBOSE`).
- JSON results in a per-run folder (timestamped), merged into aggregate reports.
---
## Context Construction
The benchmark tool constructs a context (documentation) that is sent to the LLM along with each task prompt. The context varies by language and mode.
### Modes
| Mode | Language | Source | Description |
|------|----------|--------|-------------|
| `rustdoc_json` | Rust | `crates/bindings` | Generates rustdoc JSON and extracts documentation from the spacetimedb crate |
| `docs` | C# | `docs/docs/**/*.md` | Concatenates all markdown files from the documentation |
### Tab Filtering
When building context for a specific language, the tool filters `<Tabs>` components to only include content relevant to the target language. This reduces noise and helps the LLM focus on the correct syntax.
**Filtered tab groupIds:**
| groupId | Purpose | Tab Values |
|---------|---------|------------|
| `server-language` | Server module code examples | `rust`, `csharp`, `typescript` |
| `client-language` | Client SDK code examples | `rust`, `csharp`, `typescript`, `cpp`, `blueprint` |
**Filtering behavior:**
- For C# tests: Only `value="csharp"` tabs are kept
- For Rust tests: Only `value="rust"` tabs are kept
- If no matching tab exists (e.g., `client-language` with only `cpp`/`blueprint`), the entire tabs block is removed
**Example transformation:**
Before (in markdown):
```html
<Tabs groupId="server-language" queryString>
<TabItem value="csharp" label="C#">
C# code here
</TabItem>
<TabItem value="rust" label="Rust">
Rust code here
</TabItem>
</Tabs>
```
After (for C# context):
```
C# code here
```
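
The filtering rules above can be approximated with a short `awk` filter. This is a line-based sketch of the described behavior, not the tool's actual implementation: the function name `filter_tabs` is hypothetical, and it assumes each `<Tabs>`, `<TabItem>`, and closing tag sits on its own line and that filtered blocks are not nested.

```bash
# Sketch: keep only the TabItem body for one language inside <Tabs> blocks
# whose groupId is server-language or client-language. Reads stdin.
# Usage: filter_tabs <tab value, e.g. csharp> < input.md
filter_tabs() {
  awk -v lang="$1" '
    # Opening tag of a filtered group: start dropping lines by default.
    /<Tabs / && /groupId="(server|client)-language"/ { in_tabs = 1; keep = 0; next }
    # Closing tag of the group: go back to passing lines through.
    in_tabs && /<\/Tabs>/    { in_tabs = 0; keep = 0; next }
    # A tab header decides whether the lines that follow are kept.
    in_tabs && /<TabItem /   { keep = (index($0, "value=\"" lang "\"") > 0); next }
    in_tabs && /<\/TabItem>/ { keep = 0; next }
    # Outside filtered blocks print everything; inside, only the kept tab.
    !in_tabs || keep { print }
  '
}
```

Because nothing is printed for a block with no matching tab, such a block disappears entirely, which mirrors the last filtering rule. `<Tabs>` blocks with any other `groupId` pass through unchanged in this sketch.
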
### Documentation Best Practices
When writing documentation that will be used by the benchmark:
1. **Use consistent tab groupIds**: Always use `server-language` for server module code and `client-language` for client SDK code
2. **Include all supported languages**: Ensure each `<Tabs>` block has tabs for all languages you want to test
3. **Use consistent naming conventions**: The benchmark compares LLM output against golden answers, so documentation should reflect the expected conventions (e.g., PascalCase table names for C#)
---
## Troubleshooting
**HTTP 400/404 from providers**
- Check the model ID spelling and whether it's available for your account/region.
- Verify the correct base URL for non-default gateways.
**Timeouts / Rate-limits**
- Lower `LLM_BENCH_CONCURRENCY` or `LLM_BENCH_ROUTE_CONCURRENCY`.
- Some providers aggressively throttle bursts; use backoff/retry when supported.
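
If a whole run keeps failing on throttling, an outer retry with exponential backoff is one option. The wrapper below is a hypothetical shell-level sketch that reruns an entire command; it is separate from any retry handling inside the benchmark tool itself.

```bash
# Hypothetical wrapper: retry a command, doubling the wait after each failure.
# Usage: retry_with_backoff <max_attempts> <base_delay_seconds> <command...>
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while true; do
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "error: giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))       # double the wait before the next attempt
    attempt=$((attempt + 1))
  done
}
```

Combined with lower concurrency, a call might look like `retry_with_backoff 3 30 env LLM_BENCH_CONCURRENCY=4 LLM_BENCH_ROUTE_CONCURRENCY=1 cargo llm run --lang rust`, where the attempt count, delay, and concurrency values are illustrative rather than recommended settings.
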