bradleyshep efa6f382b1 LLM benchmark tool updates (#4413)
# Description of Changes

LLM benchmark updates for local development:

- **Local SDK paths**: Templates use relative paths to workspace crates
(`crates/bindings`, `crates/bindings-csharp`,
`crates/bindings-typescript`) instead of published packages, so the
bench runs against local SDK changes.
- **NODEJS_DIR support**: On Windows (e.g. nvm4w), if `pnpm` is not on
PATH, the bench uses `NODEJS_DIR` to locate `pnpm` and prepends it to
PATH for subprocesses.
- **Refactor**: Extracted `relative_to_workspace()` in `templates.rs`
and removed noisy `NODEJS_DIR` logging in `publishers.rs`.
- **Benchmark results**: Updated `docs/llms/llm-comparison-details.json`
and `docs/llms/llm-comparison-summary.json`.

# API and ABI breaking changes

None.

# Expected complexity level and risk

**2** — Local-only changes to the benchmark tool. Templates now require
local SDKs to be built (especially TypeScript: `pnpm build` in
`crates/bindings-typescript`). No impact on published SDKs or runtime.

# Testing

- [ ] Run `cargo llm run --lang rust --modes docs --providers openai`
from repo root
- [ ] Run TypeScript benchmarks with `pnpm build` in
`crates/bindings-typescript` first
- [ ] On Windows with nvm4w, set `NODEJS_DIR` if `pnpm` is not on PATH
and run TypeScript benchmarks

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: clockwork-labs-bot <bot@clockworklabs.com>
Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
2026-03-01 02:22:59 +00:00


DEVELOP.md

This document explains how to configure the environment, run the LLM benchmark tool, and work with the benchmark suite.


Table of Contents

  1. Prerequisites
  2. Quick Checks & Fixes
  3. Environment Variables
  4. Benchmark Suite
  5. Context Construction
  6. Troubleshooting

Prerequisites

  • Run from repo root — cargo llm and related commands must be run from the workspace root (this repo).
  • TypeScript benchmarks — Run pnpm build in crates/bindings-typescript first (see the setup commands after this list). Rust and C# use local crates that are built as part of the workspace.
  • Windows (nvm4w) — If pnpm is not found when running TypeScript benchmarks, set NODEJS_DIR to your Node.js bin directory (e.g. C:\nvm\v20.10.0).
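
For example, preparing a TypeScript run (PowerShell shown for the Windows case; the NODEJS_DIR path is an example and must match your Node.js install):

cd crates/bindings-typescript
pnpm build
cd ..\..

# Only needed if pnpm is not on PATH (e.g. nvm4w):
$env:NODEJS_DIR = "C:\nvm\v20.10.0"
cargo llm run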

Quick Checks & Fixes

Use this single command to quickly unblock CI by regenerating hashes and running only GPT-5 for the minimal Rust + C# passes. This is not the full benchmark suite.

Note: You will need OpenAI API keys to run this locally. Alternatively, any SpacetimeDB member can comment /update-llm-benchmark on a PR to start a CI job to do this.

cargo llm ci-quickfix

What this does:

  1. Runs Rust rustdoc_json pass for GPT-5 only.
  2. Runs C# docs pass for GPT-5 only.
  3. Writes updated results & summary.

Model IDs passed to --models must match configured routes (see model_routes.rs), e.g. "openai:gpt-5".
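
For example, a single-route manual run (both flags are shown in full under Typical Commands below):

cargo llm run --providers openai --models "openai:gpt-5"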

Spacetime CLI

Publishing is performed via the spacetime CLI (spacetime publish -c -y --server <name> <db>). Ensure the following (see the quick check after the list):

  • spacetime is on PATH
  • The target server is reachable/running
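
A quick sanity check, assuming the CLI's version subcommand (my_bench_db is a placeholder database name):

spacetime version                                    # CLI is installed and on PATH
spacetime publish -c -y --server local my_bench_db   # target server is reachable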

Environment Variables

These are the defaults and/or recommended dev values.

| Name | Purpose | Values / Example | Required |
|------|---------|------------------|----------|
| SPACETIME_SERVER | Target SpacetimeDB environment | local | |
| LLM_DEBUG | Print short debug info while generating | true / false (default true in dev) | |
| LLM_DEBUG_VERBOSE | Extra-verbose logs (payloads, scoring detail) | false | |
| LLM_BENCH_CONCURRENCY | Parallel task concurrency across the whole bench run | 20 | |
| LLM_BENCH_ROUTE_CONCURRENCY | Per-route concurrency (throttle per vendor/model) | 4 | |
| OPENAI_API_KEY | OpenAI credential | sk-... | optional* |
| OPENAI_BASE_URL | OpenAI-compatible base URL override | https://api.openai.com/ | optional |
| ANTHROPIC_API_KEY | Anthropic credential | ... | optional* |
| ANTHROPIC_BASE_URL | Anthropic base URL override | https://api.anthropic.com | optional |
| GOOGLE_API_KEY | Gemini credential | ... | optional* |
| GOOGLE_BASE_URL | Gemini base URL override | https://generativelanguage.googleapis.com | optional |
| XAI_API_KEY | xAI Grok credential | ... | optional |
| DEEPSEEK_API_KEY | DeepSeek credential | ... | optional |
| META_API_KEY | Meta Llama credential | ... | optional* |

*Required only if you plan to run that provider locally.

Canonical dev block (copy/paste into your .env file or shell profile; see the loading note after the block):

OPENAI_API_KEY=
OPENAI_BASE_URL=https://api.openai.com/

ANTHROPIC_API_KEY=
ANTHROPIC_BASE_URL=https://api.anthropic.com

GOOGLE_API_KEY=
GOOGLE_BASE_URL=https://generativelanguage.googleapis.com

XAI_API_KEY=
XAI_BASE_URL=https://api.x.ai

DEEPSEEK_API_KEY=
DEEPSEEK_BASE_URL=https://api.deepseek.com

META_API_KEY=
META_BASE_URL=https://openrouter.ai/api/v1

SPACETIME_SERVER="local"
LLM_DEBUG=true
LLM_DEBUG_VERBOSE=false
LLM_BENCH_CONCURRENCY=20
LLM_BENCH_ROUTE_CONCURRENCY=4
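
Note: plain NAME=value assignments are .env-file syntax; in bash/zsh they only reach child processes if exported. A common idiom for loading a .env file into the current shell (not specific to this tool):

set -a        # auto-export every variable assigned while this is active
source .env
set +a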

Windows PowerShell:

$env:SPACETIME_SERVER="local"
$env:LLM_DEBUG="true"
$env:LLM_DEBUG_VERBOSE="false"
$env:LLM_BENCH_CONCURRENCY="20"
$env:LLM_BENCH_ROUTE_CONCURRENCY="4"

LLM Providers — Keys & Base URLs

Notes

  • These match the providers wired in this repo (OpenAiClient, AnthropicClient, GoogleGeminiClient, XaiGrokClient, DeepSeekClient, MetaLlamaClient).

| Provider | API Key Env | Base URL Env (optional) | Default Base URL |
|----------|-------------|-------------------------|------------------|
| OpenAI | OPENAI_API_KEY | OPENAI_BASE_URL | https://api.openai.com |
| Anthropic | ANTHROPIC_API_KEY | ANTHROPIC_BASE_URL | https://api.anthropic.com |
| Google Gemini | GOOGLE_API_KEY | GOOGLE_BASE_URL | https://generativelanguage.googleapis.com |
| xAI Grok | XAI_API_KEY | XAI_BASE_URL | https://api.x.ai |
| DeepSeek | DEEPSEEK_API_KEY | DEEPSEEK_BASE_URL | https://api.deepseek.com |
| Meta Llama | META_API_KEY | META_BASE_URL | https://openrouter.ai/api/v1 |

Benchmark Suite

Results directory: docs/llms

Result Files

There are two sets of result files, each serving a different purpose:

| Files | Purpose | Updated By |
|-------|---------|------------|
| docs-benchmark-details.json, docs-benchmark-summary.json | Test documentation quality with a single reference model (GPT-5) | cargo llm ci-quickfix |
| llm-comparison-details.json, llm-comparison-summary.json | Compare all LLMs against the same documentation | cargo llm run |

  • docs-benchmark: Used by CI to ensure documentation quality. Contains only GPT-5 results.
  • llm-comparison: Used for manual benchmark runs to compare LLM performance. Contains results from all configured models.

Result writes are lock-safe and atomic: the tool takes an exclusive lock and writes via a temp file, then renames it into place, so concurrent runs won't corrupt results.
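
The same pattern in shell form, as an illustration only (this is not the tool's actual code; file names are examples):

(
  flock -x 200                                 # exclusive lock: one writer at a time
  tmp=$(mktemp results.XXXXXX)                 # temp file in the same directory
  echo '{ "results": [] }' > "$tmp"            # write the full new contents
  mv -f "$tmp" llm-comparison-details.json     # rename is atomic on one filesystem
) 200> .results.lock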

Open llm_benchmark_stats_viewer.html in a browser to inspect merged results locally.

Current Benchmarks

basics

  000. empty-reducers — tests whether it can create basic reducers with various arguments
  001. basic-tables — can it create tables with basic columns
  002. scheduled-table — can it create a scheduled table and reducer
  003. struct-in-table — can it put a struct in a table
  004. insert — can it insert a row
  005. update — can it update a row
  006. delete — can it delete a row
  007. crud — can it insert, update, and delete a row in the same reducer
  008. index-lookup — can it look up something from an index
  009. init — can it write the init reducer
  010. connect — can it write the client_connected/client_disconnected reducers
  011. helper-function — can it create a non-reducer helper function

schema

  012. spacetime-product-type — can it define a new spacetime product type
  013. spacetime-sum-type — can it define a new sum type
  014. elementary-columns — can it create columns with basic types
  015. product-type-columns — can it create columns with product types
  016. sum-type-columns — can it create columns with sum types
  017. scheduled — can it create scheduled columns
  018. constraints — can it add primary keys, unique constraints, and indexes
  019. many-to-many — can it create a many-to-many relationship
  020. ecs — can it create a basic ECS
  021. multi-column-index — can it create a multi-column index

Benchmarks live under benchmarks/ with structure like:

benchmarks/
  category/
    t_001_foo/
      tasks/
        rust.txt
        csharp.txt
      answers/
        rust.rs
        csharp.cs
      spec.rs          # scoring config, reducer/schema checks, etc.

Creating a new benchmark

  1. Copy an existing benchmark
  • Duplicate any existing benchmark folder (a full worked example follows this list).
  • Bump the numeric prefix to a new, unused ID: t_123_my_task.
  2. Rename for the new task
  • Rename the folder to your ID + short slug: t_123_my_task.
  3. Write the task prompt
  • Create/update tasks/rust.txt and/or tasks/csharp.txt.
  • Be explicit (tables, reducers, helpers, constraints). Avoid ambiguity.
  4. Add golden answers
  • Implement the canonical solution in answers/rust.rs and/or answers/csharp.cs.
  5. Define scoring
  • Edit spec.rs to add scorers (e.g., schema/table/field checks, reducer/func exists).
  6. Quick validation
  • Build goldens only:
    cargo llm run --goldens-only --tasks t_123_my_task
  7. Categorize
  • Ensure the folder sits under the right category path.
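
A hypothetical end-to-end example (t_000_empty_reducers is an assumed folder name for the basics/empty-reducers benchmark; adjust to a real one):

cp -r benchmarks/basics/t_000_empty_reducers benchmarks/basics/t_123_my_task
$EDITOR benchmarks/basics/t_123_my_task/tasks/rust.txt      # describe the task explicitly
$EDITOR benchmarks/basics/t_123_my_task/answers/rust.rs     # write the golden answer
$EDITOR benchmarks/basics/t_123_my_task/spec.rs             # define scorers
cargo llm run --goldens-only --tasks t_123_my_task          # validate the golden builds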

Typical Commands

# Run everything with current env (providers/models from your .env)
cargo llm run

# Only Rust (or C#)
cargo llm run --lang rust
cargo llm run --lang csharp

# Only certain categories (use your actual category names)
cargo llm run --categories basics,schema

# Only certain tasks by number (globally numbered)
cargo llm run --tasks 0,7,12

# Limit providers/models explicitly
cargo llm run \
  --providers openai,anthropic \
  --models "openai:gpt-5 anthropic:claude-sonnet-4-5"

# Dry runs
cargo llm run --hash-only         # build context only (no provider calls)
cargo llm run --goldens-only      # build/check goldens only

# Be aggressive (skip some safety checks)
cargo llm run --force

# CI sanity check per language
cargo llm ci-check --lang rust
cargo llm ci-check --lang csharp

# Generate PR comment markdown (compares against master baseline)
cargo llm ci-comment
# With custom baseline ref
cargo llm ci-comment --baseline-ref origin/main

Outputs:

  • Logs to stdout/stderr (respecting LLM_DEBUG/LLM_DEBUG_VERBOSE).
  • JSON results in a per-run folder (timestamped), merged into aggregate reports.

Context Construction

The benchmark tool constructs a context (documentation) that is sent to the LLM along with each task prompt. The context varies by language and mode.

Modes

| Mode | Language | Source | Description |
|------|----------|--------|-------------|
| rustdoc_json | Rust | crates/bindings | Generates rustdoc JSON and extracts documentation from the spacetimedb crate |
| docs | C# | docs/docs/**/*.md | Concatenates all markdown files from the documentation |
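
Modes can be selected with --modes (the first command below is taken from this PR's testing checklist; treating the table's mode names as valid --modes values is an assumption):

cargo llm run --lang rust --modes docs --providers openai
cargo llm run --lang rust --modes rustdoc_json --providers openai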

Tab Filtering

When building context for a specific language, the tool filters <Tabs> components to only include content relevant to the target language. This reduces noise and helps the LLM focus on the correct syntax.

Filtered tab groupIds:

| groupId | Purpose | Tab Values |
|---------|---------|------------|
| server-language | Server module code examples | rust, csharp, typescript |
| client-language | Client SDK code examples | rust, csharp, typescript, cpp, blueprint |

Filtering behavior:

  • For C# tests: Only value="csharp" tabs are kept
  • For Rust tests: Only value="rust" tabs are kept
  • If no matching tab exists (e.g., client-language with only cpp/blueprint), the entire tabs block is removed

Example transformation:

Before (in markdown):

<Tabs groupId="server-language" queryString>
<TabItem value="csharp" label="C#">
C# code here
</TabItem>
<TabItem value="rust" label="Rust">
Rust code here
</TabItem>
</Tabs>

After (for C# context):

C# code here

Documentation Best Practices

When writing documentation that will be used by the benchmark:

  1. Use consistent tab groupIds: Always use server-language for server module code and client-language for client SDK code
  2. Include all supported languages: Ensure each <Tabs> block has tabs for all languages you want to test (see the example after this list)
  3. Use consistent naming conventions: The benchmark compares LLM output against golden answers, so documentation should reflect the expected conventions (e.g., PascalCase table names for C#)
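
For instance, a server-module block covering all three server languages (structure mirrors the example above; code contents elided):

<Tabs groupId="server-language" queryString>
<TabItem value="rust" label="Rust">
Rust code here
</TabItem>
<TabItem value="csharp" label="C#">
C# code here
</TabItem>
<TabItem value="typescript" label="TypeScript">
TypeScript code here
</TabItem>
</Tabs>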

Troubleshooting

HTTP 400/404 from providers

  • Check the model ID spelling and whether it's available for your account/region.
  • Verify the correct base URL for non-default gateways.

Timeouts / Rate-limits

  • Lower LLM_BENCH_CONCURRENCY or LLM_BENCH_ROUTE_CONCURRENCY (see the one-off override below).
  • Some providers aggressively throttle bursts; use backoff/retry when supported.
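
To try lower limits for a single run without changing your profile (standard shell env-prefix syntax):

LLM_BENCH_CONCURRENCY=5 LLM_BENCH_ROUTE_CONCURRENCY=1 cargo llm run --lang rust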