# Description of Changes
LLM benchmark updates for local development:
- **Local SDK paths**: Templates use relative paths to workspace crates
(`crates/bindings`, `crates/bindings-csharp`,
`crates/bindings-typescript`) instead of published packages, so the
bench runs against local SDK changes.
- **NODEJS_DIR support**: On Windows (e.g. with nvm4w), if `pnpm` is not
on PATH, the bench uses `NODEJS_DIR` to locate `pnpm` and prepends that
directory to PATH for subprocesses.
- **Refactor**: Extracted `relative_to_workspace()` in `templates.rs`
and removed noisy `NODEJS_DIR` logging in `publishers.rs`.
- **Benchmark results**: Updated `docs/llms/llm-comparison-details.json`
and `docs/llms/llm-comparison-summary.json`.
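A minimal sketch of what the extracted `relative_to_workspace()` helper might look like (the signature and behavior here are assumed; the real helper in `templates.rs` may handle paths outside the workspace differently):

```rust
use std::path::{Path, PathBuf};

// Hypothetical sketch: express `target` relative to the workspace root by
// stripping the workspace prefix. Returns None if `target` is not inside
// the workspace.
fn relative_to_workspace(workspace: &Path, target: &Path) -> Option<PathBuf> {
    target.strip_prefix(workspace).ok().map(Path::to_path_buf)
}

fn main() {
    let ws = Path::new("/workspace");
    let rel = relative_to_workspace(ws, Path::new("/workspace/crates/bindings"));
    println!("{:?}", rel); // Some("crates/bindings")
}
```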
# API and ABI breaking changes
None.
# Expected complexity level and risk
**2** — Local-only changes to the benchmark tool. Templates now require
local SDKs to be built (especially TypeScript: `pnpm build` in
`crates/bindings-typescript`). No impact on published SDKs or runtime.
# Testing
- [ ] Run `cargo llm run --lang rust --modes docs --providers openai`
from repo root
- [ ] Run TypeScript benchmarks with `pnpm build` in
`crates/bindings-typescript` first
- [ ] On Windows with nvm4w, set `NODEJS_DIR` if `pnpm` is not on PATH
and run TypeScript benchmarks
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: clockwork-labs-bot <bot@clockworklabs.com>
Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
# Description of Changes
Major documentation overhaul focusing on tables, column types, and
indexes.
**Quickstart Guides:**
- Updated React, TypeScript, Rust, and C# quickstarts with table/reducer
examples
- Fixed CLI syntax (positional `--database` argument)
- Improved template consistency across languages
**Tables Documentation:**
- Added "Why Tables" section explaining table-oriented design philosophy
(tables as fundamental unit, system tables, data-oriented design
principles)
- Added "Physical and Logical Independence" section explaining how
subscription queries use the relational model independently of physical
storage
- Added brief sections linking to related pages (Visibility,
Constraints, Schedule Tables)
- Renamed "Scheduled Tables" to "Schedule Tables" throughout (tables
store schedules; reducers are scheduled)
**Column Types:**
- Split into dedicated page with unified type reference table
- Added "Representing Collections" section (`Vec`/`Array` vs table
tradeoffs)
- Added "Binary Data and Files" section for `Vec<u8>` storage patterns
- Added "Type Performance" section (smaller types, fixed-size types,
column ordering for alignment)
- Added complete example struct demonstrating all type categories
- Renamed "Structured" category to "Composite"
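By way of illustration, the "complete example struct" mentioned above might resemble the following (field names are invented here, and a real SpacetimeDB table would carry the table attribute and use only SDK-supported types):

```rust
// Illustrative only: one field per column-type category the docs describe.
#[derive(Debug)]
struct Player {
    id: u64,              // integer (smaller types can reduce row size)
    name: String,         // text
    online: bool,         // boolean
    rating: f32,          // floating point
    avatar_png: Vec<u8>,  // binary data / files
    tags: Vec<String>,    // collection (vs. modeling as a separate table)
    position: (f32, f32), // composite (the category formerly named "Structured")
}

fn main() {
    let p = Player {
        id: 1,
        name: "alice".to_string(),
        online: true,
        rating: 4.5,
        avatar_png: vec![0x89, 0x50, 0x4e, 0x47], // PNG magic bytes
        tags: vec!["builder".to_string()],
        position: (0.0, 0.0),
    };
    println!("{} online: {}", p.name, p.online);
}
```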
**Indexes:**
- Complete rewrite with textbook-style documentation
- Added "When to Use Indexes" guidance
- Documented single-column and multi-column index syntax (field-level
and table-level)
- Comprehensive range query examples with correct TypeScript `Range`
class syntax
- Explained multi-column index prefix matching semantics
- Added index-accelerated deletion examples
- Included index design guidelines
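The prefix-matching semantics can be illustrated with a plain `BTreeMap` analogy (this is not SpacetimeDB's index API): a btree index on `(country, city)` serves queries that fix `country` alone, because those rows form one contiguous key range, but it cannot efficiently serve queries on `city` alone.

```rust
use std::collections::BTreeMap;

// Prefix query on the first column of a two-column key: all (country, city)
// entries for a fixed country are contiguous in btree order.
fn cities_in(idx: &BTreeMap<(String, String), u32>, country: &str) -> Vec<u32> {
    // Appending NUL yields the smallest string strictly greater than
    // `country`, so the range covers exactly the keys whose first column
    // equals `country`.
    let lo = (country.to_string(), String::new());
    let hi = (format!("{}\u{0}", country), String::new());
    idx.range(lo..hi).map(|(_, v)| *v).collect()
}

fn main() {
    let mut idx = BTreeMap::new();
    idx.insert(("DE".to_string(), "Berlin".to_string()), 1);
    idx.insert(("US".to_string(), "Austin".to_string()), 2);
    idx.insert(("US".to_string(), "Boston".to_string()), 3);
    // Fixing the `country` prefix is an efficient range scan...
    println!("{:?}", cities_in(&idx, "US")); // [2, 3]
    // ...while a query on `city` alone would have to scan every key.
}
```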
**Styling:**
- Added CSS for table border radius and row separators
- Created Check component for green checkmarks in tables
# API and ABI breaking changes
None. Documentation only.
# Expected complexity level and risk
1 - Documentation changes only, no code changes.
# Testing
- [ ] Verify docs build without errors
- [ ] Review rendered pages for formatting issues
- [ ] Confirm code examples are syntactically correct
---------
Signed-off-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Signed-off-by: John Detter <4099508+jdetter@users.noreply.github.com>
Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
Co-authored-by: John Detter <4099508+jdetter@users.noreply.github.com>
# Description of Changes
I believe that local users generally do not have OpenAI API tokens, so
the existing hint was not helpful. Apparently the correct path is to
comment `/update-llm-benchmark` on the PR and let the CI take care of it.
# API and ABI breaking changes
None
# Expected complexity level and risk
1
# Testing
None
---------
Co-authored-by: Zeke Foppa <bfops@users.noreply.github.com>
# Description of Changes
Introduce a new **LLM benchmarking app** and supporting code.
* **CLI:** `llm` with subcommands `run`, `routes list`, `diff`,
`ci-check`.
* **Runner:** executes globally numbered tasks; filters by `--lang`,
`--categories`, `--tasks`, `--providers`, `--models`.
* **Providers/clients:** route layer (`provider:model`) with HTTP clients
for LLM vendors; env-driven API keys and base URLs.
* **Evaluation:** deterministic scorers (hash/equality, JSON
shape/count, light schema/reducer parity) with clear failure messages.
* **Results:** stable JSON schema; single-file HTML viewer to
inspect/filter/export CSV.
* **Build & guards:** build script for compile-time setup and a
Spacetime guard.
* **Docs:** `DEVELOP.md` includes `cargo llm …` usage.
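The route layer's `provider:model` strings can be illustrated with a tiny parser (a hedged sketch; the actual routes module may validate providers and models further):

```rust
// Hypothetical sketch: split a route like "openai:gpt-5" into its
// provider and model halves, rejecting malformed inputs.
fn parse_route(route: &str) -> Option<(&str, &str)> {
    match route.split_once(':') {
        Some((provider, model)) if !provider.is_empty() && !model.is_empty() => {
            Some((provider, model))
        }
        _ => None,
    }
}

fn main() {
    println!("{:?}", parse_route("openai:gpt-5")); // Some(("openai", "gpt-5"))
    println!("{:?}", parse_route("bad-route"));    // None
}
```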
This PR is the initial addition of the app and its modules (runner,
config, routes, prompt/segmentation, scorers, schema/types,
defaults/constants/paths/hashing/combine, publishers, spacetime guard,
HTML stats viewer).
### How it works
1. **Pick what to run**
* Choose tasks (`--tasks 0,7,12`), or a language (`--lang rust|csharp`),
or categories (`--categories basics,schema`).
* Optionally limit vendors/models (`--providers …`, `--models …`).
2. **Resolve routes**
* Read env (API keys + base URLs) and build the active set (e.g.,
`openai:gpt-5`).
3. **Build context**
* Start Spacetime
* Publish golden-answer modules
* Prepare prompts and send them to the LLM
* Attempt to publish the LLM-generated module
4. **Execute calls**
* Run the selected tasks against the selected models and languages.
5. **Score outputs**
* Apply deterministic scorers (hash/equality, JSON shape/count, simple
schema/reducer checks).
* Record the score and any short failure reason.
6. **Update results file**
* Write/update the single results JSON with task/route outcomes,
timings, and summaries.
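The hash/equality scorer in step 5 might be sketched like this (an assumed shape; the real scorers also compare richer structures such as JSON shape and schema/reducer parity):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hedged sketch of a deterministic equality scorer: fingerprint both
// outputs after whitespace normalization and report a short failure reason.
// DefaultHasher is deterministic within a process, which is enough here.
fn fingerprint(s: &str) -> u64 {
    let mut h = DefaultHasher::new();
    s.split_whitespace().collect::<Vec<_>>().join(" ").hash(&mut h);
    h.finish()
}

fn score(expected: &str, actual: &str) -> Result<(), String> {
    if fingerprint(expected) == fingerprint(actual) {
        Ok(())
    } else {
        Err(format!(
            "output hash mismatch ({:#x} vs {:#x})",
            fingerprint(expected),
            fingerprint(actual)
        ))
    }
}

fn main() {
    println!("{:?}", score("a  b", "a b")); // Ok(())
    assert!(score("x", "y").is_err());
}
```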
# API and ABI breaking changes
None. New application and modules; no existing public APIs/ABIs altered.
# Expected complexity level and risk
**4/5.** New CLI, routing, evaluation, and artifact format.
* External model APIs may rate-limit/timeout; concurrency tunable via
`LLM_BENCH_CONCURRENCY` / `LLM_BENCH_ROUTE_CONCURRENCY`.
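For reference, reading such a tunable might look like the following (a sketch only; the fallback value of 4 is invented, not the tool's actual default):

```rust
use std::env;

// Hypothetical sketch: read LLM_BENCH_CONCURRENCY from the environment,
// falling back to an invented default when unset or unparsable.
fn bench_concurrency() -> usize {
    env::var("LLM_BENCH_CONCURRENCY")
        .ok()
        .and_then(|v| v.parse().ok())
        .filter(|&n| n > 0)
        .unwrap_or(4)
}

fn main() {
    println!("concurrency = {}", bench_concurrency());
}
```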
# Testing
I ran the full test matrix and generated results for every task against
every vendor, model, and language (rust + C#). I also tested the CI
check locally using [act](https://github.com/nektos/act).
**Please verify**
* [ ] `llm run --tasks 0,1,2` (explicit `run`)
* [ ] `llm run --lang rust --categories basics` (filters)
* [ ] `llm run --categories basics,schema` (multiple categories)
* [ ] `llm run --lang csharp` (language switch)
* [ ] `llm run --providers openai,anthropic --models "openai:gpt-5
anthropic:claude-sonnet-4-5"` (provider/model limits)
* [ ] `llm run --hash-only` (dry integrity)
* [ ] `llm run --goldens-only` (test goldens only)
* [ ] `llm run --force` (skip hash check)
* [ ] `llm ci-check`
* [ ] Stats viewer loads the JSON; filtering and CSV export work
* [ ] CI works as intended
---------
Signed-off-by: bradleyshep <148254416+bradleyshep@users.noreply.github.com>
Signed-off-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Tyler Cloutier <cloutiertyler@aol.com>
Co-authored-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: spacetimedb-bot <spacetimedb-bot@users.noreply.github.com>
Co-authored-by: John Detter <4099508+jdetter@users.noreply.github.com>