Commit Graph

25 Commits

Author SHA1 Message Date
bradleyshep 55c1b815d4 LLM Benchmark Results - Feb 26 (#4388)
# Description of Changes

- Update llm benchmark results based on latest testing
- Clean up details file to remove old models
- Regenerated the summary file

# API and ABI breaking changes

- None

# Expected complexity level and risk

- 1

# Testing

- Verify the new summary file loads correctly on the website.
2026-02-23 23:15:39 +00:00
Ryan e03b7d7516 Updated C# Name attribute to Accessor (#4306)
# Description of Changes

Implementation of #4295

Convert existing `Name` attribute to `Accessor` to support new Canonical
Case conversation of 2.0

# API and ABI breaking changes

Yes, in C# modules, we no longer use the attribute name `Name`, it
should now be `Accessor`

# Expected complexity level and risk

1

# Testing

- [X] Build and tested locally
- [X] Ran regression tests locally
2026-02-17 01:16:16 +00:00
Shubham Mishra e4098f98d9 Rust: macro change name -> accessor (#4264)
## Description of Changes

This PR primarily affects the `bindings-macro` and `schema` crates to
review:

### Core changes

1. Replaces the `name` macro with `accessor` for **Tables, Views,
Procedures, and Reducers** in Rust modules.
2. Extends `RawModuleDefV10` with a new section for:

   * case conversion policies
   * explicit names
    New sections are not validated in this PR so not functional.
3. Updates index behavior:

* Index names are now always **system-generated** for clients. Which
will be fixed in follow-up PR when we start validating RawModuleDef with
explicit names.
   * The `accessor` name for an index is used only inside the module.


## Breaking changes (API/ABI)

1. **Rust modules**

   * The `name` macro must be replaced with `accessor`.

2. **Client bindings (all languages)**

* Index names are now system-generated instead of using explicitly
provided names.


**Complexity:** 3

A follow-up PR will reintroduce explicit names with support for case
conversion.

---------

Co-authored-by: rekhoff <r.ekhoff@clockworklabs.io>
Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
Co-authored-by: clockwork-labs-bot <bot@clockworklabs.com>
2026-02-16 15:23:50 +00:00
bradleyshep e50b5f3930 LLM Oneshot Prompts, Oneshotted Apps, and Cursor Rules (C#/Rust/TS) (#4032)
# Description of Changes

This PR introduces an **LLM One-Shot App Generation** benchmarking
framework and comprehensive **Cursor rules for SpacetimeDB
development**.

**Cursor Rules (`docs/.cursor/rules/`):**
- `spacetimedb.mdc` - General SpacetimeDB concepts and architecture
- `spacetimedb-csharp.mdc` - C# module and client patterns
- `spacetimedb-rust.mdc` - Rust module development
- `spacetimedb-typescript.mdc` - TypeScript SDK usage

**LLM One-Shot Framework (`tools/llm-oneshot/`):**
- Benchmarking tool to measure how well AI can one-shot SpacetimeDB apps
- Composable prompt system with 12 cumulative feature levels (basic chat
→ full-featured with threading, permissions, presence, etc.)
- Support for 4 tech stacks: TypeScript+SpacetimeDB,
TypeScript+PostgreSQL, Rust+SpacetimeDB, C#+SpacetimeDB
- Additional Cursor rules for deployment, grading, and frontend patterns

**Sample One-Shotted Apps:**
- Multiple chat-app implementations (TypeScript, Rust, C#/MAUI)
- Multiple paint-app implementations (TypeScript, Rust, C#/MAUI)

# API and ABI breaking changes

None - this is a documentation and tooling addition only.

# Expected complexity level and risk

**1** - Trivial addition of documentation, tooling, and example
applications. No changes to core SpacetimeDB functionality. The
`tools/llm-oneshot/` folder is entirely self-contained and the Cursor
rules in `docs/` are purely informational.

# Testing

- [ ] Verified Cursor rules load correctly in IDE
- [ ] Ran one-shot generation with various prompt levels to validate
rules work
- [ ] Sample apps compile and deploy correctly
2026-02-11 18:24:20 +00:00
Tyler Cloutier 504b13ba4a Small docs improvement (#4071)
# Description of Changes

  Two documentation improvements:

1. **Reducers documentation**: Clarified that using global/static
variables in reducers is **undefined behavior**, not just "values won't
persist". Added six specific reasons why this is undefined:
     - Fresh execution environments
     - Module updates
     - Concurrent execution
     - Crash recovery
     - Non-transactional updates
     - Replay safety

2. **Access permissions documentation**: Replaced the "Combining Both
Techniques" example that used indexes on Option fields (which
SpacetimeDB doesn't support) with a working example that filters by a
required `department` field instead.

  # API and ABI breaking changes

  None. Documentation only.

  # Expected complexity level and risk

  1 - Documentation changes only.

  # Testing

  - [ ] Verify the reducers warning is clear and accurate
  - [ ] Verify the access permissions example compiles and makes sense

---------

Signed-off-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
Co-authored-by: John Detter <4099508+jdetter@users.noreply.github.com>
Co-authored-by: Zeke Foppa <bfops@users.noreply.github.com>
2026-01-27 21:37:11 +00:00
John Detter 2044a536b0 Upgrade version to 1.12.0 (#4084)
# Description of Changes

<!-- Please describe your change, mention any related tickets, and so on
here. -->

Version upgrade to `v1.12.0`.

# API and ABI breaking changes

<!-- If this is an API or ABI breaking change, please apply the
corresponding GitHub label. -->

None

# Expected complexity level and risk

1 - this is just a version upgrade

<!--
How complicated do you think these changes are? Grade on a scale from 1
to 5,
where 1 is a trivial change, and 5 is a deep-reaching and complex
change.

This complexity rating applies not only to the complexity apparent in
the diff,
but also to its interactions with existing and future code.

If you answered more than a 2, explain what is complex about the PR,
and what other components it interacts with in potentially concerning
ways. -->

# Testing

<!-- Describe any testing you've done, and any testing you'd like your
reviewers to do,
so that you're confident that all the changes work as expected! -->

The testsuite failures are fixed by
https://github.com/clockworklabs/SpacetimeDB/pull/4120

- [x] License has been properly updated including version number and
date
- [x] CI passes

---------

Co-authored-by: rekhoff <r.ekhoff@clockworklabs.io>
Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
2026-01-27 18:15:36 +00:00
clockwork-tien f7a22018d0 feat: create Vue framework sdk (#4037)
# Description of Changes

The PR implements the following updates:
- Create Vue framework sdk
- Add Vue Quickstart Docs
- Create `vue-ts` template

# Screenshots
- `vue-ts` template
<img width="1512" height="862" alt="Screenshot"
src="https://github.com/user-attachments/assets/15c8209b-ec7f-4f4a-a5b4-5174ddd068be"
/>

- Vue Quickstart Docs
<img width="1392" height="854" alt="image"
src="https://github.com/user-attachments/assets/57650232-81fa-43d3-be5a-135aa1799f05"
/>


<!-- Please describe your change, mention any related tickets, and so on
here. -->

# API and ABI breaking changes

<!-- If this is an API or ABI breaking change, please apply the
corresponding GitHub label. -->

# Expected complexity level and risk

<!--
How complicated do you think these changes are? Grade on a scale from 1
to 5,
where 1 is a trivial change, and 5 is a deep-reaching and complex
change.

This complexity rating applies not only to the complexity apparent in
the diff,
but also to its interactions with existing and future code.

If you answered more than a 2, explain what is complex about the PR,
and what other components it interacts with in potentially concerning
ways. -->

# Testing

<!-- Describe any testing you've done, and any testing you'd like your
reviewers to do,
so that you're confident that all the changes work as expected! -->

- [ ] <!-- maybe a test you want to do -->
- [ ] <!-- maybe a test you want a reviewer to do, so they can check it
off when they're satisfied. -->

---------

Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
2026-01-26 16:41:18 +00:00
Zeke Foppa 7f6fd18018 Re-run the LLM benchmarks update (#4110)
# Description of Changes

Try fixing this again? It seems to pass on PRs if re-run.

# API and ABI breaking changes

None.

# Expected complexity level and risk

1

# Testing

- [x] It passes on this PR now 🤷

---------

Co-authored-by: Zeke Foppa <bfops@users.noreply.github.com>
Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
2026-01-23 22:58:10 +00:00
Piotr Sarnacki cd1ec90d16 Templates naming standarization (#4042)
# Description of Changes

This PR renames the templates to always use shorthand for the language,
specify a framework (or console) if necessary, and shorten the naming in
general

# Expected complexity level and risk

1

# Testing

I've tested generating templates manually

---------

Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
2026-01-23 16:08:23 +00:00
Tyler Cloutier 3b9497e318 Empty commit basically (#4088)
Empty commit to fix llm benchmark.

---------

Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
2026-01-21 20:41:38 -05:00
Tyler Cloutier b5a7b37660 Fixes C# benchmark test failures caused by table naming convention mismatches (#4059)
# Description of Changes

Adds TypeScript as a third language for LLM benchmark tests alongside
Rust and C#, and fixes table naming convention mismatches.

  **TypeScript Support:**
  - Added `Lang::TypeScript` variant with camelCase naming conventions
- Created TypeScript project template (`templates/typescript/server/`)
with package.json, tsconfig.json, and index.ts
- Added TypeScript publisher that uses `spacetime build` and `spacetime
publish`
- Created 22 TypeScript task prompts and golden answer files for all
benchmark tests
  - Updated prompt discovery to find `tasks/typescript.txt` files

  **Table Naming Fix:**
  - Standardized on singular table names across all languages:
    - Rust: `user` (snake_case singular)
    - C#: `User` (PascalCase singular)  
    - TypeScript: `user` (camelCase singular)
- Updated `table_name()` helper to convert singular names to appropriate
case per language
- Updated all spec.rs files to use `table_name("user", lang)` instead of
hardcoded `"users"`

  **CI/Hashing Improvements:**
- Added `compute_processed_context_hash()` for language-specific hash
computation after tab filtering
- Updated CI check to verify both `rustdoc_json` and `docs` modes for
Rust
  - Fixed `--hash-only` mode to skip golden builds

  # API and ABI breaking changes

  None - these are internal benchmark tooling changes only.

  # Expected complexity level and risk

  **Complexity: 2**

The changes add a new language following existing patterns for Rust and
C#. The table naming fixes are straightforward find-and-replace style
updates. Low risk since this only affects the benchmark tooling, not the
core SpacetimeDB codebase.

  # Testing

  - [x] `cargo build -p xtask-llm-benchmark` compiles successfully
  - [x] All 22 TypeScript golden modules build and publish successfully
  - [x] Rust and C# benchmarks unaffected by changes

---------

Signed-off-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
Co-authored-by: John Detter <4099508+jdetter@users.noreply.github.com>
2026-01-21 03:51:33 +00:00
Julien Lavocat 7f4a6203a4 Add Clerk tutorial (#4062)
# Description of Changes

Add Clerk tutorial

# API and ABI breaking changes

None.

# Expected complexity level and risk

1

# Testing

Tested locally

---------

Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
2026-01-21 01:09:29 +00:00
Julien Lavocat 283f03ac49 Add Auth0 tutorial (#4048)
# Description of Changes

Add Auth0 tutorial 

<img width="1612" height="1350" alt="image"
src="https://github.com/user-attachments/assets/077c156c-ec01-4fb5-aed2-ed728eac049c"
/>
<img width="1612" height="1350" alt="image"
src="https://github.com/user-attachments/assets/075f42e5-6cb1-4766-bd68-169c6ed70be9"
/>


# API and ABI breaking changes

None.

# Expected complexity level and risk

1

# Testing

- Followed rendered tutorial on the locally running website

---------

Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
2026-01-20 15:03:12 +00:00
Tyler Cloutier 73881e38f7 Further misc docs changes (#4029)
# Description of Changes

Major documentation overhaul focusing on tables, column types, and
indexes.

  **Quickstart Guides:**
- Updated React, TypeScript, Rust, and C# quickstarts with table/reducer
examples
  - Fixed CLI syntax (positional `--database` argument)
  - Improved template consistency across languages

  **Tables Documentation:**
- Added "Why Tables" section explaining table-oriented design philosophy
(tables as fundamental unit, system tables, data-oriented design
principles)
- Added "Physical and Logical Independence" section explaining how
subscription queries use the relational model independently of physical
storage
- Added brief sections linking to related pages (Visibility,
Constraints, Schedule Tables)
- Renamed "Scheduled Tables" to "Schedule Tables" throughout (tables
store schedules; reducers are scheduled)

  **Column Types:**
  - Split into dedicated page with unified type reference table
- Added "Representing Collections" section (Vec/Array vs table
tradeoffs)
  - Added "Binary Data and Files" section for Vec<u8> storage patterns
- Added "Type Performance" section (smaller types, fixed-size types,
column ordering for alignment)
  - Added complete example struct demonstrating all type categories
  - Renamed "Structured" category to "Composite"

  **Indexes:**
  - Complete rewrite with textbook-style documentation
  - Added "When to Use Indexes" guidance
- Documented single-column and multi-column index syntax (field-level
and table-level)
- Comprehensive range query examples with correct TypeScript `Range`
class syntax
  - Explained multi-column index prefix matching semantics
  - Added index-accelerated deletion examples
  - Included index design guidelines

  **Styling:**
  - Added CSS for table border radius and row separators
  - Created Check component for green checkmarks in tables

  # API and ABI breaking changes

  None. Documentation only.

  # Expected complexity level and risk

  1 - Documentation changes only, no code changes.

  # Testing

  - [ ] Verify docs build without errors
  - [ ] Review rendered pages for formatting issues
  - [ ] Confirm code examples are syntactically correct

---------

Signed-off-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Signed-off-by: John Detter <4099508+jdetter@users.noreply.github.com>
Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
Co-authored-by: John Detter <4099508+jdetter@users.noreply.github.com>
2026-01-17 17:44:58 +00:00
Boegie19 eb11d67f91 Fix LLM benchmark Rust wrong struct name (#4043)
# Description of Changes

<!-- Please describe your change, mention any related tickets, and so on
here. -->
Fix LLM benachmarks in rust since they used Result instead of ResultRow
in the request to the LLM making it always fail.
1. see the answer's file there it is ResultRow.
2. Result is a keyword so it will always fail

# API and ABI breaking changes

<!-- If this is an API or ABI breaking change, please apply the
corresponding GitHub label. -->

# Expected complexity level and risk

<!--
How complicated do you think these changes are? Grade on a scale from 1
to 5,
where 1 is a trivial change, and 5 is a deep-reaching and complex
change.

This complexity rating applies not only to the complexity apparent in
the diff,
but also to its interactions with existing and future code.

If you answered more than a 2, explain what is complex about the PR,
and what other components it interacts with in potentially concerning
ways. -->
1 
# Testing

<!-- Describe any testing you've done, and any testing you'd like your
reviewers to do,
so that you're confident that all the changes work as expected! -->

Run /update-llm-benchmark to see if more passes.

---------

Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
Co-authored-by: Zeke Foppa <bfops@users.noreply.github.com>
2026-01-16 03:56:17 +00:00
Kim Altintop 4a4bb0cc5a docs: Add confirmed reads configuration to SDK reference docs (#4036)
Confirmed reads are arguably something users should be aware of.

# Expected complexity level and risk

1

# Testing

Documentation update.

---------

Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
2026-01-15 10:16:20 +00:00
Julien Lavocat 9cf40f4998 Update authentication docs (#4020)
# Description of Changes

- Add an "Authentication" section under "Core concepts"
- Add an introduction about how and explanations about common topics
- Move claims usage out of SpacetimeAuth docs
- Update SpacetimeAuth docs to the new version of the dashboard

# API and ABI breaking changes

None.

# Expected complexity level and risk

2

# Testing

Tested locally using Docusaurus

---------

Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
2026-01-14 14:32:46 +00:00
Zeke Foppa c27db13911 Docs - insert -> try_insert (#4025)
# Description of Changes

Fix a small docs thing that a user noticed in
https://github.com/clockworklabs/SpacetimeDB/issues/3970.

# API and ABI breaking changes

None

# Expected complexity level and risk

1

# Testing

- [x] I copy-pasted the new `client_connected` reducer into a test
project and ran `spacetime build`

---------

Co-authored-by: Zeke Foppa <bfops@users.noreply.github.com>
Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
2026-01-14 00:13:02 +00:00
Tyler Cloutier 72574695f8 Fixes basic issues using the basic-react template. (#4017)
# Description of Changes

- Made spacetime dev <database> a positional argument and deprecated
--database
- Fixed double connection in React SDK
- Added a more descriptive error message to unresolved table name.

# API and ABI breaking changes

Deprecates `--database`. Still works, but it prints with a warning.

# Expected complexity level and risk

2

# Testing

- [x] I have tested that the double render fix works in React

---------

Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
2026-01-13 19:13:52 +00:00
Tyler Cloutier d78517fd9a Misc docs and small CLI improvements (#3953)
# Description of Changes

This PR does several small things:

1. It removes the explicit `h1` tags on every page, and either uses the
side bar title directly, or puts it in the frontmatter
2. It merges what are currently called quickstarts into a single Chat
App Tutorial
3. It creates new quickstarts which just use `spacetime dev --template`
to get you up and running quickly
4. It adds a "The Zen of SpacetimeDB" page much like the Zen of Python
which goes over the 5 key principles of SpacetimeDB
5. It reorders all Tabs groups so that the ordering is `TypeScript`,
`C#`, `Rust`, `Unreal`, `C++`, `Blueprints` (order of decreasing
popularity).
6. It improves the sidebar navigation by having categories act as
overview pages, and also fixes the breadcrumbs
7. It fixes various small typos and issues
8. Closes #3610 and adds cursor rules files generally
9. It fixes general styling on the docs page by bring it inline with the
UI design:

Old:
<img width="1678" height="958" alt="image"
src="https://github.com/user-attachments/assets/f36efee6-b81a-4463-a179-da68b3a7152e"
/>

New:
<img width="1678" height="957" alt="image"
src="https://github.com/user-attachments/assets/f430f77d-0663-47f2-9727-45cbfe10e4c7"
/>


https://github.com/user-attachments/assets/adc5a78a-ada8-45b5-8078-a45cb81477a3

# API and ABI breaking changes

This PR does NOT change any old links. It does add new pages though.

# Expected complexity level and risk

3 - it's a large change. I manually tested the TypeScript Chat App
Tutorial but I have not gone through the Rust and C# quickstarts.
However, we have testing on the quickstarts and this is text only so can
be carefully reviewed.

# Testing

<!-- Describe any testing you've done, and any testing you'd like your
reviewers to do,
so that you're confident that all the changes work as expected! -->

- [x] Ran through each step of the Chat App TypeScript tutorial to
ensure it is working
- [x] Ran and tested the styles and the functionality of the side bar

---------

Signed-off-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: spacetimedb-bot <spacetimedb-bot@users.noreply.github.com>
Co-authored-by: clockworklabs-bot <clockworklabs-bot@users.noreply.github.com>
Co-authored-by: Zeke Foppa <bfops@users.noreply.github.com>
Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
2026-01-13 00:14:48 +00:00
homersimpsons 1053c5f4d8 Fix link to Blackholio demo in Unreal tutorial and update project structure in README (#3994)
# Description of Changes

Updated the link to target SpacetimeDB repository instead of archived
blackholio repository.

Update the repository structure in the README.md to match the current
one (also sort it alphabetically).

# API and ABI breaking changes

NA

# Expected complexity level and risk

1

# Testing

Visit https://spacetimedb.com/docs/unreal/. this link is already used a
bit below for the text "Blackhol.io"

---------

Signed-off-by: homersimpsons <guillaume.alabre@gmail.com>
Signed-off-by: Zeke Foppa <196249+bfops@users.noreply.github.com>
Co-authored-by: Zeke Foppa <196249+bfops@users.noreply.github.com>
Co-authored-by: Zeke Foppa <bfops@users.noreply.github.com>
2026-01-12 22:29:08 +00:00
Kilian Strunz cbe967e736 Combine table and index attributes in Rust example (#3948)
# Description of Changes

The syntax is not supported like shows and must be included in the
`table` macro.

<!-- Please describe your change, mention any related tickets, and so on
here. -->

# API and ABI breaking changes

None.

<!-- If this is an API or ABI breaking change, please apply the
corresponding GitHub label. -->

# Expected complexity level and risk

0 No runtime change

<!--
How complicated do you think these changes are? Grade on a scale from 1
to 5,
where 1 is a trivial change, and 5 is a deep-reaching and complex
change.

This complexity rating applies not only to the complexity apparent in
the diff,
but also to its interactions with existing and future code.

If you answered more than a 2, explain what is complex about the PR,
and what other components it interacts with in potentially concerning
ways. -->

# Testing

<!-- Describe any testing you've done, and any testing you'd like your
reviewers to do,
so that you're confident that all the changes work as expected! -->
- [x] Code compiles now.

---------

Signed-off-by: Kilian Strunz <93079615+kistz@users.noreply.github.com>
Co-authored-by: Zeke Foppa <196249+bfops@users.noreply.github.com>
Co-authored-by: Zeke Foppa <bfops@users.noreply.github.com>
2026-01-12 03:16:03 +00:00
Tyler Cloutier d3c2459bf0 Update 00300-rust.md (#3943)
# Description of Changes

Minor docs typo fix suggested here:
https://discord.com/channels/1037340874172014652/1452259224976625737/1452259224976625737

# API and ABI breaking changes

None

# Expected complexity level and risk

1

---------

Signed-off-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: spacetimedb-bot <spacetimedb-bot@users.noreply.github.com>
2026-01-11 19:46:51 +00:00
Piotr Sarnacki 3c8836b1a3 Templates rework (#3879)
# Description of Changes

We would like to move all of the templates to a central directory

# API and ABI breaking changes

None

# Expected complexity level and risk

2

# Testing

---------

Co-authored-by: spacetimedb-bot <spacetimedb-bot@users.noreply.github.com>
2026-01-09 15:09:26 +00:00
bradleyshep b75bf6decf LLM Benchmarking (#3486)
# Description of Changes

Introduce a new **LLM benchmarking app** and supporting code.

* **CLI:** `llm` with subcommands `run`, `routes list`, `diff`,
`ci-check`.
* **Runner:** executes globally numbered tasks; filters by `--lang`,
`--categories`, `--tasks`, `--providers`, `--models`.
* **Providers/clients:** route layer (`provider:model`) with HTTP LLM
Vendor clients; env-driven keys/base URLs.
* **Evaluation:** deterministic scorers (hash/equality, JSON
shape/count, light schema/reducer parity) with clear failure messages.
* **Results:** stable JSON schema; single-file HTML viewer to
inspect/filter/export CSV.
* **Build & guards:** build script for compile-time setup;
* **Docs:** `DEVELOP.md` includes `cargo llm …` usage.

This PR is the initial addition of the app and its modules (runner,
config, routes, prompt/segmentation, scorers, schema/types,
defaults/constants/paths/hashing/combine, publishers, spacetime guard,
HTML stats viewer).

### How it works
1. **Pick what to run**

* Choose tasks (`--tasks 0,7,12`), or a language (`--lang rust|csharp`),
or categories (`--categories basics,schema`).
   * Optionally limit vendors/models (`--providers …`, `--models …`).

2. **Resolve routes**

* Read env (API keys + base URLs) and build the active set (e.g.,
`openai:gpt-5`).

3. **Build context**

   * Start Spacetime
   * Publish golden answer modules
   * Prepare prompts and send to LLM model
   * Attempt to publish LLM module

4. **Execute calls**

* Run the selected tasks within each test against selected models and
languages.

5. **Score outputs**

* Apply deterministic scorers (hash/equality, JSON shape/count, simple
schema/reducer checks).
   * Record the score and any short failure reason.

6. **Update results file**

* Write/update the single results JSON with task/route outcomes,
timings, and summaries.


# API and ABI breaking changes

None. New application and modules; no existing public APIs/ABIs altered.

# Expected complexity level and risk

**4/5.** New CLI, routing, evaluation, and artifact format.

* External model APIs may rate-limit/timeout; concurrency tunable via
`LLM_BENCH_CONCURRENCY` / `LLM_BENCH_ROUTE_CONCURRENCY`.

# Testing

I ran the full test matrix and generated results for every task against
every vendor, model, and language (rust + C#). I also tested the CI
check locally using [act](https://github.com/nektos/act).

**Please verify**

* [ ] `llm run --tasks 0,1,2` (explicit `run`)
* [ ] `llm run --lang rust --categories basics` (filters)
* [ ] `llm run --categories basics,schema` (multiple categories)
* [ ] `llm run --lang csharp` (language switch)
* [ ] `llm run --providers openai,anthropic --models "openai:gpt-5
anthropic:claude-sonnet-4-5"` (provider/model limits)
* [ ] `llm run --hash-only` (dry integrity)
* [ ] `llm run --goldens-only` (test goldens only)
* [ ] `llm run --force` (skip hash check)
* [ ] `llm ci-check`
* [ ] Stats viewer loads the JSON; filtering and CSV export work
* [ ] CI works as intended

---------

Signed-off-by: bradleyshep <148254416+bradleyshep@users.noreply.github.com>
Signed-off-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Tyler Cloutier <cloutiertyler@aol.com>
Co-authored-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: spacetimedb-bot <spacetimedb-bot@users.noreply.github.com>
Co-authored-by: John Detter <4099508+jdetter@users.noreply.github.com>
2026-01-06 22:22:57 +00:00