bradleyshep fb0a458d4f LLM Benchmark: Sequential Upgrades Test (#4817)
# Description of Changes

AI app generation benchmark comparing SpacetimeDB vs PostgreSQL (Express
+ Socket.io + Drizzle ORM). Same AI model (Claude Sonnet 4.6), same
prompts, same chat app, two backends. Upgraded through 12 feature
levels, manually graded at each level, bugs fixed, all costs measured
via OpenTelemetry.

Results viewable at:
https://spacetimedb.com/llms-benchmark-sequential-upgrade

## Benchmark harness (`tools/llm-sequential-upgrade/`)

- `run.sh`: orchestrates headless Claude Code sessions for code
generation, sequential upgrades, and bug fixes. Tracks all API costs via
OTel. Supports `--upgrade`, `--fix`, `--composed-prompt`,
`--resume-session` modes.
- `grade.sh` / `grade-agents.sh`: grading harnesses for manual testing
of generated apps.
- `docker-compose.otel.yaml`: OTel collector + PostgreSQL services.
- `generate-report.mjs` / `parse-telemetry.mjs`: aggregate per-session
telemetry into cost reports.
- Backend guidelines in `backends/`: SpacetimeDB SDK reference, config
templates, server setup docs, PostgreSQL setup with Drizzle/Socket.io
guidance.

**Note: after https://github.com/clockworklabs/SpacetimeDB/pull/4740
merges, we will likely want to update the harness so that it reads
backend and SDK guidance from SKILLS.**

## Two complete benchmark runs

**Run 1 (20260403):** Original methodology.
**Run 2 (20260406):** Refined methodology with domain bias removed from
SpacetimeDB SDK docs and PostgreSQL instructions made
feature-spec-neutral.
**Note: these changes produced no meaningful change in results. Domain
familiarity bias was very small and almost certainly not the cause of
SpacetimeDB's major gains over the PostgreSQL stack.**

Each run contains full L1-L12 app source for both backends, level
snapshots preserving state before each upgrade, and per-session OTel
cost summaries.

## 12 feature levels

| Level | Feature |
|---|---|
| L1 | Basic Chat + Typing + Read Receipts + Unread Counts |
| L2 | Scheduled Messages |
| L3 | Ephemeral Messages |
| L4 | Message Reactions |
| L5 | Message Editing with History |
| L6 | Real-Time Permissions (kick, ban, promote) |
| L7 | Rich User Presence |
| L8 | Message Threading |
| L9 | Private Rooms + Direct Messages |
| L10 | Room Activity Indicators |
| L11 | Draft Sync |
| L12 | Anonymous to Registered Migration |

## Results

| | Run 1 (20260403) | Run 2 (20260406) |
|---|---|---|
| **SpacetimeDB total cost** | $13.33 | $12.62 |
| **PostgreSQL total cost** | $17.80 | $19.68 |
| **SpacetimeDB bugs** | 5 | 2 |
| **PostgreSQL bugs** | 19 | 8 |
| **SpacetimeDB fix sessions** | 4 | 1 |
| **PostgreSQL fix sessions** | 17 | 10 |

Both runs agree: SpacetimeDB apps are cheaper to build, have fewer bugs,
and require fewer fix iterations. Under the refined methodology (Run 2)
the cost gap widened from $4.47 to $7.06, which **confirmed the
advantage is structural, not an artifact of domain-biased SDK docs.**
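
The comparison above can be re-derived from the table with a few lines
of arithmetic. This is a minimal sketch, not part of the harness: the
object layout and the `compareRun` helper are illustrative names, and
only the numbers are taken from the Results table.

```javascript
// Figures copied from the Results table (costs in USD).
const runs = {
  "Run 1 (20260403)": { stdbCost: 13.33, pgCost: 17.80, stdbBugs: 5, pgBugs: 19 },
  "Run 2 (20260406)": { stdbCost: 12.62, pgCost: 19.68, stdbBugs: 2, pgBugs: 8 },
};

// Hypothetical helper: derive the absolute cost gap and the
// PostgreSQL-to-SpacetimeDB ratios for a single run.
function compareRun(r) {
  return {
    costGapUsd: Number((r.pgCost - r.stdbCost).toFixed(2)),
    costRatio: Number((r.pgCost / r.stdbCost).toFixed(2)),
    bugRatio: Number((r.pgBugs / r.stdbBugs).toFixed(1)),
  };
}

for (const [name, r] of Object.entries(runs)) {
  console.log(name, compareRun(r));
}
```

The cost gap is $4.47 in Run 1 and $7.06 in Run 2, and PostgreSQL
produced roughly four times as many bugs in both runs.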

## Performance benchmark (`perf-benchmark/`)

Stress throughput tool that fires concurrent writers at peak saturation
against the AI-generated `send_message` handlers.

| Tier | SpacetimeDB (avg) | PostgreSQL (avg) | Ratio |
|---|---|---|---|
| AI-generated (as-shipped) | 5,267 msgs/sec | 694 msgs/sec | 7.6x |
| PG rate limit removed | 5,267 msgs/sec | 1,070 msgs/sec | 4.9x |
| Optimized (same features kept) | 25,278 msgs/sec | 1,139 msgs/sec | 22x |

The gap widens with optimization because SpacetimeDB's bottleneck is
fixable code patterns in the reducer while PostgreSQL's bottleneck is
architectural (sequential network round-trips to an external database).
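
The Ratio column in the table above is the SpacetimeDB average divided
by the PostgreSQL average. A small sketch for re-checking it follows.
The `throughputRatio` and `formatRatio` helpers are illustrative names,
and the rounding rule (one decimal below 10x, whole numbers above) is an
assumption inferred from how the table is printed.

```javascript
// Average throughput per tier, copied from the table (msgs/sec).
const tiers = [
  { tier: "AI-generated (as-shipped)", stdb: 5267, pg: 694 },
  { tier: "PG rate limit removed", stdb: 5267, pg: 1070 },
  { tier: "Optimized (same features kept)", stdb: 25278, pg: 1139 },
];

// Ratio of SpacetimeDB throughput to PostgreSQL throughput.
function throughputRatio(stdb, pg) {
  return stdb / pg;
}

// Assumed display rule: one decimal below 10x, whole numbers otherwise.
function formatRatio(ratio) {
  return ratio >= 10 ? `${Math.round(ratio)}x` : `${ratio.toFixed(1)}x`;
}

for (const t of tiers) {
  console.log(t.tier, formatRatio(throughputRatio(t.stdb, t.pg)));
}
```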

Optimized reference code with all features preserved is in
`perf-benchmark/results/optimized-reference/`.

## Data handling

Per-session cost summaries (`cost-summary.json`, `COST_REPORT.md`,
`metadata.json`) are committed. Raw OTel telemetry
(`raw-telemetry.jsonl`) containing PII is excluded via `.gitignore` and
stored privately.

# API and ABI breaking changes

None. All changes are in `tools/llm-sequential-upgrade/`. No production
code, library, or SDK changes.

# Expected complexity level and risk

**1 - Trivial.** Self-contained benchmarking tooling and data. No
interaction with production code.

# Testing

- [x] L1-L12 upgrades completed on all 4 apps (2 backends x 2 runs) with
OTel cost capture
- [x] All levels manually graded after each upgrade; bugs filed and
fixed via the harness
- [x] Methodology refinement between runs validated (domain bias
removal, feature-neutral instructions)
- [x] Stress benchmarks run across both runs x 3 tiers (as-shipped,
rate-limit-removed, optimized)
- [x] Optimized benchmarks verified to preserve all original features
- [x] Sensitive data (PII in raw telemetry) removed from repo and
gitignored
- [ ] Reviewer: spot-check that METRICS_DATA.json / METRICS_REPORT.json
numbers match the telemetry cost-summary.json files

---------

Co-authored-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
2026-06-10 16:37:33 +00:00

LLM Sequential Upgrade Benchmark

Automated benchmark harness for measuring AI app-generation cost, bug rate, and code size across backends. Designed to produce directly comparable data for the same app built on different stacks.

Results viewer: https://spacetimedb.com/llms-benchmark-sequential-upgrade

Generated test data (app source, telemetry, cost summaries): https://github.com/clockworklabs/spacetimedb-ai-test-results

What this measures

For each backend under test, the benchmark runs the following loop, with all code generation driven by headless Claude Code sessions:

  1. The harness generates a chat app from the L1 feature spec
  2. The harness upgrades the app through L2-L12, one feature group at a time
  3. After each level, a human grades the app against the feature spec
  4. Bugs are filed as BUG_REPORT.md and fixed via a separate Claude Code session
  5. All API costs are captured via OpenTelemetry and written to per-session cost summaries

Side-by-side results give a direct comparison of AI-generation cost across backends for the same functional target.

Directory contents

  • run.sh: orchestrates generation, upgrade, and fix sessions. Supports --upgrade, --fix, --composed-prompt, --resume-session.
  • grade.sh / grade-agents.sh / grade-playwright.sh: grading harnesses (manual + automated)
  • benchmark.sh / run-loop.sh: batch runners for parallel or sequential benchmark execution
  • cleanup.sh / reset-app.sh: dev utilities
  • benchmark-viewer.html: local viewer for METRICS_DATA.json files (open in browser, drop JSON)
  • generate-report.mjs: aggregate per-session cost-summary.json into a markdown report
  • parse-telemetry.mjs: parse OTel log stream into per-session cost-summary.json
  • parse-playwright-results.mjs: convert Playwright JSON output to grading markdown
  • docker-compose.otel.yaml / otel-collector-config.yaml: OTel collector + PostgreSQL
  • backends/: per-backend setup / SDK reference documents given to the AI
  • perf-benchmark/: runtime throughput benchmark (msgs/sec) for the AI-generated apps
  • CLAUDE.md / DEVELOP.md / GRADING.md / GRADING_WORKFLOW.md: process documentation
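
To illustrate the kind of aggregation generate-report.mjs is described as doing, here is a minimal sketch that sums per-session costs by backend and renders a markdown table. It is not the actual script. The record shape (`backend`, `level`, `costUsd`) is a hypothetical stand-in, since the real cost-summary.json schema is not shown here, and the sample values are placeholders rather than benchmark data.

```javascript
// HYPOTHETICAL record shape and PLACEHOLDER values, for illustration only.
// The real cost-summary.json schema may differ.
const exampleSummaries = [
  { backend: "spacetime", level: 1, costUsd: 1.25 },
  { backend: "spacetime", level: 2, costUsd: 0.5 },
  { backend: "postgres", level: 1, costUsd: 2.0 },
];

// Sum session costs per backend.
function totalCostByBackend(summaries) {
  const totals = new Map();
  for (const s of summaries) {
    totals.set(s.backend, (totals.get(s.backend) ?? 0) + s.costUsd);
  }
  return totals;
}

// Render the totals as a markdown table, one row per backend.
function toMarkdownTable(totals) {
  const rows = [...totals].map(
    ([backend, cost]) => `| ${backend} | $${cost.toFixed(2)} |`
  );
  return ["| Backend | Total cost |", "|---|---|", ...rows].join("\n");
}

console.log(toMarkdownTable(totalCostByBackend(exampleSummaries)));
```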

Running a benchmark

# Prereqs: Claude CLI installed, Docker running, SpacetimeDB installed
docker compose -f docker-compose.otel.yaml up -d

# Generate L1 from scratch
./run.sh --backend spacetime --level 1
./run.sh --backend postgres --level 1

# Upgrade through levels
./run.sh --upgrade <app-dir> --level 2 --composed-prompt
# ... continue through L12

# Fix bugs found during grading
./run.sh --fix <app-dir> --level N

Generated apps and telemetry land in sequential-upgrade/sequential-upgrade-<timestamp>/ locally. For published test data from canonical runs, see the AI Test Results repo: https://github.com/clockworklabs/spacetimedb-ai-test-results
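
The "continue through L12" step above can be expanded mechanically. The sketch below only builds the command strings, using the same flags shown in the upgrade example. `upgradeCommands` is an illustrative helper, not part of the harness, and the `<app-dir>` placeholder is left as-is.

```javascript
// Build the run.sh invocations for upgrading one app from L2 through L12.
// appDir stands for the directory produced by the L1 generation step.
function upgradeCommands(appDir, firstLevel = 2, lastLevel = 12) {
  const commands = [];
  for (let level = firstLevel; level <= lastLevel; level++) {
    commands.push(
      `./run.sh --upgrade ${appDir} --level ${level} --composed-prompt`
    );
  }
  return commands;
}

const plan = upgradeCommands("<app-dir>");
console.log(plan.join("\n"));
```

Because each level is graded by hand and may need a --fix session before the next upgrade, the list is better read as a checklist than as a script to run unattended.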

Performance benchmark

perf-benchmark/ contains a runtime stress tool that fires concurrent writers against a generated app's send_message handler to measure sustained throughput in messages/sec. See perf-benchmark/README.md for usage.