# Description of Changes

AI app generation benchmark comparing SpacetimeDB vs PostgreSQL (Express + Socket.io + Drizzle ORM). Same AI model (Claude Sonnet 4.6), same prompts, same chat app, two backends. Upgraded through 12 feature levels, manually graded at each level, bugs fixed, all costs measured via OpenTelemetry.

Results viewable at: https://spacetimedb.com/llms-benchmark-sequential-upgrade

## Benchmark harness (`tools/llm-sequential-upgrade/`)

- `run.sh`: orchestrates headless Claude Code sessions for code generation, sequential upgrades, and bug fixes. Tracks all API costs via OTel. Supports `--upgrade`, `--fix`, `--composed-prompt`, `--resume-session` modes.
- `grade.sh` / `grade-agents.sh`: grading harnesses for manual testing of generated apps.
- `docker-compose.otel.yaml`: OTel collector + PostgreSQL services.
- `generate-report.mjs` / `parse-telemetry.mjs`: aggregate per-session telemetry into cost reports.
- Backend guidelines in `backends/`: SpacetimeDB SDK reference, config templates, server setup docs, PostgreSQL setup with Drizzle/Socket.io guidance.

**After https://github.com/clockworklabs/SpacetimeDB/pull/4740 merges, we will likely want to update this so that it reads backend and SDK guidance from SKILLS**

## Two complete benchmark runs

**Run 1 (20260403):** Original methodology.

**Run 2 (20260406):** Refined methodology with domain bias removed from SpacetimeDB SDK docs and PostgreSQL instructions made feature-spec-neutral.

**Note: no meaningful changes in results were observed with these changes. Domain familiarity biases were very small and almost certainly not the cause of STDB's major gains over PG stack.**

Each run contains full L1-L12 app source for both backends, level snapshots preserving state before each upgrade, and per-session OTel cost summaries.

## 12 feature levels

| Level | Feature |
|---|---|
| L1 | Basic Chat + Typing + Read Receipts + Unread Counts |
| L2 | Scheduled Messages |
| L3 | Ephemeral Messages |
| L4 | Message Reactions |
| L5 | Message Editing with History |
| L6 | Real-Time Permissions (kick, ban, promote) |
| L7 | Rich User Presence |
| L8 | Message Threading |
| L9 | Private Rooms + Direct Messages |
| L10 | Room Activity Indicators |
| L11 | Draft Sync |
| L12 | Anonymous to Registered Migration |

## Results

| | Run 1 (20260403) | Run 2 (20260406) |
|---|---|---|
| **SpacetimeDB total cost** | $13.33 | $12.62 |
| **PostgreSQL total cost** | $17.80 | $19.68 |
| **SpacetimeDB bugs** | 5 | 2 |
| **PostgreSQL bugs** | 19 | 8 |
| **SpacetimeDB fix sessions** | 4 | 1 |
| **PostgreSQL fix sessions** | 17 | 10 |

Both runs agree: SpacetimeDB apps are cheaper to build, have fewer bugs, and require fewer fix iterations. The refined methodology (Run 2) widened the cost gap and **confirmed the advantage is structural, not an artifact of domain-biased SDK docs.**

## Performance benchmark (`perf-benchmark/`)

Stress throughput tool that fires concurrent writers at peak saturation against the AI-generated `send_message` handlers.

| Tier | SpacetimeDB (avg) | PostgreSQL (avg) | Ratio |
|---|---|---|---|
| AI-generated (as-shipped) | 5,267 msgs/sec | 694 msgs/sec | 7.6x |
| PG rate limit removed | 5,267 msgs/sec | 1,070 msgs/sec | 4.9x |
| Optimized (same features kept) | 25,278 msgs/sec | 1,139 msgs/sec | 22x |

The gap widens with optimization because SpacetimeDB's bottleneck is fixable code patterns in the reducer while PostgreSQL's bottleneck is architectural (sequential network round-trips to an external database).

Optimized reference code with all features preserved is in `perf-benchmark/results/optimized-reference/`.

## Data handling

Per-session cost summaries (`cost-summary.json`, `COST_REPORT.md`, `metadata.json`) are committed. Raw OTel telemetry (`raw-telemetry.jsonl`) containing PII is excluded via `.gitignore` and stored privately.

# API and ABI breaking changes

None. All changes are in `tools/llm-sequential-upgrade/`. No production code, library, or SDK changes.

# Expected complexity level and risk

**1 - Trivial.** Self-contained benchmarking tooling and data. No interaction with production code.

# Testing

- [x] L1-L12 upgrades completed on all 4 apps (2 backends x 2 runs) with OTel cost capture
- [x] All levels manually graded after each upgrade; bugs filed and fixed via the harness
- [x] Methodology refinement between runs validated (domain bias removal, feature-neutral instructions)
- [x] Stress benchmarks run across both runs x 3 tiers (as-shipped, rate-limit-removed, optimized)
- [x] Optimized benchmarks verified to preserve all original features
- [x] Sensitive data (PII in raw telemetry) removed from repo and gitignored
- [ ] Reviewer: spot-check that METRICS_DATA.json / METRICS_REPORT.json numbers match the telemetry cost-summary.json files

---------

Co-authored-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
# LLM Sequential Upgrade Benchmark
Automated benchmark harness for measuring AI app-generation cost, bug rate, and code size across backends. Designed to produce directly comparable data for the same app built on different stacks.
Results viewer: https://spacetimedb.com/llms-benchmark-sequential-upgrade
Generated test data (app source, telemetry, cost summaries): https://github.com/clockworklabs/spacetimedb-ai-test-results
## What this measures
For each backend under test, the harness drives headless Claude Code sessions through the following workflow:
- Generate a chat app from the L1 feature spec
- Upgrade through L2-L12 one feature group at a time
- After each level, a human grades the app against the feature spec
- Bugs are filed as `BUG_REPORT.md` and fixed via a separate Claude Code session
- All API costs are captured via OpenTelemetry and written to per-session cost summaries
Side-by-side results give a direct comparison of AI-generation cost across backends for the same functional target.
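As a rough illustration of that side-by-side comparison, the sketch below sums per-session costs for each backend. The record shape (`backend`, `kind`, `costUsd`) and the sample values are assumptions made for this example; they are not the actual `cost-summary.json` schema or real benchmark data.

```javascript
// Hypothetical per-session records; field names and values are illustrative only.
const sessions = [
  { backend: "spacetime", kind: "generate", costUsd: 1.25 },
  { backend: "spacetime", kind: "upgrade", costUsd: 0.75 },
  { backend: "postgres", kind: "generate", costUsd: 1.5 },
  { backend: "postgres", kind: "upgrade", costUsd: 1.0 },
  { backend: "postgres", kind: "fix", costUsd: 0.5 },
];

// Sum session costs per backend, accumulating in integer cents
// so that repeated floating-point addition does not drift.
function totalCostByBackend(records) {
  const cents = {};
  for (const { backend, costUsd } of records) {
    cents[backend] = (cents[backend] ?? 0) + Math.round(costUsd * 100);
  }
  return Object.fromEntries(
    Object.entries(cents).map(([backend, c]) => [backend, c / 100])
  );
}

const totals = totalCostByBackend(sessions);
console.log(totals);
```

Generation, upgrade, and fix sessions all count toward a backend's total, so a stack that needs more fix sessions accumulates more cost for the same feature set.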
## Directory contents

- `run.sh`: orchestrates generation, upgrade, and fix sessions. Supports `--upgrade`, `--fix`, `--composed-prompt`, `--resume-session`.
- `grade.sh` / `grade-agents.sh` / `grade-playwright.sh`: grading harnesses (manual + automated)
- `benchmark.sh` / `run-loop.sh`: batch runners for parallel or sequential benchmark execution
- `cleanup.sh` / `reset-app.sh`: dev utilities
- `benchmark-viewer.html`: local viewer for `METRICS_DATA.json` files (open in browser, drop JSON)
- `generate-report.mjs`: aggregate per-session `cost-summary.json` into a markdown report
- `parse-telemetry.mjs`: parse OTel log stream into per-session `cost-summary.json`
- `parse-playwright-results.mjs`: convert Playwright JSON output to grading markdown
- `docker-compose.otel.yaml` / `otel-collector-config.yaml`: OTel collector + PostgreSQL
- `backends/`: per-backend setup / SDK reference documents given to the AI
- `perf-benchmark/`: runtime throughput benchmark (msgs/sec) for the AI-generated apps
- `CLAUDE.md` / `DEVELOP.md` / `GRADING.md` / `GRADING_WORKFLOW.md`: process documentation
## Running a benchmark
```bash
# Prereqs: Claude CLI installed, Docker running, SpacetimeDB installed
docker compose -f docker-compose.otel.yaml up -d

# Generate L1 from scratch
./run.sh --backend spacetime --level 1
./run.sh --backend postgres --level 1

# Upgrade through levels
./run.sh --upgrade <app-dir> --level 2 --composed-prompt
# ... continue through L12

# Fix bugs found during grading
./run.sh --fix <app-dir> --level N
```
Generated apps and telemetry land in `sequential-upgrade/sequential-upgrade-<timestamp>/` locally. For published test data from canonical runs, see the [AI Test Results repo](https://github.com/clockworklabs/spacetimedb-ai-test-results).
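The upgrade step repeats once per level from L2 through L12. A minimal sketch of listing those invocations is shown below. It assumes every level takes the same flags as the level-2 example above, which this README does not state explicitly, and `upgradeCommands` is a hypothetical helper rather than part of the harness.

```javascript
// Build the run.sh invocations for upgrading one app from level 2 to level 12.
// Assumption: every level uses the same flags as the documented level-2 command.
function upgradeCommands(appDir, firstLevel = 2, lastLevel = 12) {
  const commands = [];
  for (let level = firstLevel; level <= lastLevel; level++) {
    commands.push(`./run.sh --upgrade ${appDir} --level ${level} --composed-prompt`);
  }
  return commands;
}

// "<app-dir>" is the placeholder used in the commands above; substitute a real path.
const commands = upgradeCommands("<app-dir>");
console.log(commands.join("\n"));
```

Because a human grades each level before the next upgrade, the list is meant to be stepped through one command at a time, with any `--fix` sessions run in between, rather than piped into a shell in one go.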
## Performance benchmark

`perf-benchmark/` contains a runtime stress tool that fires concurrent writers against a generated app's `send_message` handler to measure sustained throughput in messages/sec. See `perf-benchmark/README.md` for usage.
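For reference, sustained throughput is completed messages divided by the length of the measurement window. The sketch below shows that arithmetic for several concurrent writers. The function name, input shape, and sample numbers are assumptions for illustration and are not taken from `perf-benchmark/`.

```javascript
// Compute sustained throughput from per-writer completion counts.
// completedPerWriter: number of acknowledged send_message calls for each writer.
// windowMs: length of the measurement window in milliseconds.
function messagesPerSecond(completedPerWriter, windowMs) {
  if (windowMs <= 0) {
    throw new RangeError("windowMs must be positive");
  }
  const completed = completedPerWriter.reduce((sum, n) => sum + n, 0);
  return completed / (windowMs / 1000);
}

// Illustrative values only: 4 writers over a 10-second window.
const rate = messagesPerSecond([2500, 2400, 2600, 2500], 10000);
console.log(`${rate} msgs/sec`);
```

Counting only acknowledged calls keeps the figure tied to work the backend actually finished, not to requests that were merely issued.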