SpacetimeDB

mirror of https://github.com/clockworklabs/SpacetimeDB.git synced 2026-06-27 16:30:35 -04:00

Files

T

History

bradleyshep fb0a458d4f LLM Benchmark: Sequential Upgrades Test (#4817 )

# Description of Changes

AI app generation benchmark comparing SpacetimeDB vs PostgreSQL (Express
+ Socket.io + Drizzle ORM). Same AI model (Claude Sonnet 4.6), same
prompts, same chat app, two backends. Upgraded through 12 feature
levels, manually graded at each level, bugs fixed, all costs measured
via OpenTelemetry.

Results viewable at:
https://spacetimedb.com/llms-benchmark-sequential-upgrade

## Benchmark harness (`tools/llm-sequential-upgrade/`)

- `run.sh`: orchestrates headless Claude Code sessions for code
generation, sequential upgrades, and bug fixes. Tracks all API costs via
OTel. Supports `--upgrade`, `--fix`, `--composed-prompt`,
`--resume-session` modes.
- `grade.sh` / `grade-agents.sh`: grading harnesses for manual testing
of generated apps.
- `docker-compose.otel.yaml`: OTel collector + PostgreSQL services.
- `generate-report.mjs` / `parse-telemetry.mjs`: aggregate per-session
telemetry into cost reports.
- Backend guidelines in `backends/`: SpacetimeDB SDK reference, config
templates, server setup docs, PostgreSQL setup with Drizzle/Socket.io
guidance.
**After https://github.com/clockworklabs/SpacetimeDB/pull/4740 merges,
we will likely want to update this so that it reads backend and SDK
guidance from SKILLS**

## Two complete benchmark runs

**Run 1 (20260403):** Original methodology.
**Run 2 (20260406):** Refined methodology with domain bias removed from
SpacetimeDB SDK docs and PostgreSQL instructions made
feature-spec-neutral.
**Note: no meaningful changes in results were observed with these
changes. Domain familiarity biases were very small and almost certainly
not the cause of STDB's major gains over PG stack.**

Each run contains full L1-L12 app source for both backends, level
snapshots preserving state before each upgrade, and per-session OTel
cost summaries.

## 12 feature levels

| Level | Feature |
|---|---|
| L1 | Basic Chat + Typing + Read Receipts + Unread Counts |
| L2 | Scheduled Messages |
| L3 | Ephemeral Messages |
| L4 | Message Reactions |
| L5 | Message Editing with History |
| L6 | Real-Time Permissions (kick, ban, promote) |
| L7 | Rich User Presence |
| L8 | Message Threading |
| L9 | Private Rooms + Direct Messages |
| L10 | Room Activity Indicators |
| L11 | Draft Sync |
| L12 | Anonymous to Registered Migration |

## Results

| | Run 1 (20260403) | Run 2 (20260406) |
|---|---|---|
| **SpacetimeDB total cost** | $13.33 | $12.62 |
| **PostgreSQL total cost** | $17.80 | $19.68 |
| **SpacetimeDB bugs** | 5 | 2 |
| **PostgreSQL bugs** | 19 | 8 |
| **SpacetimeDB fix sessions** | 4 | 1 |
| **PostgreSQL fix sessions** | 17 | 10 |

Both runs agree: SpacetimeDB apps are cheaper to build, have fewer bugs,
and require fewer fix iterations. The refined methodology (Run 2)
widened the cost gap and **confirmed the advantage is structural, not an
artifact of domain-biased SDK docs.**

## Performance benchmark (`perf-benchmark/`)

Stress throughput tool that fires concurrent writers at peak saturation
against the AI-generated `send_message` handlers.

| Tier | SpacetimeDB (avg) | PostgreSQL (avg) | Ratio |
|---|---|---|---|
| AI-generated (as-shipped) | 5,267 msgs/sec | 694 msgs/sec | 7.6x |
| PG rate limit removed | 5,267 msgs/sec | 1,070 msgs/sec | 4.9x |
| Optimized (same features kept) | 25,278 msgs/sec | 1,139 msgs/sec |
22x |

The gap widens with optimization because SpacetimeDB's bottleneck is
fixable code patterns in the reducer while PostgreSQL's bottleneck is
architectural (sequential network round-trips to an external database).

Optimized reference code with all features preserved is in
`perf-benchmark/results/optimized-reference/`.

## Data handling

Per-session cost summaries (`cost-summary.json`, `COST_REPORT.md`,
`metadata.json`) are committed. Raw OTel telemetry
(`raw-telemetry.jsonl`) containing PII is excluded via `.gitignore` and
stored privately.

# API and ABI breaking changes

None. All changes are in `tools/llm-sequential-upgrade/`. No production
code, library, or SDK changes.

# Expected complexity level and risk

**1 - Trivial.** Self-contained benchmarking tooling and data. No
interaction with production code.

# Testing

- [x] L1-L12 upgrades completed on all 4 apps (2 backends x 2 runs) with
OTel cost capture
- [x] All levels manually graded after each upgrade; bugs filed and
fixed via the harness
- [x] Methodology refinement between runs validated (domain bias
removal, feature-neutral instructions)
- [x] Stress benchmarks run across both runs x 3 tiers (as-shipped,
rate-limit-removed, optimized)
- [x] Optimized benchmarks verified to preserve all original features
- [x] Sensitive data (PII in raw telemetry) removed from repo and
gitignored
- [ ] Reviewer: spot-check that METRICS_DATA.json / METRICS_REPORT.json
numbers match the telemetry cost-summary.json files

---------

Co-authored-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>

2026-06-10 16:37:33 +00:00

postgres.md

LLM Benchmark: Sequential Upgrades Test (#4817 )

2026-06-10 16:37:33 +00:00

spacetime-sdk-rules.md

LLM Benchmark: Sequential Upgrades Test (#4817 )

2026-06-10 16:37:33 +00:00

spacetime-templates.md

LLM Benchmark: Sequential Upgrades Test (#4817 )

2026-06-10 16:37:33 +00:00

spacetime.md

LLM Benchmark: Sequential Upgrades Test (#4817 )

2026-06-10 16:37:33 +00:00