Prior to this commit, `to_mem_table_with_op_type` would call `Header::find_pos_by_name`
to locate the `__op_type` column, if it existed.
This was slow, as `Header::find_pos_by_name` is a linear scan in increasing order,
and the `__op_type` column is always either last or not present,
so `to_mem_table_with_op_type` would traverse every column in the `Header`.
This happened during every incremental evaluation for every query.
With this commit, we rely on the fact that `__op_type` is always the last column if present,
and check only the last column of the header.
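A minimal sketch of the difference, using a hypothetical simplified `Header` (the real type stores typed columns, not just names):

```rust
const OP_TYPE_COL: &str = "__op_type";

struct Header {
    column_names: Vec<String>,
}

impl Header {
    // Old approach: O(n) linear scan over every column.
    fn find_pos_by_name(&self, name: &str) -> Option<usize> {
        self.column_names.iter().position(|c| c == name)
    }

    // New approach: `__op_type`, if present, is always last,
    // so an O(1) check of the final column suffices.
    fn op_type_pos(&self) -> Option<usize> {
        let last = self.column_names.len().checked_sub(1)?;
        (self.column_names[last] == OP_TYPE_COL).then_some(last)
    }
}

fn main() {
    let h = Header {
        column_names: vec!["id".into(), "name".into(), OP_TYPE_COL.into()],
    };
    // Both approaches agree; the new one never walks the whole header.
    assert_eq!(h.find_pos_by_name(OP_TYPE_COL), h.op_type_pos());
}
```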
* Single-table subscription queries: plan once, run repeatedly
Prior to this commit, every call to `ExecutionUnit::eval_incr`
re-invoked the query planner to convert its `eval` plan
into a plan suitable for reading from a `MemTable` of updates.
With this commit, specifically for single-table select queries,
we invoke the query planner once during `ExecutionUnit::new`,
and store the resulting `eval_incr_plan` for repeated use.
A follow-up will do the same for multi-table semijoins.
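The shape of the change, as a hypothetical sketch (strings stand in for the real plan types):

```rust
struct ExecutionUnit {
    eval_plan: String,      // plan for evaluating against committed state
    eval_incr_plan: String, // precomputed plan over the MemTable of updates
}

impl ExecutionUnit {
    fn new(eval_plan: String) -> Self {
        // Planning happens exactly once, at construction time.
        let eval_incr_plan = Self::plan_for_updates(&eval_plan);
        ExecutionUnit { eval_plan, eval_incr_plan }
    }

    // Stand-in for the real planner invocation.
    fn plan_for_updates(plan: &str) -> String {
        format!("{plan} READING FROM updates")
    }

    // `eval_incr` now just executes the stored plan.
    fn eval_incr(&self) -> &str {
        &self.eval_incr_plan
    }
}

fn main() {
    let unit = ExecutionUnit::new("SELECT * FROM t".to_string());
    assert_eq!(unit.eval_incr(), "SELECT * FROM t READING FROM updates");
    let _ = &unit.eval_plan;
}
```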
* Docstrings for `SourceSet::len` and `::is_empty`
Co-authored-by: Mazdak Farrokhzad <twingoow@gmail.com>
Signed-off-by: Phoebe Goldman <phoebe@goldman-tribe.org>
* Compile select queries all the way to `QueryCode` ahead of time
---------
Signed-off-by: Phoebe Goldman <phoebe@goldman-tribe.org>
Co-authored-by: Mazdak Farrokhzad <twingoow@gmail.com>
Closes #747.
Before this change,
we would evaluate each and every query,
for each and every subscription,
on each and every row update.
If N subscriptions had a query Q in common,
it would be evaluated N different times.
With this change,
distinct queries are evaluated once,
and the results copied for each client.
So in the example above, Q would be evaluated once,
with the results transmitted to N different clients.
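The idea, as a hypothetical sketch (function and types are illustrative, not the real ones):

```rust
use std::collections::HashMap;

// Invert subscriptions into query -> clients, evaluate each distinct
// query once, then copy the results for every subscribed client.
fn eval_once_per_query(
    subscriptions: &[(String, u32)], // (query, client_id) pairs
    eval: impl Fn(&str) -> Vec<String>,
) -> HashMap<u32, Vec<String>> {
    let mut by_query: HashMap<&str, Vec<u32>> = HashMap::new();
    for (query, client) in subscriptions {
        by_query.entry(query.as_str()).or_default().push(*client);
    }
    let mut results: HashMap<u32, Vec<String>> = HashMap::new();
    for (query, clients) in by_query {
        let rows = eval(query); // evaluated once per distinct query
        for client in clients {
            results.insert(client, rows.clone()); // copied per client
        }
    }
    results
}

fn main() {
    // Two clients share query "Q"; it is evaluated only once.
    let subs = vec![("Q".to_string(), 1), ("Q".to_string(), 2)];
    let out = eval_once_per_query(&subs, |_| vec!["row".to_string()]);
    assert_eq!(out.len(), 2);
}
```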
* First step towards removing `Table` from query plans
Rework `SourceExpr` to be a logical placeholder for a table,
rather than a table itself.
Make query eval functions take an additional argument, a set of tables.
When evaluating a `SourceExpr` to a table, they will treat the `SourceExpr`
as a reference into the set of sources, and use the referred table.
This commit modifies only the VM crate; modifications to `core` are forthcoming.
It's possible that this commit's scheme for referring to `SourceExpr`s
will need to change,
as currently it forbids duplicate `SourceExpr`s,
which I think might occur during index joins.
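A hypothetical sketch of the indirection (simplified stand-ins for the real types): a `SourceExpr` names a slot in a `SourceSet` instead of owning a table.

```rust
#[derive(Clone, Copy)]
struct SourceId(usize);

struct SourceExpr {
    source_id: SourceId,
}

struct Table {
    rows: Vec<i64>, // stand-in for real row data
}

struct SourceSet {
    tables: Vec<Option<Table>>,
}

impl SourceSet {
    // Resolving a `SourceExpr` takes the table out of its slot, which is
    // why duplicate `SourceExpr`s for the same slot are forbidden: the
    // second resolution would find the slot empty.
    fn take(&mut self, expr: &SourceExpr) -> Option<Table> {
        self.tables.get_mut(expr.source_id.0)?.take()
    }
}

fn main() {
    let mut sources = SourceSet { tables: vec![Some(Table { rows: vec![1, 2] })] };
    let expr = SourceExpr { source_id: SourceId(0) };
    assert!(sources.take(&expr).is_some());
    assert!(sources.take(&expr).is_none()); // slot already consumed
}
```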
* `SourceBuilder` interface for assigning `SourceId`s.
* Integrate new query plan repr into `core`
* Put `DbTable` back in the AST; only `MemTable` is separate
Per review from Joshua, this commit makes `SourceExpr` into an enum
similar to the previous definition, with a `DbTable(DbTable)` variant.
Indirection to a `SourceSet` is imposed only for the `MemTable` variant.
This should make the PR's overall diff much simpler
(assuming I haven't inadvertently made any changes
in the process of reverting the `DbTable` code paths).
Related to the above, this PR simplifies `SourceSet`.
`SourceSet` now holds a `Vec<Option<MemTable>>`, where previously
it was a transparent newtype around `[Option<Table>]`.
This change eliminates the need for unsafe unsized conversions,
removes `SourceBuilder`,
and causes `SourceSet` to be uniformly consumed by the high-level query eval operators,
where previously `SourceSet`s had to be semi-reusable
because they could contain `DbTable`s.
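The revised shape, as a hypothetical sketch: `DbTable` stays inline in the AST, and only `MemTable`s go through the `SourceSet` indirection.

```rust
struct DbTable {
    table_id: u32, // stand-in for the real type
}

struct MemTable {
    rows: Vec<i64>, // stand-in for the real type
}

enum SourceExpr {
    DbTable(DbTable),
    MemTable { source_id: usize }, // index into a SourceSet
}

struct SourceSet(Vec<Option<MemTable>>);

impl SourceSet {
    // MemTables are consumed by evaluation, so a SourceSet is used
    // exactly once rather than being semi-reusable.
    fn take_mem_table(&mut self, source_id: usize) -> Option<MemTable> {
        self.0.get_mut(source_id)?.take()
    }
}

fn main() {
    let mut sources = SourceSet(vec![Some(MemTable { rows: vec![1] })]);
    let expr = SourceExpr::MemTable { source_id: 0 };
    if let SourceExpr::MemTable { source_id } = expr {
        assert!(sources.take_mem_table(source_id).is_some());
        assert!(sources.take_mem_table(source_id).is_none());
    }
    // DbTables need no SourceSet slot at all.
    let _db = SourceExpr::DbTable(DbTable { table_id: 7 });
}
```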
* Per Mazdak's review, docs!
* Add an index selector that takes multi-column indexes into account (and improve the `query!` macro)
* move select_best_index to vm/src/expr.rs; get rid of OpCmpIdx
* refactor test best_index
* simplify best_index* tests more
* create_table_multi_index: use ColListBuilder
* move & simplify create_table_multi_index
* simplify assert_index_scan + uses
* remove create_table, twas dead code
* ColumnOpFlat: use SmallVec instead
* simplify ScanIndex
* simplify best_index_range
* Add test for sql + joins + multi-index and fix invalid ambiguous field error
* slightly refactor select_best_index
* remove nonempty dependency
* Add test that actually runs the multi-column sql
* Add benchmark for multi-column vs many indexes
* simplify create_table_for_test*
* Add comments
* impl new algo for select_best_index + clone less
* improve select_best_index docs
* ScanIndex -> ScanOrIndex
* simplify is_sargable + use smallvec more
* let make_index handle a single ScanOrIndex
* make index stuff more private + remove dead code
* select_best_index: return IndexColumnOp directly; nix ScanOrIndex -- this removes an allocation
* do not reconstruct scan argument; avoid heap allocations
* borrow ColList in IndexArgument + avoid temp alloc in is_sargable
* optimize_select: remove Cow from fields_found
* is_sargable: reuse allocation from extract_fields
* rename is_sargable, avoid temp fields_found allocs, simplify optimize_select
* fix subscription benches
* drive-by refactor benches/subscription
* Keep a single benchmark for location
* Squashed commit of the following:
commit e54b09bab2
Author: Mario Montoya <mamcx@elmalabarista.com>
Date: Thu Feb 29 20:19:24 2024 -0500
Correctly show the error for AmbiguousField and simplify the code (#910)
commit 48a205a818
Author: Kim Altintop <kim@eagain.io>
Date: Thu Feb 29 19:32:21 2024 +0100
core: Fix host controller to not replace module if lifecycle hooks failed (#904)
* core: Fix host controller to not replace module if lifecycle hooks failed
Previously, `spawn_module_host` would unconditionally insert the new
module into the controller state, and not remove it if the lifecycle
hooks (`init_database` / `update_database`) returned an error.
This would mean that the module code was replaced with the new one, even
if it should be rejected because the schema was not updated or the init
/ update reducer failed.
Fix this by starting the module, and later "committing" it to the
controller state in two phases.
* Add commentary about database mutations / transactionality
---------
Signed-off-by: Mario Montoya <mamcx@elmalabarista.com>
Co-authored-by: Mazdak Farrokhzad <twingoow@gmail.com>
* core: Fix host controller to not replace module if lifecycle hooks failed
Previously, `spawn_module_host` would unconditionally insert the new
module into the controller state, and not remove it if the lifecycle
hooks (`init_database` / `update_database`) returned an error.
This would mean that the module code was replaced with the new one, even
if it should be rejected because the schema was not updated or the init
/ update reducer failed.
Fix this by starting the module, and later "committing" it to the
controller state in two phases.
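A hypothetical sketch of the two-phase fix (simplified stand-ins for the real host controller types): run the lifecycle hooks first, and only commit the module into controller state on success.

```rust
use std::collections::HashMap;

struct Module {
    name: String,
    code_version: u32,
}

struct HostController {
    modules: HashMap<String, Module>,
}

impl HostController {
    fn spawn_module_host(
        &mut self,
        module: Module,
        run_lifecycle_hooks: impl Fn(&Module) -> Result<(), String>,
    ) -> Result<(), String> {
        // Phase 1: start the module and run init/update hooks without
        // touching controller state.
        run_lifecycle_hooks(&module)?;
        // Phase 2: commit. On hook failure we never get here, so any
        // previously-registered module is left in place.
        self.modules.insert(module.name.clone(), module);
        Ok(())
    }
}

fn main() {
    let mut hc = HostController { modules: HashMap::new() };
    hc.modules.insert("db".into(), Module { name: "db".into(), code_version: 1 });
    let bad = Module { name: "db".into(), code_version: 2 };
    assert!(hc.spawn_module_host(bad, |_| Err("init failed".into())).is_err());
    assert_eq!(hc.modules["db"].code_version, 1); // old module retained
}
```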
* Add commentary about database mutations / transactionality
* perf(832): Remove redundant row deduplication in subscriptions
Closes #832.
The database already operates under set semantics,
so unless multiple queries return rows from the same table,
deduplication of the result set is not necessary.
* Rip out all deduplication
---------
Co-authored-by: Noa <coolreader18@gmail.com>
Re https://github.com/clockworklabs/SpacetimeDB/pull/840 .
We're removing the `row_pk` from the WebSocket API `TableRowOperation`,
as computing it has a major performance impact on the server.
This commit removes references to it from the WebSocket API reference.
Re: https://github.com/clockworklabs/SpacetimeDB/pull/840
This commit updates the TypeScript SDK to no longer use the `row_pk` field
in the client API,
as that field no longer exists.
As in the other SDKs, we replace our use of the `row_pk`
with the serialized representation of the row,
as this saves our needing to have objects/dicts/hash-maps keyed on domain types.
Unlike the other SDKs, we support either the binary (protobuf) or JSON APIs.
When using the binary API, we convert the BSATN row to a string,
and use that as the `rowPk`.
When using the JSON API, we `JSON.stringify` the row itself, and use that as the `rowPk`.
The latter is ugly and not performant,
but we don't care because the JSON API is slow anyways.
This commit also removes some uses of `any` from the deserialization code,
because I wanted the compiler to double-check my work.
Re: https://github.com/clockworklabs/SpacetimeDB/pull/840
This commit updates the C# SDK to no longer use the `row_pk` field
of the Protobuf client API,
as that field has been removed. (Will have been removed, as of merging.)
Where a table cache was previously keyed on `byte[] rowPk`,
it is now keyed on `byte[] rowBytes`,
where `rowBytes` is the BSATN-encoded bytes of the row.
This means we effectively store two copies of each row in the client cache:
the BSATN serialized format, and the decoded domain type.
An alternate implementation would be to make the table caches be sets of domain types,
discarding the BSATN bytes.
We find this undesirable for several reasons:
- Hashing and equality-comparing `byte[]` is almost certainly more efficient
than doing the same for domain types.
- Even if hashing and equality-comparing domain types were efficient,
we would still have to update codegen to emit hashing and equality methods
for all types in the module_bindings.
This implementation requires no changes to the module_bindings.
- We already have the BSATN bytes sitting around,
as they're necessarily part of the message we receive from the server.
This change does no additional serialization or deserialization.
In essence, we're trading memory for time and simplicity.
Keeping the BSATN bytes live approximately doubles the table cache's memory usage,
but simplifies the implementation greatly,
and (we suspect) speeds up table cache insertions, deletions and lookups.
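The scheme, sketched in Rust for brevity (`Row` and `TableCache` are illustrative stand-ins for the generated C# types):

```rust
use std::collections::HashMap;

// Key the cache on the serialized row bytes, keeping both the encoded
// and decoded forms of each row.
#[derive(Clone, Debug, PartialEq)]
struct Row {
    id: u32,
    name: String, // stand-in domain type; needs no Hash/Eq impl
}

struct TableCache {
    // Key: BSATN-encoded row bytes; value: decoded domain value.
    rows: HashMap<Vec<u8>, Row>,
}

impl TableCache {
    fn insert(&mut self, row_bytes: Vec<u8>, row: Row) {
        // Hashing a byte buffer is cheap and uniform across row types.
        self.rows.insert(row_bytes, row);
    }

    fn delete(&mut self, row_bytes: &[u8]) -> Option<Row> {
        self.rows.remove(row_bytes)
    }
}

fn main() {
    let mut cache = TableCache { rows: HashMap::new() };
    let bytes = vec![1, 0, 0, 0]; // pretend BSATN encoding of the row
    cache.insert(bytes.clone(), Row { id: 1, name: "a".into() });
    assert!(cache.delete(&bytes).is_some());
}
```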
* Remove `row_pk` from client API; hash rows not row_ids on client
This commit removes the `row_pk` field from `message TableRowOperation` in the client API.
This is the beginning of addressing issue #831.
This has several implications for clients:
- Because we no longer have a stable content-addressed ID for each row,
the client is responsible for hashing rows itself.
- This means that generated types must be `Hash + Eq`.
- This means that generated types must use `sats::F32` or `sats::F64`
rather than `f32` or `f64`,
as the former are `Eq + Hash` while the latter are not.
(Aside: it's stupid that Rust floats are not `Eq + Ord + Hash`,
since IEEE-754 defines a total equality and total ordering on floats.)
Exposing `sats::F32` to users is ugly, but I can't think of a better design.
This commit also makes some minor formatting changes to Rust codegen,
since it was emitting code that rustc warned about.
Still outstanding:
- [ ] Remove row_pk from the JSON API.
- [ ] Avoid computing row_pk in the subscription engine.
- [ ] Update other SDKs.
- [ ] C#
- [ ] TypeScript
- [ ] Python
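Why plain `f32` can't key a hash map, sketched below: it is neither `Eq` nor `Hash` because `NaN != NaN`. A wrapper along the lines of `sats::F32` (reconstructed here from the description above; the real impl may differ) can compare and hash the raw bits instead.

```rust
#[derive(Clone, Copy, Debug)]
struct F32(f32);

impl PartialEq for F32 {
    fn eq(&self, other: &Self) -> bool {
        // Bitwise comparison is total, so NaN == NaN here.
        self.0.to_bits() == other.0.to_bits()
    }
}
impl Eq for F32 {}

impl std::hash::Hash for F32 {
    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
        self.0.to_bits().hash(state);
    }
}

fn main() {
    assert!(f32::NAN != f32::NAN);           // why f32 can't derive Eq
    assert!(F32(f32::NAN) == F32(f32::NAN)); // bitwise equality is total
}
```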
* Combine similar patterns
Co-authored-by: Mazdak Farrokhzad <twingoow@gmail.com>
Signed-off-by: Phoebe Goldman <phoebe@goldman-tribe.org>
* Remove `row_pk` from JSON messages
* fmt
* SDK: ClientCache keyed on BSATN bytes, not domain types
Per Ingvar's suggestion, this commit makes the client cache
be `HashMap<Vec<u8>, T>`, where the key is a BSATN byte-buffer
containing the serialized representation of the value.
This means that generated structs and enums don't need to be `Hash + Eq`.
The key upside here is that we can revert to using `f32`/`f64`
rather than `sats::F32`/`sats::F64`.
---------
Signed-off-by: Phoebe Goldman <phoebe@goldman-tribe.org>
Co-authored-by: Mazdak Farrokhzad <twingoow@gmail.com>
Closes #890.
This changes the magic constant from 1000 to 3000 rows,
which means that if the indexed table in an index join has 3000 rows or fewer,
we make it the probe table instead.
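The heuristic in sketch form (names are illustrative, not the real ones):

```rust
// Below the row-count threshold, the indexed side of an index join is
// used as the probe table.
const PROBE_TABLE_THRESHOLD: u64 = 3000; // was 1000 before this change

fn use_indexed_side_as_probe(indexed_table_rows: u64) -> bool {
    indexed_table_rows <= PROBE_TABLE_THRESHOLD
}

fn main() {
    assert!(use_indexed_side_as_probe(3000));
    assert!(!use_indexed_side_as_probe(3001));
}
```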
Prior to this commit, we ran our Rayon threads
within the Tokio blocking threadpool.
This was bad for several reasons, of which the most obvious was
that it ignored the Rayon `thread_name`.
Our Rayon threads still need access to the Tokio runtime
(which I'm not super jazzed about; see comment),
so this commit adds a custom `spawn_handler`
which behaves like the Rayon default spawn handler,
but also enters the Tokio runtime.
This commit also adds a comment about how we're over-allocating our available parallelism,
and in the future we should not do that.
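A stdlib-only sketch of the idea behind a custom spawn handler: spawn each worker ourselves so we control the thread name and can enter a runtime context before running the pool's work loop. The real code uses rayon's `ThreadPoolBuilder::spawn_handler` and Tokio's `Handle::enter`; `enter_runtime` and `work` are stand-ins here.

```rust
use std::thread;

fn spawn_worker(
    enter_runtime: impl FnOnce() + Send + 'static,
    work: impl FnOnce() + Send + 'static,
) -> thread::JoinHandle<()> {
    thread::Builder::new()
        // One shared name for all workers, so tools like `perf` can
        // merge their stacks into a single backtrace.
        .name("rayon-worker".to_string())
        .spawn(move || {
            enter_runtime(); // stand-in for tokio's Handle::enter()
            work();          // stand-in for rayon's ThreadBuilder::run()
        })
        .expect("failed to spawn worker thread")
}

fn main() {
    let handle = spawn_worker(
        || { /* enter the tokio runtime here */ },
        || assert_eq!(thread::current().name(), Some("rayon-worker")),
    );
    handle.join().unwrap();
}
```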
* Script to run perf against SpacetimeDB
* Non-controversial script improvements
* No args is fine
---------
Co-authored-by: Boppy <no-reply@boppygames.gg>
- Move it and friends from sats to vm.
- MemTable now stores a Vec<PV>.
- Other related improvements.
Co-authored-by: Phoebe Goldman <phoebe@goldman-tribe.org>
* Replace DbProgram with a leaner version that compiles directly to queries without an environment
* Remove all the dead-code caused by removing the environment
* Addressing some PR comments
* Addressing some PR comments
* Lint
Set a name for Rayon threads, so that they don't inherit their thread names
from the parent, i.e. `tokio-runtime-w`.
Use the same name for all Rayon threads, i.e. ignore the thread-index,
so that tools like `perf` can merge all the Rayon threads into a single backtrace.
* optimize build_missing_tables; collect less + use read_col
* schema_for_table: don't go through PV
* table_id_from_name: use read_col
* relational_db tests: remove uses of to_product_value
* build_sequence_state: don't use to_pv
* build_indexes: don't use to_pv
* remove dead code: CommittedStateIter
* get_all_tables_tx: use read_col
* SystemTableQuery: don't use to_pv
* StModuleRow: don't use to_pv
* get rid of more to_product_value calls
* read_col / system_tables / mut_tx: cleanup + less work