Changes the commitlog (and durability) write API so that the caller
decides how many transactions go into a single commit and must supply
the transaction offsets.
This simplifies commitlog-side buffering logic to essentially a
`BufWriter` (which, of course, we must not forget to flush). This will
help throughput, but offers less opportunity to retry failed writes.
This is probably a good thing, as disks can fail in erratic ways, and we
would rather crash and re-verify the commitlog (suffix) than continue
writing.
To that end, this patch liberally raises panics when there is a chance
that internal state could be "poisoned" by partial writes, which may be
debatable.
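To make the shape of the new API concrete, here is a minimal sketch; the
`Commitlog` struct, the `append` signature, and the field names are
invented for illustration and do not match the actual crate:

```rust
use std::io::{BufWriter, Write};

/// Hypothetical shape of the caller-driven write API (illustrative only).
pub struct Commitlog<W: Write> {
    out: BufWriter<W>,
    /// Offset the next commit must start at.
    next_tx_offset: u64,
}

impl<W: Write> Commitlog<W> {
    /// Append one commit. The caller decides how many transactions it
    /// contains and supplies the offset of its first transaction.
    pub fn append(&mut self, min_tx_offset: u64, n_txs: u64, payload: &[u8]) {
        // Refuse non-contiguous offsets: the datastore orders commits,
        // the commitlog only verifies.
        assert_eq!(
            min_tx_offset, self.next_tx_offset,
            "non-contiguous commit offset"
        );
        // A partial write could poison internal state, so panic rather
        // than let the caller retry.
        self.out
            .write_all(payload)
            .unwrap_or_else(|e| panic!("commitlog write failed: {e}"));
        self.next_tx_offset += n_txs;
    }

    /// Buffering is just a `BufWriter`; durability requires an explicit flush.
    pub fn flush(&mut self) -> std::io::Result<()> {
        self.out.flush()
    }
}
```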
# Motivation
The main motivation is to avoid maintaining the transaction offset in
two places where the copies could diverge. As ordering commits is the
responsibility of the datastore, we make it authoritative on this
matter -- the commitlog will still check that offsets are contiguous,
and refuse to commit if they are not.
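To illustrate the datastore side of this, a rough sketch below shows the
datastore owning the offset, with the commitlog behind a stand-in
`Durability` trait (not the real trait) merely receiving it:

```rust
/// Stand-in for the durability interface; not the actual trait.
pub trait Durability {
    /// Append one commit of `n_txs` transactions starting at `min_tx_offset`.
    fn append(&mut self, min_tx_offset: u64, n_txs: u64, payload: &[u8]);
    fn flush(&mut self);
}

/// The datastore is the single source of truth for transaction offsets.
pub struct Datastore<D: Durability> {
    durability: D,
    next_tx_offset: u64,
}

impl<D: Durability> Datastore<D> {
    /// Commit a batch of already-serialized transactions as one commit.
    pub fn commit(&mut self, txs: &[Vec<u8>]) {
        let payload: Vec<u8> = txs.concat();
        self.durability
            .append(self.next_tx_offset, txs.len() as u64, &payload);
        // Only the datastore advances the offset; the commitlog merely
        // checks it for contiguity.
        self.next_tx_offset += txs.len() as u64;
        self.durability.flush();
    }
}
```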
A secondary, related motivation is the following:
A "commit" is an atomic unit of storage, meaning that a torn (partial)
write of a commit will render the entire commit corrupt. There hasn't
been a compelling case where we would want this, and have always
configured the server to write exactly one transaction per commit.
The code to handle buffering of transactions is, however, rather
complex, as it tries hard to allow the caller to retry writes at commit
boundaries. An unfortunate consequence is that we flush to the OS very
often, leaving throughput on the table.
So, if there is a compelling case for batching multiple transactions in
a commit, it should be the datastore's responsibility.
# API and ABI breaking changes
Breaks internal APIs only.
# Expected complexity level and risk
5 - Mostly for the risk
# Testing
Existing tests.
# Description of Changes
We've run into a problem on Maincloud caused by a database that was
writing a relatively small number of very large transactions. The
database had accrued many commitlog segments consuming hundreds of
gigabytes of disk, but had never taken a snapshot, nor compressed or
archived any data, as it had not progressed past one million
transactions.
With this PR, we take a snapshot every time the commitlog segment
rotates. We still also snapshot every million transactions.
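As a sketch of the new trigger condition (the constant and function
names are made up for illustration, not taken from the codebase):

```rust
const SNAPSHOT_TX_INTERVAL: u64 = 1_000_000;

/// Snapshot whenever the commitlog rotated to a new segment, and still
/// also every `SNAPSHOT_TX_INTERVAL` transactions.
fn should_snapshot(segment_rotated: bool, tx_offset: u64) -> bool {
    segment_rotated || (tx_offset > 0 && tx_offset % SNAPSHOT_TX_INTERVAL == 0)
}
```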
One BitCraft database we looked at had 2.5 million transactions per
commitlog segment, meaning that this change will not meaningfully affect
the frequency of snapshots. The offending Maincloud database, however,
had only 50 transactions per segment!
# API and ABI breaking changes
N/a
# Expected complexity level and risk
3: Hastily made changes to finicky code across several crates.
# Testing
I am unsure how to test these changes.