mirror of
https://github.com/valkey-io/valkey.git
synced 2026-05-06 05:26:42 -04:00
Add design-docs folder and README. (#3300)
These are forked from the RFC instructions created by @zuiderkwast and @hpatro in https://github.com/valkey-io/valkey-rfc/pulls/1 and https://github.com/valkey-io/valkey-rfc/pulls/6. It also includes the Atomic Slot Migration design to bootstrap the folder. --------- Signed-off-by: Jacob Murphy <jkmurphy@google.com>
This commit is contained in:
@@ -0,0 +1,67 @@
|
||||
# Valkey Design Documents
|
||||
|
||||
This folder collects designs for Valkey. These designs detail features and
|
||||
changes that require more detail than just the text in a pull request or an
|
||||
issue.
|
||||
|
||||
A markdown file describes each feature or larger topic.
|
||||
|
||||
## Workflow
|
||||
|
||||
**IMPORTANT: Before writing a design, start an issue for some early alignment.**
|
||||
This is the first step to find interested parties and collect initial
|
||||
requirements. This issue serves as the overall tracking for the feature.
|
||||
|
||||
1. After finishing initial discussion on the issue, determine the necessity of a
|
||||
design document. Small features or changes do not require designs.
|
||||
Wide-reaching changes or those requiring major alignment make good candidates
|
||||
for a design.
|
||||
2. To create a design document, create a new markdown file in this `design-docs`
|
||||
directory.
|
||||
3. The maintainers review and approve the design document once authors address
|
||||
feedback. Submitting a design document does not necessarily bind to a certain
|
||||
design.
|
||||
4. The design document serves as living documentation. As developers build the
|
||||
feature, the design document captures key design aspects. Commit changes to
|
||||
the design document alongside the code changes as developers implement the
|
||||
mid-level design details.
|
||||
|
||||
## What's useful to include?
|
||||
|
||||
**Design documents do not follow a strict format.** There is no template.
|
||||
|
||||
When writing a design, consider that many people, including those unfamiliar
|
||||
with the feature, read it. It should be self-contained and easy to understand.
|
||||
|
||||
The following sections are optional but provide a high-level overview of
|
||||
potential inclusions:
|
||||
|
||||
- **High-level overview**: A brief summary of the feature and its purpose.
|
||||
- **Key design elements**: The following are generally useful to include:
|
||||
- State machines
|
||||
- Data structures
|
||||
- Algorithms
|
||||
- Interaction with other Valkey components (replication, persistence, cluster,
|
||||
modules, etc.)
|
||||
- **Links to key issues/PRs**: Link to relevant issues/PRs for further reading.
|
||||
- **Links to relevant code**: Link to relevant code files for further reading.
|
||||
|
||||
## What not to include?
|
||||
|
||||
Overdocumentation often leads to stale design documents. A design document is
|
||||
not the best place for the following:
|
||||
|
||||
- **API Details**: API details belong in the
|
||||
[public Valkey documentation](https://valkey.io/commands/)
|
||||
- **Low-level Implementation Details**: Document difficult or complex
|
||||
implementation details in code comments, not design documents.
|
||||
- **Edge Cases**: Prefer to link to test cases or code locations that cover edge
|
||||
cases, rather than documenting them in the design document.
|
||||
- **Alternatives/Rejected Designs**: Document decision making in the issue or PR
|
||||
where the contributors made the decision.
|
||||
- **Overly Verbose Explanations**: Aim to use Mermaid or ASCII diagrams to
|
||||
explain complex concepts rather than prose.
|
||||
- **Boilerplate**: Keep every document minimal and to the point. Avoid
|
||||
unnecessary sections.
|
||||
- **Future work**: File issues for future work items to track them, rather than
|
||||
including them in the design document.
|
||||
@@ -0,0 +1,184 @@
|
||||
# Design Document: Atomic Slot Migration
|
||||
|
||||
## 1. Overview
|
||||
|
||||
Atomic Slot Migration (ASM) provides a seamless, atomic method for migrating
|
||||
hash slots between nodes in a Valkey cluster. This mechanism replaces
|
||||
`CLUSTER SETSLOT IMPORTING/MIGRATING` and `MIGRATE` for migrating slots between
|
||||
nodes.
|
||||
|
||||
## 2. Core Mechanics
|
||||
|
||||
Rather than migrating data key-by-key, ASM operates at the slot level by
|
||||
adapting existing replication and failover primitives:
|
||||
|
||||
- **Slot-Based Replication:** Physical data migration borrows from
|
||||
Primary-Replica replication mechanisms, but strictly scopes to the specific
|
||||
slots being moved.
|
||||
- **Atomic Ownership Transfer:** ASM executes the final handover of slot
|
||||
ownership using a coordinated process similar to a Manual Failover, ensuring
|
||||
an atomic transfer.
|
||||
- **Traffic Handling:** Throughout the migration process, the source node
|
||||
retains the data and continues to actively serve business requests. The system
|
||||
cleanly cuts over traffic to the target node only after completing the atomic
|
||||
transfer.
|
||||
|
||||
## 3. Implementation Details
|
||||
|
||||
### 3.1 High Level Overview
|
||||
|
||||
1. **Snapshot Transfer:** The source node transfers data to the target node. The
|
||||
source node forks a child process to iterate and serialize the slot's keys.
|
||||
|
||||
The source transfers data in "AOF" (Append Only File) format to the target
|
||||
node. This format consists of a stream of commands. Consequently, the target
|
||||
primary and target replicas replay these commands to restore the slot's
|
||||
state.
|
||||
|
||||
2. **Incremental Updates:** While transferring the initial snapshot, the source
|
||||
node serves business requests. The source node records any changes to the
|
||||
slot's keys during this time and sends them to the target node as incremental
|
||||
updates after completing the snapshot transfer.
|
||||
3. **Pause:** After sending the incremental updates, the source node pauses
|
||||
writes to the migrating slots. Consequently, the source node rejects any
|
||||
further business requests for those slots. This pause ensures the target node
|
||||
maintains the same state as the source node.
|
||||
4. **Failover:** After the source node pauses, the target node performs a
|
||||
takeover and becomes the primary node for the slot.
|
||||
5. **Clean Up:** After the target node becomes the primary node for the slot,
|
||||
the source node receives this information via cluster topology updates. The
|
||||
source node then unpauses and completes the slot migration. Failed migrations
|
||||
on the target side are cleaned up by deleting keys that are no longer owned
|
||||
by the node.
|
||||
|
||||
### 3.2 CLUSTER SYNCSLOTS
|
||||
|
||||
The source, target, and target replica use the `CLUSTER SYNCSLOTS` command to
|
||||
coordinate the handover state:
|
||||
|
||||
```
|
||||
Source Target Target Replica
|
||||
| | |
|
||||
|------------ SYNCSLOTS ESTABLISH -------------->| |
|
||||
| |----- SYNCSLOTS ESTABLISH ------>|
|
||||
|<-------------------- +OK ----------------------| |
|
||||
| | |
|
||||
|---------------- SYNCSLOTS ACK ---------------->| |
|
||||
| | |
|
||||
|~~~~~~~~~~~~~~ snapshot as AOF ~~~~~~~~~~~~~~~~>| |
|
||||
| |~~~~~~ forward snapshot ~~~~~~~~>|
|
||||
|----------- SYNCSLOTS SNAPSHOT-EOF ------------>| |
|
||||
| | |
|
||||
|<----------- SYNCSLOTS REQUEST-PAUSE -----------| |
|
||||
| | |
|
||||
|~~~~~~~~~~~~ incremental changes ~~~~~~~~~~~~~~>| |
|
||||
| |~~~~~~ forward changes ~~~~~~~~~>|
|
||||
|--------------- SYNCSLOTS PAUSED -------------->| |
|
||||
| | |
|
||||
|<---------- SYNCSLOTS REQUEST-FAILOVER ---------| |
|
||||
| | |
|
||||
|---------- SYNCSLOTS FAILOVER-GRANTED --------->| |
|
||||
| | |
|
||||
| (performs takeover & |
|
||||
| propagates topology) |
|
||||
| | |
|
||||
| |------- SYNCSLOTS FINISH ------->|
|
||||
(finds out about topology | |
|
||||
change & marks migration done) | |
|
||||
| | |
|
||||
```
|
||||
|
||||
Throughout the migration, both the source and target nodes exchange periodic
|
||||
`SYNCSLOTS ACK` messages to monitor the health and progress of the operation.
|
||||
If a node fails to receive an acknowledgment within the replication timeout,
|
||||
the migration is aborted.
|
||||
|
||||
See code comments in [cluster_migrateslots.c](../src/cluster_migrateslots.c) for
|
||||
detailed state machines.
|
||||
|
||||
### 3.3 Automatic Rollback
|
||||
|
||||
Various scenarios cause slot migration failure:
|
||||
|
||||
1. Link between source and target disconnects
|
||||
2. Source or target node crash, halt, or encounter a partition
|
||||
3. A failover occurs on the source or target node
|
||||
4. Out of memory error occurs on the target node
|
||||
5. Client output buffer on the source node grows too large
|
||||
6. An administrator executes `FLUSHDB` on the source or target node
|
||||
|
||||
In such cases, ASM automatically rolls back the migration.
|
||||
|
||||
```
|
||||
Source Target Target Replica
|
||||
| | |
|
||||
|------------ SYNCSLOTS ESTABLISH -------------->| |
|
||||
| |----- SYNCSLOTS ESTABLISH ------>|
|
||||
|<-------------------- +OK ----------------------| |
|
||||
... ... ...
|
||||
| | |
|
||||
| <FAILURE> |
|
||||
| | |
|
||||
| (performs cleanup) |
|
||||
| | ~~~~~~ UNLINK <key> ... ~~~~~~~>|
|
||||
| | |
|
||||
| | ------ SYNCSLOTS FINISH ------->|
|
||||
| | |
|
||||
```
|
||||
|
||||
#### 3.3.1 Cleanup
|
||||
|
||||
The cluster automatically cleans up failed or cancelled slot migrations. The
|
||||
primary is solely responsible for cleaning up unowned slots. Primaries demoted
|
||||
during migration do not clean up previously active slot imports. The promoted
|
||||
replica is responsible for both cleaning up the slot and sending a
|
||||
`SYNCSLOTS FINISH`.
|
||||
|
||||
### 3.4 Key Containment
|
||||
|
||||
The system rejects any keyed command executed on a node that is not the primary
|
||||
for that slot with `-MOVED` (e.g. `GET`, `SET`, `DEL`, `INCR`, etc).
|
||||
|
||||
Nodes filter unkeyed read commands, like `SCAN` and `KEYS`, to avoid exposing
|
||||
importing slot data. Each node in the target shard tracks the slot migration job
|
||||
state and hides writes to that slot from the end user until the migration
|
||||
completes.
|
||||
|
||||
#### 3.4.1 Full Sync, Partial Sync, RDB
|
||||
|
||||
To ensure that replicas resyncing during an import remain aware of it, Valkey
|
||||
serializes each in-progress slot import into an RDB section defined by a new
|
||||
opcode. The encoding includes the job name and the slot ranges being imported.
|
||||
Whenever the system loads an RDB file containing a slot import section, whether
|
||||
from disk or during a primary sync, it adds a new migration to track the import.
|
||||
If the Valkey node becomes a primary after loading the RDB, it cancels the slot
|
||||
migration.
|
||||
|
||||
Failure to load the opcode results in consistency problems, so the opcode is
|
||||
mandatory. If the opcode is not recognized, the RDB load will fail.
|
||||
|
||||
Loading this tracking state on primaries ensures that replicas partially syncing
|
||||
to a restarted primary still get their `SYNCSLOTS FINISH` message in the
|
||||
replication stream.
|
||||
|
||||
#### 3.4.2 AOF
|
||||
|
||||
Valkey propagates the `ESTABLISH` and `FINISH` commands to the AOF, ensuring
|
||||
they replay properly on AOF load. Similar to RDB, if any pending `ESTABLISH`
|
||||
commands lack a subsequent `FINISH` upon becoming primary, the system fails them
|
||||
after loading.
|
||||
|
||||
## 4. External References
|
||||
|
||||
- **API & User Commands:**
|
||||
- [CLUSTER MIGRATESLOTS](https://valkey.io/commands/cluster-migrateslots/)
|
||||
- [CLUSTER GETSLOTMIGRATIONS](https://valkey.io/commands/cluster-getslotmigrations/)
|
||||
- [CLUSTER CANCELSLOTMIGRATIONS](https://valkey.io/commands/cluster-cancelslotmigrations/)
|
||||
- [CLUSTER SYNCSLOTS](https://valkey.io/commands/cluster-syncslots/)
|
||||
- **Corner Cases:** Read the test cases in
|
||||
[cluster-migrateslots.tcl](../tests/unit/cluster/cluster-migrateslots.tcl)
|
||||
- **PR References:** For further reading, refer to the following Pull Requests:
|
||||
- [PR #1949](https://github.com/valkey-io/valkey/pull/1949)
|
||||
- [PR #2755](https://github.com/valkey-io/valkey/pull/2755)
|
||||
- [PR #2593](https://github.com/valkey-io/valkey/pull/2593)
|
||||
- [PR #2635](https://github.com/valkey-io/valkey/pull/2635)
|
||||
Reference in New Issue
Block a user