When read() returns 0 (EOF/connection closed) in syncRead(), errno is
not set by POSIX, so it retains a stale value (typically 0). This causes
callers using connGetLastError() to log strerror(0) which is the
misleading string "Success".
Set errno = ECONNRESET on EOF in syncRead(), matching the existing
pattern used for the timeout case (errno = ETIMEDOUT).
Also set conn->last_errno = errno in connSocketSyncWrite,
connSocketSyncRead, and connSocketSyncReadLine wrappers, matching the
pattern used by their async counterparts connSocketWrite and
connSocketRead.
After this fix, replica logs will show:
"I/O error reading bulk count from PRIMARY: Connection reset by peer"
instead of the misleading:
"I/O error reading bulk count from PRIMARY: Success"
---------
Signed-off-by: Abhishek Mathur <matshek@amazon.com>
Signed-off-by: djk1027 <djk9510271@gmail.com>
Co-authored-by: Abhishek Mathur <matshek@amazon.com>
Co-authored-by: Daejun Kim <djk9510271@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Free BYPOLYGON points before returning from invalid COUNT parsing paths
in GEOSEARCH/GEOSEARCHSTORE.
Closes #3567
---------
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Address this compile warning:
```c
CC util.o
util.c:638:1: warning: ‘no_sanitize’ attribute directive ignored [-Wattributes]
__attribute__((no_sanitize_address, no_sanitize("thread"), used)) static int (*string2ll_resolver(void))(const char *, size_t, long long *) {
^~~~~~~~~~~~~
```
Addresses portability concerns around these attributes.
---------
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
I noticed this LTO compile warning in the eval code. It looks like the
compiler gets confused about an sds length, even though it is checked
above. Just added an assert to clarify.
The warning:
```c
LINK valkey-server
eval.c: In function ‘evalExtractShebangFlags’:
eval.c:263:27: warning: argument 1 value ‘18446744073709551615’ exceeds maximum object size 9223372036854775807 [-Walloc-size-larger-than=]
*out_engine = zcalloc(engine_name_len + 1);
^
zmalloc.c:324:7: note: in a call to allocation function ‘valkey_calloc’ declared here
void *zcalloc(size_t size) {
^
```
Signed-off-by: Jim Brunner <brunnerj@amazon.com>
This change was introduced in #3382. This test is already very slow on
its own. Under valgrind it gets slow enough that the per-node restart
step lets primaries be marked FAIL and triggers failovers, after which
"Verify slaves consistency" no longer holds since it assumes the original
topology.
It was never run under valgrind before and exercises nothing valgrind
meaningfully covers, so just tag it valgrind:skip.
Signed-off-by: Binbin <binloveplay1314@qq.com>
The argument `count /= 2` modifies `count` as a side effect, and the
following `count /= 2` divides it again unnecessarily.
Since `count` is not used after this point, fix it by using `count / 2`
without the side effect and remove the redundant second assignment.
Signed-off-by: djk1027 <djk9510271@gmail.com>
Migrated the remaining cluster tests to tests/unit/cluster/ to use the same
framework for all cluster tests. Cleaned up the obsolete cluster test framework
files and updated the CI workflows to use the new unified test runner.
Changes:
Moved and mapped 6 test files:
- 03-failover-loop.tcl → Merged into existing failover.tcl
- 04-resharding.tcl → resharding.tcl
- 12-replica-migration-2.tcl + 12.1-replica-migration-3.tcl →
replica-migration-slow.tcl
- 07-replica-migration.tcl → Merged into existing replica-migration.tcl
- 28-cluster-shards.tcl → Merged into existing cluster-shards.tcl
Other changes:
- Converted old framework APIs (e.g., K, RI) to new framework APIs (e.g., R, srv)
- Added process_is_alive check in cluster_util.tcl to fix an exception in
failover tests caused by executing ps on dead processes
- Heavy tests (resharding, replica-migration-slow) marked with slow tag and
wrapped in run_solo to prevent resource contention in sanitizer environments
- replica-migration-slow marked with valgrind:skip tag since it is very slow
- Removed the entire tests/cluster/ directory including run.tcl, cluster.tcl,
includes/, and helpers/
- Kept runtest-cluster as a wrapper script (exec ./runtest --cluster "$@")
- Removed ./runtest-cluster calls from .github/workflows/daily.yml as cluster
tests are now included in ./runtest
Closes #2297.
Signed-off-by: Jun Yeong Kim <junyeonggim5@gmail.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Remove eval script cache entries that belong to a scripting engine when
that engine is unregistered. This prevents the eval cache from retaining
dangling engine pointers and keeps the tracked script memory in sync
after engine shutdown.
The scripting engine unregister path now invokes a new eval cleanup
helper, which scans the cached scripts, drops matching entries from the
LRU list and dictionary, and adjusts cache memory accounting accordingly.
Signed-off-by: Eran Ifrah <eifrah@amazon.com>
If not cleared, the job may no longer be valid by the time the client
goes to cleanup. This dangling reference could cause a crash if you set
slot-migration-log-max-len to 0 and are very unlucky.
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
The default value of lua-enable-insecure-api cannot be safely changed
from no to yes due to two issues:
1. In createEngineContext(), lua_enable_insecure_api was hardcoded to 0
before initializing Lua states, so deprecated APIs (newproxy, setfenv,
getfenv) were never registered in the global table regardless of the
actual config value. Once the global table is locked, the config
change has no effect.
2. lua_insecure_api_current was initialized to 0 (struct zero-init) and
never synced with the final config value. If the default was changed
to yes (1), a subsequent CONFIG SET no would see both values as 0 and
skip the evalReset() call in updateLuaEnableInsecureApi().
Fix by reading the real config via isLuaInsecureAPIEnabled() in
createEngineContext() before Lua state initialization, and syncing
lua_insecure_api_current after all config sources (default, config file,
command-line args) are applied.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Here we should jump to the `error` label to free the resources:
```
error:
if (listen_cmid) rdma_destroy_id(listen_cmid);
if (listen_channel) rdma_destroy_event_channel(listen_channel);
ret = ANET_ERR;
end:
freeaddrinfo(servinfo);
return ret;
}
```
Signed-off-by: Binbin <binloveplay1314@qq.com>
The bug was in dismissHashtable(), which computes the size passed to
zmadvise_dontneed() for the top-level hashtable tables.

ht->tables[i] points to a contiguous array of bucket objects, but the
code used sizeof(bucket *) instead of sizeof(bucket) when calculating
the length. That means it treated the allocation like an array of
pointers rather than an array of buckets.

As a result, the advised range was much smaller than the actual table
allocation. On 64-bit builds, bucket is 64 bytes while bucket * is 8
bytes, so only about one eighth of the table was covered. This does not
usually break correctness, but it defeats the purpose of the function:
after a fork, we want to tell the kernel that the hashtable pages are no
longer needed so we reduce copy-on-write overhead. With the wrong size,
most of the table memory was never included in that hint.

The fix is to use sizeof(bucket) so the full top-level bucket array is
passed to zmadvise_dontneed().
Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
### Summary
The daily CI sanitizer jobs with clang are failing during the build
step.
When the static Lua module is built with `-flto`, the `.o` files contain
LLVM bitcode that gets archived into `libvalkeylua.a`. The system linker
cannot read this bitcode, causing build failures:
`/usr/bin/ld:
/home/runner/work/valkey/valkey/src/modules/lua/libvalkeylua.a: member
/home/runner/work/valkey/valkey/src/modules/lua/libvalkeylua.a(debug_lua.o)
in archive is not an object`
The previous fix (#3546) pinned clang to version 17, but this was
insufficient: the issue is not just a version mismatch but that the
system linker fundamentally cannot read LTO bitcode from `.a` archives.
Example failure:
https://github.com/valkey-io/valkey/actions/runs/24865821147/job/72801509768
### Fix
Strip LTO flags from OPTIMIZATION in the Lua module Makefile using
`override`.
Tested: https://github.com/hanxizh9910/valkey/actions/runs/24913834442
---------
Signed-off-by: Hanxi Zhang <hanxizh@amazon.com>
This PR bootstraps Valkey's provenance guard integration.
The provenance guard is a content-based similarity detection system that helps maintain proper code provenance by comparing incoming PR changes against fingerprint databases built from Redis commits and PRs. The matching logic now lives in the external `valkey-io/verify-provenance` action repository; this PR wires Valkey to that action and seeds the required database branch.
Key features:
* Content-based detection: Uses normalized diff fingerprints and fuzzy matching to detect similar changes, including cases where files have moved or been refactored.
* Externalized action logic: The check and refresh implementation is maintained in `valkey-io/verify-provenance` and is pinned by exact commit SHA from Valkey workflows.
* Provenance Guard workflow: Runs on PR activity to check incoming changes against the provenance databases and report potential matches.
* Daily Refresh workflow: Runs daily to refresh PR fingerprints and commits updated data back to `verify-provenance-db`.
* Dedicated DB branch: Stores provenance databases on the orphan `verify-provenance-db` branch, separate from Valkey source code.
* Privacy-first storage: Stores compressed non-reversible fingerprints, not source code.
The initial `verify-provenance-db` branch has been bootstrapped with fingerprints of Redis commits and PRs.
---------
Signed-off-by: Ping Xie <pingxie@outlook.com>
This follows up on the commandresult API work and fixes cleanup around
unsubscribe and module unload.
The main issue was that command-result event listeners could leave stale
state behind. On unload, we removed the listeners themselves but didn’t
fully update the fast-path listener counters. Separately, unsubscribing
with a NULL callback could behave badly if the listener wasn’t present
anymore. In practice, that meant later commands could still walk into
command-result event handling after the module was supposed to be
cleaned up.
Failed in Daily as well yesterday:
https://github.com/valkey-io/valkey/actions/runs/24753491944/job/72421581610#step:10:852
Related Failures:
https://github.com/valkey-io/valkey/pull/2936#issuecomment-4290490199
---------
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
### Analysis
The daily CI sanitizer jobs with clang are failing during the build
step.
The `ubuntu-latest` runner now has clang 18, but the LLVM gold plugin
is still version 17. When the static Lua module is built with `-flto`,
the `.o` files contain LLVM 18 bitcode that the gold plugin (v17) cannot
read:
`bfd plugin: LLVM gold plugin has failed to create LTO module: Unknown
attribute kind (91) (Producer: 'LLVM18.1.3' Reader: 'LLVM 17.0.6')
`
Example failure:
https://github.com/valkey-io/valkey/actions/runs/24753491944/job/72421581512
### Fix
Pin the sanitizer jobs to `clang-17` so the compiler and gold plugin
versions match.
Tested(successfully built):
https://github.com/hanxizh9910/valkey/actions/runs/24859845008
### Note
If `clang-17` is removed from the `ubuntu-latest` image in the future,
we may need to add an explicit install step.
Signed-off-by: Hanxi Zhang <hanxizh@amazon.com>
The SPMC queue from #3324 needs each `spmcCell` to be cache-line
aligned, but plain `zmalloc()` does not guarantee that in all build
configurations.
This change introduces `zmalloc_cache_aligned()` and uses it for the
SPMC queue buffer allocation in `spmcInit()`.
Failing CI: https://github.com/valkey-io/valkey/actions/runs/24374139344
---------
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Match HGETDEL with the existing batch-delete pattern used by HDEL.
HDEL already pauses hashtable auto-shrink while deleting multiple fields
so shrink evaluation is deferred until the batch completes. HGETDEL was
missing the same optimization even though it also deletes fields in a loop.
Pause auto-shrink for hashtable-encoded hashes before the HGETDEL delete
loop and resume it once afterwards. This preserves observable behavior
and reduces redundant shrink work for multi-field deletes.
Same as #3144.
Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
## Summary
Fix a file descriptor leak in `connSocketBlockingConnect()` when
`aeWait()` times out.
## Bug
When `anetTcpNonBlockConnect()` succeeds but `aeWait()` times out (e.g.,
MIGRATE to an unreachable host), the fd is leaked because it was never
assigned to `conn->fd`. The caller's `connClose()` checks `conn->fd !=
-1` and skips cleanup.
## Fix
Assign `conn->fd = fd` immediately after `anetTcpNonBlockConnect()`
succeeds, before `aeWait()`. This way the caller's normal `connClose()`
cleanup path handles the fd on any error, which is consistent with how
the rest of the connection lifecycle works.
TLS connections also benefit since `connTLSBlockingConnect` delegates to
this function for the TCP layer.
## Reproducer
```
valkey-cli SET key hello
# Repeat against unreachable host:
for i in $(seq 1 30); do valkey-cli MIGRATE 192.0.2.1 6379 key 0 500; done
# Check: /proc/<pid>/fd shows 30 leaked socket fds
```
*This issue was generated by AI but verified, with love, by a human.*
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
There is a double free issue in the code. The error handling path called
both decrRefCount(o) and streamFreeNACK(nack), but the nack was obtained
from cgroup->pel via raxFind and is still referenced there. decrRefCount(o)
frees it through freeStream -> streamFreeCG -> raxFreeWithCallback(cg->pel, zfree),
so the explicit streamFreeNACK(nack) causes a double free.
Remove the redundant streamFreeNACK(nack) call and add a regression
test with a crafted corrupt payload that triggers the duplicate consumer
PEL entry path.
This was introduced in 492d8d0961.
Signed-off-by: Binbin <binloveplay1314@qq.com>
## Problem
`Fix cluster` in `tests/unit/cluster/many-slot-migration.tcl` has been
timing out daily on valgrind jobs since April 3, 2026. The test runs 10
cluster nodes under valgrind, migrating 40,000 keys across 1,000 slots —
too much work for valgrind-instrumented builds.
The slowdown is caused by #3366 (dict→hashtable wrapper). Under `-O0`
(valgrind builds), the `static inline` wrappers become real function
calls that valgrind instruments, adding ~75% overhead to hot paths like
`dictSize`. This compounds across 10 valgrind processes over a 20-minute
migration test. No impact on production builds (`-O2` inlines everything).
## Fix
Scale the test workload down under valgrind: 10,000 keys / 250 slots
instead of 40,000 / 1,000. Normal runs are unchanged. Still exercises
the same cluster repair path.
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: sarthakaggarwal97 <sarthakaggarwal97@users.noreply.github.com>
https://github.com/valkey-io/valkey/pull/3324 introduced `BATCH_SIZE` as
a const int local variable and used it as an array bound. Clang 17
rejects this with:
```
io_threads.c:305:22: error: variable length array folded to constant array as an extension [-Werror,-Wgnu-folding-constant]
305 | void *batch_jobs[BATCH_SIZE];
| ^~~~~~~~~~
1 error generated.
make[1]: *** [io_threads.o] Error 1
make: *** [all] Error 2
```
Old Clang versions do not emit this warning, maybe that is why the CI
passed. Fix by promoting `BATCH_SIZE` to a file-scope `#define`.
Signed-off-by: Yang Zhao <zymy701@gmail.com>
## Add Command Result Event Notifications for Modules
### Summary
1. Adds new server events `ValkeyModuleEvent_CommandResultSuccess` and
`ValkeyModuleEvent_CommandResultFailure` that can notify subscribed
modules after command execution. This enables modules to implement audit
logging, error monitoring, performance tracking, and observability
without modifying core server code.
2. Adds new server event `ValkeyModuleEvent_CommandResultACLDenied` for
commands rejected by ACL. Together with PR #2237 this covers auditing of
authentication and authorisation.
### Motivation
There is currently no module API to observe command outcomes after
execution or to capture ACL denied commands. Modules that need audit
logging or error monitoring have no mechanism to be notified when
commands succeed or fail, what arguments were used, how long they took,
or how many keys were modified. This feature fills that gap using the
existing `ValkeyModule_SubscribeToServerEvent()` infrastructure.
### API
#### Events
| Event | Description |
|---|---|
| `ValkeyModuleEvent_CommandResultSuccess` | Fired after a command
completes successfully |
| `ValkeyModuleEvent_CommandResultFailure` | Fired after a command
returns an error |
| `ValkeyModuleEvent_CommandResultACLDenied` | Fired after a command is
rejected by ACL |
These are separate events (not sub-events), so modules can for example
only subscribe to failures without incurring any callback overhead for
successful commands.
#### Event Data: `ValkeyModuleCommandResultInfo`
The `data` pointer passed to the callback can be cast to
`ValkeyModuleCommandResultInfo`:
```c
typedef struct ValkeyModuleCommandResultInfo {
uint64_t version; /* Version of this structure for ABI compat. */
const char *command_name; /* Full command name (e.g., "SET", "CLIENT|LIST"). */
long long duration_us; /* Execution duration in microseconds. */
long long dirty; /* Number of keys modified. */
uint64_t client_id; /* Client ID that executed the command. */
int is_module_client; /* 1 if command was from RM_Call, 0 otherwise. */
int argc; /* Number of command arguments. */
ValkeyModuleString **argv; /* Command arguments array (zero-copy, read-only). */
int acl_deny_reason; /* ACL_DENIED_CMD/KEY/CHANNEL/AUTH; 0 for non-ACL events */
const char *acl_object; /* Denied resource name (key/channel); NULL for CMD/AUTH */
} ValkeyModuleCommandResultInfoV1;
```
The struct is versioned (`VALKEYMODULE_COMMANDRESULTINFO_VERSION`) for
forward-compatible API evolution.
### Usage Example
```c
/* Callback receives events for whichever event(s) you subscribed to */
void OnCommandResult(ValkeyModuleCtx *ctx, ValkeyModuleEvent eid,
uint64_t subevent, void *data) {
VALKEYMODULE_NOT_USED(ctx);
VALKEYMODULE_NOT_USED(subevent);
ValkeyModuleCommandResultInfo *info = (ValkeyModuleCommandResultInfo *)data;
if (info->version != VALKEYMODULE_COMMANDRESULTINFO_VERSION) return;
int failed = (eid.id == VALKEYMODULE_EVENT_COMMAND_RESULT_FAILURE);
/* Access fields directly */
printf("command=%s status=%s duration=%lldus dirty=%lld client=%llu\n",
info->command_name,
failed ? "FAIL" : "OK",
info->duration_us,
info->dirty,
info->client_id);
/* Access argv (read-only, zero-copy) */
for (int i = 0; i < info->argc; i++) {
size_t len;
const char *arg = ValkeyModule_StringPtrLen(info->argv[i], &len);
printf(" argv[%d] = %.*s\n", i, (int)len, arg);
}
}
/* Subscribe in ValkeyModule_OnLoad or at runtime */
/* Option A: command failures only (recommended for audit logging) */
ValkeyModule_SubscribeToServerEvent(ctx,
ValkeyModuleEvent_CommandResultFailure, OnCommandResult);
/* Option B: command successes only */
ValkeyModule_SubscribeToServerEvent(ctx,
ValkeyModuleEvent_CommandResultSuccess, OnCommandResult);
/* Option C: both command outcomes */
ValkeyModule_SubscribeToServerEvent(ctx,
ValkeyModuleEvent_CommandResultSuccess, OnCommandResult);
ValkeyModule_SubscribeToServerEvent(ctx,
ValkeyModuleEvent_CommandResultFailure, OnCommandResult);
/* Subscribe to ACL Denied */
ValkeyModule_SubscribeToServerEvent(ctx,
ValkeyModuleEvent_CommandResultACLDenied, OnCommandResult);
/* Unsubscribe by passing a NULL callback */
ValkeyModule_SubscribeToServerEvent(ctx,
ValkeyModuleEvent_CommandResultFailure, NULL);
```
### Design Decisions
- **Separate events instead of sub-events**: Modules subscribing only to
failures have zero overhead for successful commands (~2ns listener-list
check vs ~30ns callback invocation per command). This is critical since
success events fire on the hot path of every command.
- **Stack-allocated info struct**: The `ValkeyModuleCommandResultInfoV1`
is built on the stack; no heap allocation per event.
- **Zero-copy argv**: Arguments are passed directly from the client's
argv array. Any integer-encoded arguments (from `tryObjectEncoding()`
during command execution) are decoded to string-encoded objects before
being passed to the callback, ensuring compatibility with
`ValkeyModule_StringPtrLen()`.
- **Early exit**: If no modules are subscribed to any server events, the
event firing function returns immediately before building the info
struct.
- **Uses existing server event infrastructure**: Follows the
`ValkeyModule_SubscribeToServerEvent()` pattern used by all other server
events, rather than introducing a new callback mechanism.
### Files Changed
| File | Change |
|---|---|
| `src/valkeymodule.h` | Event IDs, event constants,
`ValkeyModuleCommandResultInfoV1` struct |
| `src/module.c` | `moduleFireCommandResultEvent()`, event
documentation, event version entries |
| `src/module.h` | Function declaration |
| `src/server.c` | Call `moduleFireCommandResultEvent()` from `call()`
after command execution |
| `src/server.c` | Call to `moduleFireCommandACLDeniedEvent` in
`processCommand` after ACL rejection |
| `tests/modules/commandresult.c` | Test module exercising the full API
|
| `tests/unit/moduleapi/commandresult.tcl` | Integration tests |
---------
Signed-off-by: martinrvisser <mvisser@hotmail.com>
Signed-off-by: martinrvisser <martinrvisser@users.noreply.github.com>
Co-authored-by: Ricardo Dias <rjd15372@gmail.com>
Add a build option to compile the Lua scripting engine as a static
module and wire the server to load it directly at startup when enabled.
The module load path now resolves on-load and on-unload entry points
from the main binary, and the module lifecycle keeps those callbacks so
unload works without a shared library handle.
The Lua module build was updated to support both static and shared
variants, with the static path exporting visible wrapper symbols and
linking the server with the module archive. While touching the Lua code,
a few internal symbols were renamed for consistency and the monotonic
time helper was clarified.
Note that this PR addresses the Lua module, but the approach can be
applied to other "core" modules (such as Bloom, Json, and Search). With
this change, it will be easier to ship a Valkey bundle with modules.
Areas touched:
* CMake
* Makefile
* Lua scripting module
* Core module loading
**Generated by CodeLite**
---------
Signed-off-by: Eran Ifrah <eifrah@amazon.com>
This improves COB memory tracking when using copy avoidance for bulk
string replies. This fix addresses underestimation of client memory
usage that occurred when reply buffers stored pointers to shared `robj`
instead of copying data.
IO threads calculate actual reply sizes by calling `sdslen()` on strings
before writing; this requires an atomic `tracked_for_cob` flag in
payload headers to prevent race conditions and double accounting.
See #2396
---------
Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
`hpersistCommand` calls `addReplyArrayLen` before `lookupKeyWrite` +
`checkType`. When HPERSIST targets a non-hash key, the server writes a
RESP array header followed by a WRONGTYPE error — a malformed response
that permanently desynchronizes the client connection.
This moves `lookupKeyWrite` + `checkType` before `addReplyArrayLen`,
matching the pattern used by every other HFE command (e.g.
`hgetdelCommand`, `hexpireGenericCommand`).
Added a test for HPERSIST on a wrong-type key.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
## Problem
The test `Test module aof save on server start from empty` in
`tests/unit/moduleapi/hooks.tcl` sporadically crashes with `I/O error
reading reply`.
**Frequency:** 2 out of 15 days (March 26 on
`centosstream9-tls-module-no-tls`, April 8 on `fedorarawhide-jemalloc`).
**Example failing run:**
https://github.com/valkey-io/valkey/actions/runs/24110987718/job/70345236353
## Root Cause
The crash is a **use-after-unload** in the auth test module's blocking
authentication thread, NOT a timing issue in the AOF test.
The crash log from April 8 shows:
```
71112:M 00:42:59.710 * Module testacl unloaded
71112:M 00:42:59.711 # crashed by signal: 11, si_code: 1
71112:M 00:42:59.711 # Crashed running the instruction at: 0x7f9dc717384b
```
The sequence:
1. `blocking_auth_cb` spawns a background thread
(`AuthBlock_ThreadMain`) that sleeps 500ms
2. Thread wakes, calls `ValkeyModule_UnblockClient()` → main thread
processes unblock, decrements `module->blocked_clients`
3. Auth command completes, test calls `r module unload testacl`
4. `moduleUnloadInternal` checks `blocked_clients == 0`; if true, it
proceeds with `dlclose()`
5. **But the background thread is still executing cleanup code**
(freeing strings, returning from function)
6. Thread returns into unmapped memory → **SIGSEGV**
The `invalidFunctionWasCalled` in the stack trace is the crash handler's
safety stub, and the crashing address `0x7f9dc717384b` is in the
unmapped auth.so address space.
## Fix
Track the background thread ID and `pthread_join()` it in
`ValkeyModule_OnUnload` before the module is dlclose'd. This ensures the
thread has fully exited before the code is unmapped.
The key insight is that `ValkeyModule_UnblockClient()` signals "auth is
done" but not "thread is done" — the thread still has cleanup code to
execute after that call. `pthread_join()` is the correct synchronization
point because it only returns after the thread has fully exited.
No mutex is needed since both `blocking_auth_cb` (which creates the
thread) and `OnUnload` (which joins it) run on the main event loop
thread.
Changes to `tests/modules/auth.c`:
- Add global `blocking_auth_tid` and `blocking_auth_tid_valid` flag
- Set `blocking_auth_tid_valid = 1` after successful `pthread_create`
- In `OnUnload`, `pthread_join` the thread if one was created
## Testing
Ran `unit/moduleapi/hooks` 100 loops on rpm-distros and ubuntu runners —
**all passed**:
- **Workflow run:**
https://github.com/roshkhatri/valkey/actions/runs/24164276124
- **Config:** `--loops 100 --single unit/moduleapi/hooks` on
`almalinux8`, `almalinux9`, `fedoralatest`, `fedorarawhide`,
`centosstream9`, `ubuntu-jemalloc`, `ubuntu-arm`
- **Result:** 7/7 jobs ✅, zero failures across 700 total test iterations
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Instead of "scanning" random bucket chains using a random cursor for
each scan call, start at a random cursor and then continue sampling
buckets in scan order. The scan stops when we have sampled all elements,
so the cursor never wraps around to sample the same buckets again. This
ensures that we don't get any duplicate samples.
The functions hashtableRandomEntry and hashtableFairRandomEntry keep
the old behavior by using a separate sampling function. The fairness
tests fail if they use the modified hashtableSampleEntries.
This restores the behavior of dictGetSomeKeys (which is now an alias of
hashtableSampleEntries) and deflakes the test case:
Gossip count scales with higher percentage of
`cluster-message-gossip-perc`
in tests/unit/cluster/packet.tcl
Fixes #3454
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The test introduced in #2227 is fragile because it unconditionally asserted
that "best ranked replica" would be logged for every shard's replica. This
assumption was incorrect for two reasons:
1. myselfIsBestRankedReplica() requires failover_failed_primary_rank == 0.
When two primaries fail simultaneously, only the shard with the lowest
shard_id gets failed_primary_rank == 0. The other shard's replicas can
only trigger "Myself become the best ranked replica" after the first
shard completes failover and the result propagates via gossip.
2. clusterAllReplicasThinkPrimaryIsFail() requires all sibling replicas
to have their MY_PRIMARY_FAIL flag propagated via gossip. If the
election delay is shorter than the gossip propagation time, the
replica may start the election before receiving the flag.
The test was also too strict and checked too many things. For example,
we can check that there is no delay before starting the failover, but we
shouldn't check that the rank 0 replica actually wins, because it may
not. We also can't be sure the other replica does a psync afterwards; it
may do a full sync instead.
Restructure the test into three focused scenarios with retry logic:
- Test 0 (3 primaries + 1 replica): The sole replica is deterministically
the best ranked replica. All conditions are trivially satisfied.
- Test 1 (3 primaries + 2 replicas): Single primary failure with two
competing replicas. failed_primary_rank is deterministically 0, but
MY_PRIMARY_FAIL gossip timing may prevent triggering in some runs.
Uses internal retry with cluster recovery to ensure at least one
successful trigger.
- Test 2 (5 primaries + 7 replicas): Two primaries failure. Verifies no
failover timeout and uses retry to cover the "best ranked replica"
path for at least one shard.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Adds cluster bus byte metrics to `CLUSTER INFO`, split by admin,
pub/sub, and module traffic. The change tracks sent and received bytes
on the cluster bus and exposes them as
`cluster_stats_*_bytes_{sent,received}`.
Example:
```
> src/valkey-cli CLUSTER INFO
cluster_state:fail
...
cluster_stats_bytes_sent:46088
cluster_stats_bytes_received:46080
cluster_stats_pubsub_bytes_sent:0
cluster_stats_pubsub_bytes_received:0
cluster_stats_module_bytes_sent:0
cluster_stats_module_bytes_received:0
total_cluster_links_buffer_limit_exceeded:0
```
Fixes: https://github.com/valkey-io/valkey/issues/1929
---------
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Minor cleanup: the function will return -1 on error.
It worked anyway before because the value is implicitly converted:
-1 => SIZE_MAX => -1 again when it's implicitly cast back to
ssize_t.
Signed-off-by: Binbin <binloveplay1314@qq.com>
This allows setting the host and port using environment variables, which
makes using valkey-cli way easier in some development environments. No
need to recall the right host/port, just let your `.env` do the work.
---
Built locally and tested with,
```fish
VALKEY_PORT=36379 ./src/valkey-cli
```
```console
127.0.0.1:36379>
```
And using both,
```fish
VALKEYCLI_HOST=(hostname).local VALKEYCLI_PORT=36379 ./src/valkey-cli
```
```console
<HOSTNAME>.local:36379>
```
Signed-off-by: Dietrich Daroch <Dietrich@Daroch.me>
#### Summary
This PR redesigns the IO threading communication model, replacing the
inefficient client-list polling approach with a high-performance,
lock-free queue architecture. This change improves throughput by
**8–17%** across various workloads and lays the groundwork for
offloading command execution to IO threads in following PRs.
### Performance Comparison: Unstable vs New IO Queues
| Type | Operation | Unstable Branch (M TPS) | New IO Queues (M TPS)|
Difference (%) |
| :--- | :--- | :--- | :--- | :--- |
| **CME**<sup>1</sup> | SET | 1.02 | 1.19 | **+16.67%** |
| **CME** | GET | 1.30 | 1.47 | **+13.08%** |
| **CMD**<sup>2</sup> | SET | 1.15 | 1.35 | **+17.39%** |
| **CMD** | GET | 1.52 | 1.64 | **+7.89%** |
<sup>1</sup> Amazon terminology for cluster mode
<sup>2</sup> Amazon terminology for standalone mode, i.e. config
`cluster-enabled no`
- Test Configuration: 8 IO threads • 400 clients • 512-byte values • 3M
keys
#### Motivation
The previous IO model had several limitations that created performance
bottlenecks:
* **Inefficient Polling:** The main thread lacks a direct notification
mechanism for completed work. Instead, it must constantly iterate
through a list of all pending clients to check their state, wasting
significant CPU cycles.
* **Manual Load Balancing:** Jobs are assigned to specific threads
upfront. This requires the main thread to predict which thread to use,
often leaving some threads idle while others are overloaded.
* **Static Scaling:** Thread activation relies on a fixed heuristic
(e.g., 1 thread per 2 events). This approach fails to adapt to varying
workloads, such as TLS connections or differing read/write sizes.
### The Solution
To address these inefficiencies, this PR replaces the single SPSC queue
used currently with three specialized queues to handle communication and
load balancing more effectively.
#### 1. Main > IO: Shared Queue (Single Producer Multi Consumer)
Single queue from the main-thread to IO threads.
* **Automatic Load Balancing:** All threads pull from the same source.
Busy threads take less work, and idle threads take more, so we don't
need to manually select a thread.
* **Adaptive Scaling:** We now use the queue depth to decide when to add
or remove threads. If the queue is full, we scale up; if it's empty, we
scale down.
* *Ignition:* To get things started before the queue fills up, we
monitor the main thread's CPU. If usage goes over 30%, we wake up the
first IO thread.
* **Implementation:** To prevent contention among consumers, each item
in the ring buffer is padded to reside in its own cache line. Sequence
numbers are utilized to indicate whether a cell is empty or populated,
allowing threads to safely claim work.
#### 2. IO > Main: The Response Channel (MPSC Queue)
We replaced the old polling loop with a response queue.
* **Faster Completion:** IO threads push completed jobs into this
queue. The main thread detects new data simply by checking if the queue
is not empty, removing the need to scan pending clients.
* **Contention Management:** To avoid lock contention, each thread
reserves a slot by atomically incrementing the tail index. In the rare
event that the queue is full, pending jobs are buffered in a local
temporary list until space becomes available.
#### 3. Main > IO (Thread-Specific): Private Inbox (SPSC Queue)
We kept the existing Single-Producer Single-Consumer (SPSC) queues for
tasks that must happen on a specific thread (like freeing memory
allocated by that thread). IO threads always check their private inbox
before looking at the shared queue.
### Changes Required
* **Async client release**
The main thread no longer busy-waits for IO threads to finish with a
client. Since the client must be popped from the multi-producer queue
before it can be released, clients with pending IO are now marked for
asynchronous closure.
* **eviction clients logic**
Updated evictClients() to account for memory pending release (clients
marked close_asap). freeClient() now returns a status code (1 for freed,
0 for async-close) to ensure the eviction loop does not over-evict by
ignoring memory that is about to be reclaimed.
* **events-per-io-thread config**
Replaced the `events-per-io-thread` configuration with
`io-threads-always-active`, as we no longer track events. Since this
config is used only for tests, no backward-compatibility issue arises.
* **packed job instead of handlers**
Jobs are now represented as tagged pointers (using lower 3 bits for job
type) instead of separate `{handler, data}` structs. This reduces memory
overhead and allows jobs to be passed through the queues as single
pointers.
* **head caching in spsc queue**
The SPSC queue now caches the `head` index on the producer side
(`head_cache`) to avoid frequent atomic loads. The producer only
refreshes from the atomic `head` when the cache indicates the queue
might be full, reducing cross-thread cache-line bouncing.
* **deferred commit in SPSC queue**
`spscEnqueue()` supports batching via a `commit` flag. Multiple jobs can
be enqueued with `commit=false`, then flushed with a single
`spscCommit()` call, reducing atomic operations and cache-line bouncing.
* **rollback on fullness check failure**
When `spmcEnqueue()` fails due to a full queue, the client state is
rolled back (e.g., `io_write_state` reset to `CLIENT_IDLE`). This
rollback approach removes the need for an expensive `isFull` check
before every enqueue: we just attempt the enqueue and revert if it
fails.
* **epoll offloading via SPSC at high thread counts**
When `active_io_threads_num > 9`, poll jobs are sent to per-thread SPSC
queues (round-robin). Since threads check their private queue first,
this ensures poll jobs are processed promptly without waiting behind
jobs in the shared SPMC queue.
* **avoid offload write before read comes back**
Added a check `if (c->io_read_state == CLIENT_PENDING_IO) return C_OK`
in `trySendWriteToIOThreads()`. In the previous per-thread SPSC
implementation, we could send consecutive read and write jobs for the
same client knowing a single thread would handle them in order. With the
shared SPMC queue, different threads may pick up the jobs, so we must
wait for the read to complete before sending a write to avoid 2 threads
handling the same client.
* **removing pending_read_list_node from client and
clients_pending_io_read/write lists from server**
Removed `pending_read_list_node` from the `client` struct and the
`clients_pending_io_read`/`clients_pending_io_write` lists from
`valkeyServer`, as the new MPSC queue eliminates the need for these
tracking structures.
* **added inst metrics for pending io jobs**
Added `instantaneous_io_pending_jobs` metric via `STATS_METRIC_IO_WAIT`
to track average queue depth over time.
* **added stat for current active threads number**
Added `active_io_threads_num` to the INFO stats output for better
visibility.
* **added internal inst metric for main-thread cpu (not available on
macOS)**
Added `STATS_METRIC_MAIN_THREAD_CPU_SYS` to track main-thread CPU usage
via `getrusage(RUSAGE_THREAD)`. This powers the "ignition" policy: when
CPU exceeds 30%, the first IO thread is activated. `RUSAGE_THREAD` is
Linux-specific, so macOS falls back to event-count heuristics.
* **added stat for pending read and writes for io**
Added `io_threaded_reads_pending` and `io_threaded_writes_pending` stats
to track how many read/write jobs are currently in-flight to IO threads.
* **added volatile for crashed**
Changed `server.crashed` from `int` to `volatile int` so the crash flag
becomes visible across threads immediately, allowing IO threads to
detect a crash and stop sending responses to the main thread, which
avoids a deadlock during crash handling.
---------
Signed-off-by: Uri Yagelnik <uriy@amazon.com>
Signed-off-by: akash kumar <akumdev@amazon.com>
Co-authored-by: Uri Yagelnik <uriy@amazon.com>
Co-authored-by: Dan Touitou <dan.touitou@gmail.com>
This PR introduces `ValkeyModule_CallArgv`, a new low-level module API
for calling server commands. It is designed as a complement to the
existing `ValkeyModule_Call` covering two scenarios where
`ValkeyModule_Call` is suboptimal:
1. **The module already holds the arguments as an argv array.**
`ValkeyModule_Call` always allocates a new argv array, creates a new
string object for the command name, and calls
`incrRefCount`/`decrRefCount` on every argument. `ValkeyModule_CallArgv`
borrows the caller's argv array directly — no allocation, no refcount
manipulation. Ownership of the array stays with the caller.
2. **The module wants to forward the reply to its own client without
inspecting it.** `ValkeyModule_Call` always parses the raw RESP bytes
into an intermediate `ValkeyModuleCallReply` tree and then re-serializes
it back to RESP to send to the caller. `ValkeyModule_CallArgv` exposes
the raw RESP bytes directly through a callback, allowing a pass-through
module command to copy bytes from the inner command's output buffer
straight into the outer client's output buffer.
---
## New API
### `ValkeyModule_CallArgv`
```c
int ValkeyModule_CallArgv(ValkeyModuleCtx *ctx,
ValkeyModuleString **argv,
int argc,
int flags,
ValkeyModuleReplyHandlers *reply_handlers,
void *reply_ctx);
```
Calls a server command using an existing argv array. The caller retains
ownership of `argv` and its elements — they are never freed or
ref-counted by the function.
Returns `VALKEYMODULE_OK` on success or `VALKEYMODULE_ERR` on error,
with `errno` set to the same values as `ValkeyModule_Call`. The function
does **not** return a `ValkeyModuleCallReply`; the reply is delivered
through `reply_handlers` instead.
`flags` is a bitwise OR of one or more `VALKEYMODULE_CALL_ARGV_*`
constants, which correspond directly to the format-string specifiers of
`ValkeyModule_Call`:
| Flag | Equivalent format char | Meaning |
|---|---|---|
| `VALKEYMODULE_CALL_ARGV_REPLICATE` | `!` | Propagate to replicas and AOF |
| `VALKEYMODULE_CALL_ARGV_NO_AOF` | `A` | Skip AOF propagation |
| `VALKEYMODULE_CALL_ARGV_NO_REPLICAS` | `R` | Skip replica propagation |
| `VALKEYMODULE_CALL_ARGV_RESP_3` | `3` | Force RESP3 for the inner call |
| `VALKEYMODULE_CALL_ARGV_RESP_AUTO` | `0` | Match the calling client's protocol (RESP2 or RESP3) |
| `VALKEYMODULE_CALL_ARGV_RUN_AS_USER` | `C` | Apply ACL checks for the context user |
| `VALKEYMODULE_CALL_ARGV_SCRIPT_MODE` | `S` | Mark as script execution |
| `VALKEYMODULE_CALL_ARGV_NO_WRITES` | `W` | Disallow write commands |
| `VALKEYMODULE_CALL_ARGV_ERRORS_AS_REPLIES` | `E` | Return error replies instead of raising errno |
| `VALKEYMODULE_CALL_ARGV_RESPECT_DENY_OOM` | `M` | Honour deny-oom policy |
| `VALKEYMODULE_CALL_ARGV_DRY_RUN` | `D` | Dry-run mode (implies `ERRORS_AS_REPLIES`) |
| `VALKEYMODULE_CALL_ARGV_ALLOW_BLOCK` | `K` | Allow the inner command to block |
| `VALKEYMODULE_CALL_ARGV_REPLY_EXACT` | `X` | Disable reply type coercion |
---
### `ValkeyModuleReplyHandlers`
```c
typedef struct ValkeyModuleReplyHandlers {
/* Scalar types */
void (*null)(void *ctx);
void (*bulkString)(void *ctx, const char *str, size_t len);
void (*simpleString)(void *ctx, const char *str, size_t len);
void (*error)(void *ctx, const char *msg, size_t len);
void (*integer)(void *ctx, long long val);
void (*doubleVal)(void *ctx, double val);
void (*boolVal)(void *ctx, int val);
void (*bigNumber)(void *ctx, const char *str, size_t len);
void (*verbatimString)(void *ctx, const char *str, size_t len, const char *fmt);
/* Collection types — paired start/end callbacks */
void (*arrayStart)(void *ctx, size_t len);
void (*arrayEnd)(void *ctx);
void (*mapStart)(void *ctx, size_t len);
void (*mapEnd)(void *ctx);
void (*setStart)(void *ctx, size_t len);
void (*setEnd)(void *ctx);
void (*attributeStart)(void *ctx, size_t len);
void (*attributeEnd)(void *ctx);
/* Invoked on parse error */
void (*replyParsingError)(void *ctx);
/* Raw-bytes intercept — called before per-type callbacks */
int (*onRespAvailable)(void *ctx, ValkeyModuleCtx *module_ctx, const char *proto, size_t proto_len);
/* Blocking command support */
void (*onBlocked)(void *ctx, ValkeyModuleCtx *module_ctx, ValkeyModuleCallArgvBlockedHandle *handle);
void *context; /* Passed as first argument to every callback */
} ValkeyModuleReplyHandlers;
```
All fields are optional; set unused ones to `NULL`.
**`onRespAvailable`** is called first, before any per-type callback,
with the complete raw RESP reply in the `proto`/`proto_len` arguments.
Return `1` to continue into the per-type callbacks; return `0` to stop
after the raw bytes. Pass `NULL` to skip and go straight to per-type
dispatching.
**`onBlocked`** is only called when `VALKEYMODULE_CALL_ARGV_ALLOW_BLOCK`
is set and the inner command blocks. The `handle` can be passed to
`ValkeyModule_CallArgvAbort` to cancel the call. The handle is valid
until either `onRespAvailable` is invoked or
`ValkeyModule_CallArgvAbort` is called, whichever comes first.
---
### `ValkeyModule_CallArgvAbort`
```c
int ValkeyModule_CallArgvAbort(ValkeyModuleCallArgvBlockedHandle *handle);
```
Cancels a blocking call whose `onBlocked` callback has already fired.
Returns `VALKEYMODULE_OK` if the abort succeeded (neither
`onRespAvailable` nor any other reply callback will be invoked), or
`VALKEYMODULE_ERR` if the call has already completed.
---
### `ValkeyModule_ReplyRaw`
```c
int ValkeyModule_ReplyRaw(ValkeyModuleCtx *ctx, const char *proto, size_t proto_len);
```
Appends raw RESP bytes directly to the calling client's output buffer,
with no parsing or re-serialization. Intended for use inside an
`onRespAvailable` callback to implement zero-copy reply forwarding.
---
## Usage Example: `ValkeyModule_Call` vs `ValkeyModule_CallArgv`
The following two implementations of a generic pass-through proxy
command are functionally equivalent, but use different code paths
internally.
### With `ValkeyModule_Call` (existing API)
```c
int ProxyCommand(ValkeyModuleCtx *ctx, ValkeyModuleString **argv, int argc) {
if (argc < 2) return ValkeyModule_WrongArity(ctx);
/* VM_Call builds a new argv array internally, creates a string object for
* the command name, and calls incrRefCount/decrRefCount on every argument.
* The reply is parsed into a CallReply tree, then re-serialized to RESP. */
ValkeyModuleCallReply *reply =
ValkeyModule_Call(ctx, ValkeyModule_StringPtrLen(argv[1], NULL), "v",
argv + 2, (size_t)(argc - 2));
/* VM_Call returns NULL on error with errno set. */
if (reply == NULL) return ValkeyModule_ReplyWithError(ctx, "ERR command failed");
int rc = ValkeyModule_ReplyWithCallReply(ctx, reply);
ValkeyModule_FreeCallReply(reply);
return rc;
}
```
### With `ValkeyModule_CallArgv` and `onRespAvailable` (new API —
pass-through)
```c
static int forwardRawReply(void *ctx, ValkeyModuleCtx *mctx,
const char *proto, size_t proto_len) {
(void)ctx;
/* Write the raw RESP bytes directly to the outer client — no parse, no
* re-serialize, no intermediate allocation. */
ValkeyModule_ReplyRaw(mctx, proto, proto_len);
return 0; /* stop — do not invoke per-type callbacks */
}
static ValkeyModuleReplyHandlers passthrough_handlers = {
.onRespAvailable = forwardRawReply,
};
int ProxyCommand(ValkeyModuleCtx *ctx, ValkeyModuleString **argv, int argc) {
if (argc < 2) return ValkeyModule_WrongArity(ctx);
/* VM_CallArgv borrows argv directly (no copy, no refcount changes).
* The reply is written straight into the outer client's output buffer
* through the onRespAvailable callback above. */
if (ValkeyModule_CallArgv(ctx, argv + 1, argc - 1,
VALKEYMODULE_CALL_ARGV_RESP_AUTO,
&passthrough_handlers,
NULL) == VALKEYMODULE_ERR) {
return ValkeyModule_ReplyWithError(ctx, "ERR command failed");
}
return VALKEYMODULE_OK;
}
```
Use `VALKEYMODULE_CALL_ARGV_RESP_AUTO` to ensure the inner command
replies in the same protocol version as the outer client (RESP2 or
RESP3), so the raw bytes can be forwarded without modification.
---
## Internal changes
### `argv_borrowed` client flag
A new `argv_borrowed` bit is added to the client flags struct. When
`VM_CallArgv` runs the inner command on a temporary client, it sets this
flag to signal that the temp client does not own the argv array.
`freeClientArgv` and `freeClientOriginalArgv` respect the flag and skip
freeing. `resetClient` clears it unconditionally after each command.
Several string commands (`SET`, `MSET`, `APPEND`, …) that previously
encoded or reused argv objects in-place are guarded so they do not
mutate argv elements they do not own.
A debug-only assertion (enabled when `enable-debug-assert yes` is set in
the config) verifies at runtime that no command modifies the argv array
or decrements any element's ref-count when `argv_borrowed` is set.
---------
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
Signed-off-by: Ricardo Dias <rjd15372@gmail.com>
This commit introduces a new configuration option `cluster-config-save-behavior`
that controls how the cluster handles nodes.conf file save failures.
The option supports two modes:
- `sync` (default): Synchronously save the config file. If the save fails,
the process exits. This maintains backward compatibility with the old
behavior (before 9.1).
- `best-effort`: Synchronously save the config file. If the save fails,
only log a warning and continue running. This allows the node to survive
disk failures (e.g., disk full, read-only filesystem) without exiting,
giving administrators time to address the issue.
Note that this modifies the behavior of #1032, whereas #1032 was "best-effort",
we have now introduced a configuration option that defaults to "sync."
See #1032 discussion for more details.
Background:
When a disk becomes read-only or full, any cluster metadata change would
trigger a nodes.conf save attempt. With the old behavior, the node would
immediately exit via clusterSaveConfigOrDie(), potentially causing multiple
nodes on the same machine to crash simultaneously, leading to cluster
unavailability.
The new `best-effort` mode addresses this by allowing nodes to continue
operating even when disk writes fail. This is particularly useful in cloud
environments where disk failures are more common due to scale.
Note: Startup-time config saves (in clusterInit and verifyClusterConfigWithData)
still use clusterSaveConfigOrDie() since disk issues at startup should cause
immediate failure.
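As a sketch, the new mode would be enabled in valkey.conf like this (placement within the config file is illustrative):

```
# Keep the node running if writing nodes.conf fails (e.g. disk full,
# read-only filesystem); the default "sync" preserves the old
# exit-on-failure behavior.
cluster-config-save-behavior best-effort
```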
Signed-off-by: Binbin <binloveplay1314@qq.com>
Previously, `redactClientCommandArgument()` forced a full copy of
`c->argv` into `original_argv` immediately, then replaced the sensitive
slot with `shared.redacted`. This meant any call to redact an argument
triggered an eager argv backup even when no rewriting was needed.
The new approach stores the indices of arguments to be redacted in a
small dynamic array (`redact_args`) on the client struct. Redaction is
applied lazily when the argv is actually consumed (e.g. by the command
log), avoiding the unnecessary argv copy on the hot path.
This change is required for the implementation of the
`ValkeyModule_CallArgv` module API function (#3122), because with
CallArgv the caller owns the `argv` array, which must not be modified by
the redact function.
---------
Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
Two changes to tests/unit/cluster/faster-failover.tcl:
1. FAIL detection timeout: `wait_for_condition 1000 10` → `1000 50`
(10s → 50s)
2. psync_max_retries: 1200 → 2400 normal (120s → 240s),
6000 → 12000 valgrind (600s → 1200s)
The test `The best replica can initiate an election immediately in an
automatic failover` in `tests/unit/cluster/faster-failover.tcl` has been
flaky since it was introduced on March 27, 2026 by #2227.
**Frequency:** 8 out of 15 days (Mar 27 – Apr 8), across valgrind,
sanitizer, and slow CI runners.
**Common errors:**
- `log message of "Successful partial resynchronization with primary"
not found` (timeout waiting for psync)
- `expected pattern found in srv -N log file: *best ranked replica*`
(timeout waiting for FAIL propagation)
The test spins up a 12-node cluster (5 primaries + 7 replicas), pauses
nodes, and waits for FAIL detection to propagate across all nodes before
failover + partial resync.
A previous fix attempt #3424 increased the psync timeout from 50s to
120s (600s valgrind), which reduced frequency but did not eliminate it.
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
In rdbLoadObject, when sdstrynewlen succeeds and hashtableAdd succeeds,
but lpSafeToAdd fails, field will be double-freed.
However, lpSafeToAdd only fails when len + add > LISTPACK_MAX_SAFETY_SIZE (1GB),
so this can rarely happen in practice and the impact is trivial.
Signed-off-by: charsyam <charsyam@naver.com>
rewriteConfigFormatMemory() and rewriteConfigBytesOption() used long long
parameters to represent memory bytes, but configs like maxmemory are stored
as unsigned long long and allow values up to ULLONG_MAX.
When the value exceeds LLONG_MAX (e.g. 9223372036854775808), it overflows to
negative when passed as long long, causing config rewrite to produce incorrect
output like "maxmemory -8589934592gb".
Change both functions to use unsigned long long, matching the actual semantics
of memory byte values. This is consistent with how numericConfigGet() already
handles MEMORY_CONFIG using ull2string().
Some memory-related configs are signed and can be negative:
1. maxmemory-clients is a PERCENT_CONFIG
2. slot-migration-max-failover-repl-bytes is a SIGNED_MEMORY_CONFIG (after #3443)
None of them were affected:
```
static void numericConfigRewrite(standardConfig *config, const char *name, struct rewriteConfigState *state) {
...
if (config->data.numeric.flags & PERCENT_CONFIG && value < 0) {
rewriteConfigPercentOption(state, name, -value, config->data.numeric.default_value);
} else if (config->data.numeric.flags & SIGNED_MEMORY_CONFIG && value < 0) {
rewriteConfigNumericalOption(state, name, value, config->data.numeric.default_value);
} else if (config->data.numeric.flags & MEMORY_CONFIG) {
rewriteConfigBytesOption(state, name, value, config->data.numeric.default_value);
...
```
Signed-off-by: Binbin <binloveplay1314@qq.com>