13778 Commits

Author SHA1 Message Date
Ran Shidlansik fea0b4064c Fix invalid memory access in RESTORE with malformed zipmap (CVE-2026-25243) (#3619)
Root cause: zipmapValidateIntegrity() and zipmapNext() use different
methods to calculate pointer advancement for length-encoded fields.
Validation reads the actual encoded size via
zipmapGetEncodedLengthSize() (which returns 5 for the 0xFE prefix), but
zipmapRawKeyLength() (used by zipmapNext during hash conversion)
recalculates via zipmapEncodeLength() which returns 1 for decoded
lengths < 254. A crafted zipmap with an overlong 5-byte encoding for a
small length passes validation but causes a 4-byte pointer mismatch in
zipmapNext(), leading to heap buffer over-reads during the
zipmap-to-listpack conversion.

Fix: add sanity checks in zipmapValidateIntegrity() to reject entries
where the decoded length < ZIPMAP_BIGLEN (254) but the encoding uses
more than 1 byte. This is applied to both field-name and value lengths.

Test: added a regression test in tests/unit/dump.tcl that crafts a
RESTORE payload with a 2-entry zipmap where the first field uses an
overlong 5-byte length encoding for value 3. Post-patch, this is cleanly
rejected by zipmapValidateIntegrity(). Pre-patch, the misaligned
zipmapNext() reads garbage (confirmed via server log: "Hash zipmap with
dup elements, or big length (0)") which also produces an error, so the
test serves as a defense-in-depth regression anchor rather than a strict
pass/fail differentiator. The actual heap over-read is detectable with
AddressSanitizer builds.

Signed-off-by: ikolomi <ikolomin@amazon.com>
Co-authored-by: ikolomi <ikolomin@amazon.com>
8.0.8
2026-05-05 16:48:38 -07:00
Ran Shidlansik c7c92db43b Delay full sync during yielding Lua scripts to prevent use-after-free (CVE-2026-23631) (#3625)
During a full sync, the functions/scripting engine is freed right before
loading the RDB from the primary. If a Lua script is still running and
yielding via the long-command mechanism at that moment, the freed engine
can be accessed when the script resumes, causing a use-after-free.

Add a guard at the top of replicaReceiveRDBFromPrimaryToMemory() to
check isInsideYieldingLongCommand() and return early, deferring the sync
processing until the script completes.

No validating test was added because the vulnerability is a race
condition between a yielding Lua script and a replication event handler,
which cannot be reliably triggered in a deterministic Tcl test.

Signed-off-by: ikolomi <ikolomin@amazon.com>
Co-authored-by: ikolomi <ikolomin@amazon.com>
2026-05-05 15:47:16 -07:00
sananes 797c626046 Fix SIGSEGV in VM_GetLRU/SetLRU/GetLFU/SetLFU on NULL key (#3610)
## Fix SIGSEGV in VM_GetLRU, VM_SetLRU, VM_GetLFU, VM_SetLFU on NULL key

### Description

`VM_GetLRU`, `VM_SetLRU`, `VM_GetLFU`, and `VM_SetLFU` crash with
SIGSEGV when passed a NULL `ValkeyModuleKey` pointer. This happens
because all four functions dereference `key->value` without first
checking if `key` itself is NULL.

When a module opens a nonexistent key in `VALKEYMODULE_READ` mode,
`VM_OpenKey` returns NULL. If a module passes that NULL pointer into any
of these functions, the server crashes.

### Reproduction

```
valkey-server --loadmodule tests/modules/misc.so
valkey-cli test.getlru nonexistent_key
# Server crashes: SIGSEGV (signal 11)
```

### Fix

**`src/module.c`** — Add a `!key` guard before dereferencing
`key->value` in all four functions:

```c
// Before:
if (!key->value) return VALKEYMODULE_ERR;

// After:
if (!key || !key->value) return VALKEYMODULE_ERR;
```

**`tests/modules/misc.c`** — Add early return after
`open_key_or_reply()` in `test_getlru`, `test_setlru`, `test_getlfu`,
and `test_setlfu`. The helper already sends the error reply to the
client when the key is not found, so the command handler just needs to
stop processing:

```c
ValkeyModuleKey *key = open_key_or_reply(ctx, argv[1], VALKEYMODULE_READ|VALKEYMODULE_OPEN_KEY_NOTOUCH);
if (!key) return VALKEYMODULE_OK;
```

### After fix

```
valkey-cli test.getlru nonexistent_key
(error) key not found
# Server stays up
```

Signed-off-by: Yaron Sananes <yaron.sananes@gmail.com>
2026-05-04 23:17:39 +03:00
Brad Bebee 8891441ab9 Fix checkPrefixCollisionsOrReply returning non-zero on self-overlap (#3583) 2026-05-03 11:44:09 -07:00
Sarthak Aggarwal f2f4e5dbfc Run ASan Tests on run-extra-tests label (#3512)
It's important to enable ASan on the run-extra-tests label so we can
catch some of the bugs in PRs before they are merged into unstable.


Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
2026-05-01 12:10:29 +08:00
Rain Valentine cea9354b56 Big Endian: add daily workflow UT job and fix UTs (#3330)
Big endian support on Valkey is "best effort" and not guaranteed, but we
haven't been doing any regular testing at all afaik. This PR adds a job
to the daily workflow to run UTs on an emulated big endian platform.
Integration tests failed excessively because of how slow emulation is.

I fixed several problems with tests and improved UT coverage of key
points where endian byte order matters - and fwiw I didn't find any
bugs. I think the main coverage gap remaining after this is RDB
serialization (maybe little endian <-> big endian round trips?)

There are a couple of lines of endian-specific code for #3166, and this
change can test them.

Signed-off-by: Rain Valentine <rsg000@gmail.com>
2026-05-01 12:09:23 +08:00
FAN PEI 46d37e4d5e Fix off-by-one boundary in lpEncodeBacklen() for 3 values (#3601)
The function lpEncodeBacklen() uses `<= 127` for the 1-byte case but `<
16383`, `< 2097151`, and `< 268435455` for the subsequent cases. This
means the exact values 16383, 2097151, and 268435455 (i.e. 2^14-1,
2^21-1, 2^28-1) unnecessarily use one extra byte than needed:

- `l < 16383` → `16383` (2^14-1) uses 3 bytes instead of 2
- `l < 2097151` → `2097151` (2^21-1) uses 4 bytes instead of 3
- `l < 268435455` → `268435455` (2^28-1) uses 5 bytes instead of 4

The decoding side (`lpDecodeBacklen`) is unaffected since it parses
continuation bits continuously without discrete range checks.

This is not a correctness issue and has no impact on data integrity,
since encoding and decoding use the same boundaries, but it wastes up
to 1 byte per affected entry.

Signed-off-by: fanpei91 <fanpei91@gmail.com>
2026-05-01 12:06:38 +08:00
Saurabh K 54bdf5737b Handle NULL pointer in streamTrim listpack delta calculation (#3591)
When XTRIM marks the last entry in a listpack node as deleted, lpNext()
returns NULL after the lp-count field (EOF). The delta calculation (p -
lp) on a NULL pointer is undefined behavior and produces a garbage
pointer, corrupting the listpack. A subsequent XREAD hitting the
corrupted node triggers the lpValidateNext assertion failure and crashes
the server.

Guard the delta calculation with a NULL check so the while(p) loop
terminates naturally when the last entry is reached.

Fixes #3569

Signed-off-by: Saurabh Kher <saurabh@amazon.com>
Co-authored-by: Saurabh Kher <saurabh@amazon.com>
2026-05-01 12:05:01 +08:00
chenshi cba05103de Fix: prevent NULL dereference crash in connectSlotExportJob when target node disappears (#3596)
### Summary

This PR fixes a NULL pointer dereference (SIGSEGV) in
`connectSlotExportJob()` (`src/cluster_migrateslots.c`) that can crash a
Valkey cluster node, causing a denial-of-service condition.

### Root Cause

When `CLUSTER MIGRATESLOTS` is issued, a migration job is created with
state `SLOT_EXPORT_CONNECTING`. On the next `clusterCron()` tick,
`proceedWithSlotMigration()` calls `connectSlotExportJob()`, which looks
up the target node via `clusterLookupNode()`.

`clusterLookupNode()` can legitimately return `NULL` — for example, if
the target node is removed from the cluster (e.g. via `CLUSTER FORGET`)
between the time the migration job is created and the time the cron
fires. This is a realistic race condition in any cluster topology change
scenario.

The return value was **never checked**, so the subsequent call to
`getNodeDefaultReplicationPort(n)` immediately dereferences the NULL
pointer, crashing the process:

```c
// Before fix — vulnerable
clusterNode *n = clusterLookupNode(job->target_node_name, CLUSTER_NAMELEN);
int port = getNodeDefaultReplicationPort(n);  // SIGSEGV if n == NULL
serverLog(..., n->ip, port);                  // second dereference
```

Signed-off-by: chenshi5012 <chenshi5012@163.com>
2026-04-30 16:25:58 -07:00
Jeff Duffy 72fc5b14b1 Fix compilation error: replace deprecated je_calloc with zcalloc_num (#3592)
Fixes #1905

## Summary
The direct use of `je_calloc` in `src/allocator_defrag.c` causes
compilation failures on systems (e.g., Arch Linux with GCC 14.2.1) where
`calloc` is marked as deprecated and `-Werror=deprecated` is enabled.

## Changes
Replace the two `je_calloc` calls in `allocatorDefragInit()` with
`zcalloc_num`, which is the proper Valkey allocation wrapper that
provides the same semantics (num × size with zero-fill) without directly
invoking the deprecated `calloc` symbol.

## Testing
- Build compiles cleanly
- Integration tests pass (unit/memefficiency, defrag, unit/other — 51
passed, 0 failed)

Signed-off-by: jaduffy <jaduffy@amazon.com>
2026-04-30 15:59:19 -04:00
Jacob Murphy 81639e3975 fix: validate key count before allocating result in keyspec (#3598)
In `getKeysUsingKeySpecs`, when extracting keys based on the
`KSPEC_FK_KEYNUM` spec (as in the `EVAL` command), the server read the
number of keys from the arguments and calculated the expected end index.

However, it called `getKeysPrepareResult` to allocate memory for the
result array before validating whether `last` was within the bounds of
the actual arguments provided.

If a client sent a command with a huge declared number of keys (e.g.,
`COMMAND GETKEYS EVAL "return 1" 2147483647 key1`), the server would
allocate a massive amount of memory. Since running with
`vm.overcommit_memory` enabled is recommended, this allocation would NOT
normally trigger an OOM (we never write to the array, so no physical
memory is committed), but with overcommit disabled it can.

You can reproduce it with:

```
$ prlimit --as=1073741824 src/valkey-server --save ""
...
384270:M 30 Apr 2026 04:27:24.456 * Ready to accept connections tcp

...
<in valkey-cli>
127.0.0.1:6379> command getkeys eval "return 1" 2147483647 key1

...
<in server log>
384270:M 30 Apr 2026 04:29:26.950 # Out Of Memory allocating 17179869176 bytes!
```

## Solution

* Moved the bounds check `if (last >= argc || last < first || first >=
argc)` so it executes before the call to `getKeysPrepareResult`,
preventing the large allocation on invalid input.
* To further catch issues like this, protected against integer overflow
during the calculation of `last` by using a `long long` temporary
variable. If it exceeds INT_MAX or falls below INT_MIN, the spec is
marked invalid immediately.

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
2026-04-30 11:21:43 -07:00
abmathur-ie 7e2a2f7c4a fix(cluster): Remove per-call srand in clusterManagerNodePrimaryRandom (#3586)
clusterManagerNodePrimaryRandom() called srand(time(NULL)) on every
invocation, then immediately rand() % primary_count. When called in a
tight loop for uncovered slots, all calls within the same wall-clock
second produce the identical seed, causing every uncovered slot to be
assigned to the same primary node.

Remove the srand() call since the PRNG is already seeded at startup
(srand(time(NULL) ^ getpid()) at line 9838). This allows rand() to
advance its state across calls, distributing uncovered slots randomly
across available primaries.

---------

Signed-off-by: Abhishek Mathur <matshek@amazon.com>
Co-authored-by: Abhishek Mathur <matshek@amazon.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
2026-04-30 18:33:34 +03:00
Ping Xie 5b7ac66918 Fix verify-provenance action pin (#3594) 2026-04-29 21:30:40 -07:00
Ping Xie 98724dda08 Update provenance action to refine layer2 exemption policies (#3593) 2026-04-29 17:06:57 -07:00
abmathur-ie 678a06d216 Set errno on EOF in syncRead and propagate it in logs (#3580)
When read() returns 0 (EOF/connection closed) in syncRead(), errno is
not set by POSIX, so it retains a stale value (typically 0). This causes
callers using connGetLastError() to log strerror(0) which is the
misleading string "Success".

Set errno = ECONNRESET on EOF in syncRead(), matching the existing
pattern used for the timeout case (errno = ETIMEDOUT).

Also set conn->last_errno = errno in connSocketSyncWrite,
connSocketSyncRead, and connSocketSyncReadLine wrappers, matching the
pattern used by their async counterparts connSocketWrite and
connSocketRead.

After this fix, replica logs will show:
  "I/O error reading bulk count from PRIMARY: Connection reset by peer"
instead of the misleading:
  "I/O error reading bulk count from PRIMARY: Success"

---------

Signed-off-by: Abhishek Mathur <matshek@amazon.com>
Signed-off-by: djk1027 <djk9510271@gmail.com>
Co-authored-by: Abhishek Mathur <matshek@amazon.com>
Co-authored-by: Daejun Kim <djk9510271@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
2026-04-29 14:13:40 -07:00
bandalgomsu 7817ca8a73 Fix GEOSEARCH BYPOLYGON leak on invalid COUNT (#3568)
Free BYPOLYGON points before returning from invalid COUNT parsing paths
in GEOSEARCH/GEOSEARCHSTORE.

Closes #3567

---------

Signed-off-by: Su Ko <rhtn1128@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2026-04-29 13:39:12 -04:00
Jim Brunner ad404cd266 fix compile warning in util.c (#3585)
Address this compile warning:
```c
    CC util.o
util.c:638:1: warning: ‘no_sanitize’ attribute directive ignored [-Wattributes]
 __attribute__((no_sanitize_address, no_sanitize("thread"), used)) static int (*string2ll_resolver(void))(const char *, size_t, long long *) {
 ^~~~~~~~~~~~~
```
Addresses portability concerns around these attributes.

---------

Signed-off-by: Jim Brunner <brunnerj@amazon.com>
2026-04-29 09:31:04 -07:00
VoletiRam 39036c7c06 Add structured datasets loading capability in valkey benchmark (#2823)
## Background

Add structured datasets loading capability. Support CSV and TSV file
formats. Use `__field:fieldname__` placeholders to replace the
corresponding fields from the dataset file. Support natural content size
of varying length. Allow mixed placeholder usage combining dataset
fields with random generators. Enable automatic field discovery from
CSV/TSV headers. Use `--maxdocs` to limit the dataset loading.

Rather than modifying the existing placeholder system, we detect field
placeholders and switch to a separate code path that builds commands
from scratch using `valkeyFormatCommandArgv()`. This ensures:

- Zero impact on existing functionality
- Full support for variable-size content
- Thread-safe atomic record iteration
- Compatible with pipelining and threading modes

__Usage examples__

```sh
# Strings - Simple key-value with dataset fields
./valkey-benchmark --dataset products.csv -n 10000 SET product:__rand_int__ "__field:name__"

# Sets - Unique collections from dataset
./valkey-benchmark --dataset categories.csv -n 10000 SADD tags:__rand_int__ "__field:category__"

# CSV dataset with document limit
./valkey-benchmark --dataset wiki.csv --maxdocs 100000 -n 50000 HSET doc:__rand_int__ title "__field:title__" body "__field:abstract__"

# Mixed placeholders (dataset + random)
./valkey-benchmark --dataset terms.csv -r 5000000 -n 50000 HSET search:__rand_int__ term "__field:term__" score __rand_1st__
```

__Full-Text Search Benchmarking__

```sh
# Search hit scenarios (existing terms)
./valkey-benchmark --dataset search_terms.csv -n 50000 FT.SEARCH rd0 "__field:term__"

# Search miss scenarios (non-existent terms)
./valkey-benchmark --dataset miss_terms.csv -n 50000 FT.SEARCH rd0 "__field:term__"

# Query variations
./valkey-benchmark --dataset search_terms.csv -n 50000 FT.SEARCH rd0 "@title:__field:term__"
./valkey-benchmark --dataset search_terms.csv -n 50000 FT.SEARCH rd0 "__field:term__*"
```

__Benchmark Results__


Test environment:
__Instance:__ AWS c7i.16xlarge, 64 vCPU

Test Dataset: 5M+ Wikipedia XML documents, 5.8GB memory

| Configuration | Throughput | CPU Usage | Wall Time | Memory Peak |
|---------------|------------|-----------|-----------|-------------|
| Single-threaded, P1 | 93,295 RPS | 99% | 71.4s | 5.8GB |
| Multi-threaded (10), P1 | 93,332 RPS | 137% | 71.5s | 5.8GB |
| Single-threaded, P10 | 274,499 RPS | 96% | 36.1s | 5.8GB |
| Multi-threaded (4), P10 | 344,589 RPS | 161% | 32.4s | 5.8GB |

---------

Signed-off-by: Ram Prasad Voleti <ramvolet@amazon.com>
Co-authored-by: Ram Prasad Voleti <ramvolet@amazon.com>
2026-04-29 09:18:37 -07:00
Jim Brunner 16ed690fec fix LTO compilation warning in eval (#3584)
I noticed this LTO compile warning in the eval code. The compiler seems
to be getting confused about an sds length even though it is checked
above. Just added an assert to clarify.

The warning:
```c
    LINK valkey-server
eval.c: In function ‘evalExtractShebangFlags’:
eval.c:263:27: warning: argument 1 value ‘18446744073709551615’ exceeds maximum object size 9223372036854775807 [-Walloc-size-larger-than=]
             *out_engine = zcalloc(engine_name_len + 1);
                           ^
zmalloc.c:324:7: note: in a call to allocation function ‘valkey_calloc’ declared here
 void *zcalloc(size_t size) {
       ^
```

Signed-off-by: Jim Brunner <brunnerj@amazon.com>
2026-04-28 23:13:56 -07:00
Binbin bef46dacc1 Skip cluster resharding test under valgrind (#3574)
This change was introduced in #3382. This test is already very slow on
its own. Under valgrind it gets slow enough that the per-node restart
step lets primaries be marked FAIL and triggers failovers, after which
"Verify slaves consistency" no longer holds since it assumes the original
topology.

It was never run under valgrind before and exercises nothing valgrind
meaningfully covers, so just tag it valgrind:skip.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2026-04-29 10:27:04 +08:00
Daejun Kim 8091c6c10a Remove redundant count division in genericHgetallCommand (#3573)
The argument `count /= 2` modifies `count` as a side effect, and the
following `count /= 2` divides it again unnecessarily.
Since `count` is not used after this point, fix it by using `count / 2`
without the side effect and remove the redundant second assignment.

Signed-off-by: djk1027 <djk9510271@gmail.com>
2026-04-28 11:43:56 +03:00
Jun Yeong Kim ff80b2d1dc Migrate the remaining cluster tests to the new framework and remove legacy files (#2297) (#3382)
Migrated the remaining cluster tests to tests/unit/cluster/ to use the same
framework for all cluster tests. Cleaned up the obsolete cluster test framework
files and updated the CI workflows to use the new unified test runner.

Changes:
  Moved and mapped 6 test files:
  - 03-failover-loop.tcl → Merged into existing failover.tcl
  - 04-resharding.tcl → resharding.tcl
  - 12-replica-migration-2.tcl + 12.1-replica-migration-3.tcl →
  replica-migration-slow.tcl
  - 07-replica-migration.tcl → Merged into existing replica-migration.tcl
  - 28-cluster-shards.tcl → Merged into existing cluster-shards.tcl

Other changes:
  - Converted old framework APIs (e.g., K, RI) to new framework APIs (e.g., R, srv)
  - Added process_is_alive check in cluster_util.tcl to fix an exception in
  failover tests caused by executing ps on dead processes
  - Heavy tests (resharding, replica-migration-slow) marked with slow tag and
  wrapped in run_solo to prevent resource contention in sanitizer environments
  - replica-migration-slow marked with valgrind:skip tag since it is very slow
  - Removed the entire tests/cluster/ directory including run.tcl, cluster.tcl,
  includes/, and helpers/
  - Kept runtest-cluster as a wrapper script (exec ./runtest --cluster "$@")
  - Removed ./runtest-cluster calls from .github/workflows/daily.yml as cluster
  tests are now included in ./runtest

Closes #2297.

Signed-off-by: Jun Yeong Kim <junyeonggim5@gmail.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2026-04-27 17:31:37 +08:00
eifrah-aws 6dbb7f81a9 Fix remove cached eval scripts on engine unregister (#3503)
Remove eval script cache entries that belong to a scripting engine when
that engine is unregistered. This prevents the eval cache from retaining
dangling engine pointers and keeps the tracked script memory in sync
after engine shutdown.

The scripting engine unregister path now invokes a new eval cleanup
helper, which scans the cached scripts, drops matching entries from the
LRU list and dictionary, and adjusts cache memory accounting accordingly.

Signed-off-by: Eran Ifrah <eifrah@amazon.com>
2026-04-27 11:29:20 +08:00
Jacob Murphy 28ecbd204f Ensure client slot migration pointer is cleared during reset (#3554)
If not cleared, the job may no longer be valid by the time the client
goes to cleanup. This dangling reference could cause a crash if you set
slot-migration-log-max-len to 0 and are very unlucky.

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
2026-04-27 11:05:35 +08:00
Binbin a3e44a55d3 Fix lua-enable-insecure-api default value cannot be changed to yes (#3548)
The default value of lua-enable-insecure-api cannot be safely changed
from no to yes due to two issues:

1. In createEngineContext(), lua_enable_insecure_api was hardcoded to 0
before initializing Lua states, so deprecated APIs (newproxy, setfenv,
   getfenv) were never registered in the global table regardless of the
   actual config value. Once the global table is locked, the config
   change has no effect.

2. lua_insecure_api_current was initialized to 0 (struct zero-init) and
   never synced with the final config value. If the default was changed
   to yes(1), a subsequent CONFIG SET no would see both values as 0 and
   skip the evalReset() call in updateLuaEnableInsecureApi().

Fix by reading the real config via isLuaInsecureAPIEnabled() in
createEngineContext() before Lua state initialization, and syncing
lua_insecure_api_current after all config sources (default, config file,
command-line args) are applied.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2026-04-27 11:04:14 +08:00
Binbin ac9ca9de3d Fix rdmaServer leaks when create listen cm id error (#3557)
When creating the listen cm id fails, we should jump to the `error`
label so the resources are freed:
```
error:
    if (listen_cmid) rdma_destroy_id(listen_cmid);
    if (listen_channel) rdma_destroy_event_channel(listen_channel);
    ret = ANET_ERR;

end:
    freeaddrinfo(servinfo);
    return ret;
}
```

Signed-off-by: Binbin <binloveplay1314@qq.com>
2026-04-27 11:01:52 +08:00
charsyam bb88665578 hashtable: fix dismissHashtable madvise size (#3533)
The bug was in dismissHashtable(), which computes the size passed to
zmadvise_dontneed() for the top-level hashtable tables.

ht->tables[i] points to a contiguous array of bucket objects, but the
code used sizeof(bucket *) instead of sizeof(bucket) when calculating
the length. That means it treated the allocation like an array of
pointers rather than an array of buckets.

As a result, the advised range was much smaller than the actual table
allocation. On 64-bit builds, bucket is 64 bytes while bucket * is 8
bytes, so only about one eighth of the table was covered. This does not
usually break correctness, but it defeats the purpose of the function:
after a fork, we want to tell the kernel that the hashtable pages are no
longer needed so we reduce copy-on-write overhead. With the wrong size,
most of the table memory was never included in that hint.

The fix is to use sizeof(bucket) so the full top-level bucket array is
passed to zmadvise_dontneed().

Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
2026-04-26 19:45:50 -07:00
Hanxi Zhang edc0d26ada Strip LTO flags from static Lua module build (#3555)
### Summary

The daily CI sanitizer jobs with clang are failing during the build
step.
When the static Lua module is built with `-flto`, the `.o` files contain
LLVM bitcode that gets archived into `libvalkeylua.a`. The system linker
cannot read this bitcode, causing build failures:

`/usr/bin/ld:
/home/runner/work/valkey/valkey/src/modules/lua/libvalkeylua.a: member
/home/runner/work/valkey/valkey/src/modules/lua/libvalkeylua.a(debug_lua.o)
in archive is not an object`

The previous fix (#3546) pinned clang to version 17, but this was
insufficient: the issue is not just a version mismatch but that the
system linker fundamentally cannot read LTO bitcode from `.a` archives.

Example failure:
https://github.com/valkey-io/valkey/actions/runs/24865821147/job/72801509768

### Fix

Strip LTO flags from OPTIMIZATION in the Lua module Makefile using
`override`.

Tested: https://github.com/hanxizh9910/valkey/actions/runs/24913834442

---------

Signed-off-by: Hanxi Zhang <hanxizh@amazon.com>
2026-04-26 19:28:21 -07:00
Ping Xie c861184762 Implement Provenance Guard (#3109)
This PR bootstraps Valkey's provenance guard integration.

The provenance guard is a content-based similarity detection system that helps maintain proper code provenance by comparing incoming PR changes against fingerprint databases built from Redis commits and PRs. The matching logic now lives in the external `valkey-io/verify-provenance` action repository; this PR wires Valkey to that action and seeds the required database branch.

Key features:
  * Content-based detection: Uses normalized diff fingerprints and fuzzy matching to detect similar changes, including cases where files have moved or been refactored.
  * Externalized action logic: The check and refresh implementation is maintained in `valkey-io/verify-provenance` and is pinned by exact commit SHA from Valkey workflows.
  * Provenance Guard workflow: Runs on PR activity to check incoming changes against the provenance databases and report potential matches.
  * Daily Refresh workflow: Runs daily to refresh PR fingerprints and commits updated data back to `verify-provenance-db`.
  * Dedicated DB branch: Stores provenance databases on the orphan `verify-provenance-db` branch, separate from Valkey source code.
  * Privacy-first storage: Stores compressed non-reversible fingerprints, not source code.

The initial `verify-provenance-db` branch has been bootstrapped with fingerprints of Redis commits and PRs.

---------

Signed-off-by: Ping Xie <pingxie@outlook.com>
2026-04-26 14:36:18 -07:00
Rain Valentine a7d495352a extra UT
Signed-off-by: Rain Valentine <rsg000@gmail.com>
2026-04-24 15:40:54 -07:00
Rain Valentine 54980ece3a hashtable iterator safety: invalidate on exhaustion
Signed-off-by: Rain Valentine <rsg000@gmail.com>
2026-04-24 15:40:54 -07:00
Sarthak Aggarwal d2db0c268c Fix module commandresult event cleanup during unsubscribe and module unload (#3545)
This follows up on the commandresult API work and fixes cleanup around
unsubscribe and module unload.

The main issue was that command-result event listeners could leave stale
state behind. On unload, we removed the listeners themselves but didn’t
fully update the fast-path listener counters. Separately, unsubscribing
with a NULL callback could behave badly if the listener wasn’t present
anymore. In practice, that meant later commands could still walk into
command-result event handling after the module was supposed to be
cleaned up.

Failed in Daily as well yesterday:
https://github.com/valkey-io/valkey/actions/runs/24753491944/job/72421581610#step:10:852
Related Failures:
https://github.com/valkey-io/valkey/pull/2936#issuecomment-4290490199

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
2026-04-23 19:10:20 -07:00
Harkrishn Patro cb2cfdd4e0 Revert "Pin clang to version 17 in sanitizer CI jobs" (#3556)
Reverts valkey-io/valkey#3546

This didn't help fix the build issue. Follow up PR is performed on
https://github.com/valkey-io/valkey/pull/3555

Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
2026-04-23 19:01:52 -07:00
Hanxi Zhang 7db5b70737 Pin clang to version 17 in sanitizer CI jobs (#3546)
### Analysis
The daily CI sanitizer jobs with clang are failing during the build
step.

The `ubuntu-latest` runner now has clang 18, but the LLVM gold plugin
is still version 17. When the static Lua module is built with `-flto`,
the `.o` files contain LLVM 18 bitcode that the gold plugin (v17) cannot
read:
`bfd plugin: LLVM gold plugin has failed to create LTO module: Unknown
attribute kind (91) (Producer: 'LLVM18.1.3' Reader: 'LLVM 17.0.6')
`
Example failure:
https://github.com/valkey-io/valkey/actions/runs/24753491944/job/72421581512

### Fix
Pin the sanitizer jobs to `clang-17` so the compiler and gold plugin
versions match.
Tested(successfully built):
https://github.com/hanxizh9910/valkey/actions/runs/24859845008

### Note
If `clang-17` is removed from the `ubuntu-latest` image in the future,
we may need to add an explicit install step.
Signed-off-by: Hanxi Zhang <hanxizh@amazon.com>
2026-04-23 16:14:24 -07:00
Sarthak Aggarwal 5abf79e0e3 Add zmalloc_aligned() and fix SPMC queue buffer alignment (#3504)
The SPMC queue from #3324 needs each `spmcCell` to be cache-line
aligned, but plain `zmalloc()` does not guarantee that in all build
configurations.

This change introduces `zmalloc_cache_aligned()` and uses it for the
SPMC queue buffer allocation in `spmcInit()`.

Failing CI: https://github.com/valkey-io/valkey/actions/runs/24374139344

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
2026-04-23 11:46:22 -07:00
charsyam 9709843446 Optimize HGETDEL to pause auto shrink when deleting multiple items (#3535)
Match HGETDEL with the existing batch-delete pattern used by HDEL.

HDEL already pauses hashtable auto-shrink while deleting multiple fields
so shrink evaluation is deferred until the batch completes. HGETDEL was
missing the same optimization even though it also deletes fields in a loop.

Pause auto-shrink for hashtable-encoded hashes before the HGETDEL delete
loop and resume it once afterwards. This preserves observable behavior
and reduces redundant shrink work for multi-field deletes.

Same as #3144.

Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
2026-04-23 12:56:52 +08:00
Madelyn Olson 651c40a89e Fix FD leak in connSocketBlockingConnect on timeout (#3541)
## Summary
Fix a file descriptor leak in `connSocketBlockingConnect()` when
`aeWait()` times out.

## Bug
When `anetTcpNonBlockConnect()` succeeds but `aeWait()` times out (e.g.,
MIGRATE to an unreachable host), the fd is leaked because it was never
assigned to `conn->fd`. The caller's `connClose()` checks `conn->fd !=
-1` and skips cleanup.

## Fix
Assign `conn->fd = fd` immediately after `anetTcpNonBlockConnect()`
succeeds, before `aeWait()`. This way the caller's normal `connClose()`
cleanup path handles the fd on any error, which is consistent with how
the rest of the connection lifecycle works.

TLS connections also benefit since `connTLSBlockingConnect` delegates to
this function for the TCP layer.

## Reproducer
```
valkey-cli SET key hello
# Repeat against unreachable host:
for i in $(seq 1 30); do valkey-cli MIGRATE 192.0.2.1 6379 key 0 500; done
# Check: /proc/<pid>/fd shows 30 leaked socket fds
```

*This issue was generated by AI but verified, with love, by a human.*

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2026-04-23 12:34:34 +08:00
Binbin c403eecd5b Fix double free in stream consumer PEL loading with corrupt RDB data (#3498)
There is a double free issue in the code. The error handling path called
both decrRefCount(o) and streamFreeNACK(nack), but the nack was obtained
from cgroup->pel via raxFind and is still referenced there. decrRefCount(o)
frees it through freeStream -> streamFreeCG -> raxFreeWithCallback(cg->pel, zfree),
so the explicit streamFreeNACK(nack) causes a double free.

Remove the redundant streamFreeNACK(nack) call and add a regression
test with a crafted corrupt payload that triggers the duplicate consumer
PEL entry path.

This was introduced in 492d8d0961.

Signed-off-by: Binbin <binloveplay1314@qq.com>
2026-04-23 12:32:10 +08:00
Roshan Khatri 04896c1e6d Deflake many-slot-migration under valgrind (#3462)
## Problem
`Fix cluster` in `tests/unit/cluster/many-slot-migration.tcl` has been
timing out daily on valgrind jobs since April 3, 2026. The test runs 10
cluster nodes under valgrind, migrating 40,000 keys across 1,000 slots —
too much work for valgrind-instrumented builds.

The slowdown is caused by #3366 (dict→hashtable wrapper). Under `-O0`
(valgrind builds), the `static inline` wrappers become real function
calls that valgrind instruments, adding ~75% overhead to hot paths like
`dictSize`. This compounds across 10 valgrind processes over a 20-minute
migration test. No impact on production builds (`-O2` inlines everything).

## Fix
Scale the test workload down under valgrind: 10,000 keys / 250 slots
instead of 40,000 / 1,000. Normal runs are unchanged. Still exercises
the same cluster repair path.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: sarthakaggarwal97 <sarthakaggarwal97@users.noreply.github.com>
2026-04-23 12:31:32 +08:00
Deepak Nandihalli 3ab9d9797e Fix race condition during async client freeing with IO threading enabled (#3458)
When the close_asap flag is set, reset the bytes read to 0.

In readToQueryBuf, c->nread represents the number of bytes read. When
the close_asap flag is set, c->nread isn't reset to 0, which breaks the
invariant: IO threads then incorrectly think there is data to read,
resulting in a crash. This change fixes the bug.

To elaborate on the race possible:

1. Let's say that a IO thread job for reading query from a client got
enqueued as part of a epoll -
https://github.com/valkey-io/valkey/blob/unstable/src/io_threads.c#L417.
2. Later the client gets freed async and is marked as close_asap -
https://github.com/valkey-io/valkey/blob/unstable/src/networking.c#L2175
3. While processing the io_thread job for the client, it invokes
iothreadReadQueryFromClient. Here,
[`readToQueryBuf`](https://github.com/valkey-io/valkey/blob/unstable/src/networking.c#L6497)
returns as a no-op since the client is marked close-asap. Also, the
c->nread is not reset to 0 and could contain the value from a previous
read.
4. Later parseInputBuffer [gets
invoked](https://github.com/valkey-io/valkey/blob/unstable/src/networking.c#L6514).
5. The parseInputBuffer then [accesses the
query_buf](https://github.com/valkey-io/valkey/blob/unstable/src/networking.c#L3864).
The query_buf here would be NULL, having been reset by
resetSharedQueryBuf as part of beforeNextClient.

Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
2026-04-22 17:39:44 -07:00
Sarthak Aggarwal 03c2d4c2a2 Stabilize diskless no-drop replication test (#3511)
This deflakes all variants of `diskless replicas drop during rdb pipe`.

The main issue turned out to be that the test was too sensitive to
timing and log ordering under TLS, not that the core behavior was wrong.
This keeps the same five subcases (no, slow, fast, all, timeout) but
makes them much less CI-fragile.

CI passes 200 times:
https://github.com/sarthakaggarwal97/valkey/actions/runs/24547258515

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
Co-authored-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
2026-04-22 00:14:18 +02:00
Yang Zhao fc00f7be03 Fix VLA warning in io_threads (#3518)
https://github.com/valkey-io/valkey/pull/3324 introduced `BATCH_SIZE` as
a const int local variable and used it as an array bound. Clang 17
rejects this with:
```
io_threads.c:305:22: error: variable length array folded to constant array as an extension [-Werror,-Wgnu-folding-constant]
  305 |     void *batch_jobs[BATCH_SIZE];
      |                      ^~~~~~~~~~
1 error generated.
make[1]: *** [io_threads.o] Error 1
make: *** [all] Error 2

```
Old Clang versions do not emit this warning, maybe that is why the CI
passed. Fix by promoting `BATCH_SIZE` to a file-scope `#define`.

Signed-off-by: Yang Zhao <zymy701@gmail.com>
2026-04-21 13:07:58 -07:00
martinrvisser 6444717517 Module command result callback addition (#2936)
## Add Command Result Event Notifications for Modules

### Summary

1. Adds new server events `ValkeyModuleEvent_CommandResultSuccess` and
`ValkeyModuleEvent_CommandResultFailure` that can notify subscribed
modules after command execution. This enables modules to implement audit
logging, error monitoring, performance tracking, and observability
without modifying core server code.
2. Adds new server event `ValkeyModuleEvent_CommandResultACLDenied` for
commands rejected by ACL. Together with PR #2237 this covers auditing of
authentication and authorisation.

### Motivation

There is currently no module API to observe command outcomes after
execution or to capture ACL denied commands. Modules that need audit
logging or error monitoring have no mechanism to be notified when
commands succeed or fail, what arguments were used, how long they took,
or how many keys were modified. This feature fills that gap using the
existing `ValkeyModule_SubscribeToServerEvent()` infrastructure.

### API

#### Events

| Event | Description |
|---|---|
| `ValkeyModuleEvent_CommandResultSuccess` | Fired after a command
completes successfully |
| `ValkeyModuleEvent_CommandResultFailure` | Fired after a command
returns an error |
| `ValkeyModuleEvent_CommandResultACLDenied` | Fired after a command is
rejected by ACL |

These are separate events (not sub-events), so modules can, for example,
subscribe only to failures without incurring any callback overhead for
successful commands.

#### Event Data: `ValkeyModuleCommandResultInfo`

The `data` pointer passed to the callback can be cast to
`ValkeyModuleCommandResultInfo`:

```c
typedef struct ValkeyModuleCommandResultInfo {
    uint64_t version;           /* Version of this structure for ABI compat. */
    const char *command_name;   /* Full command name (e.g., "SET", "CLIENT|LIST"). */
    long long duration_us;      /* Execution duration in microseconds. */
    long long dirty;            /* Number of keys modified. */
    uint64_t client_id;         /* Client ID that executed the command. */
    int is_module_client;       /* 1 if command was from RM_Call, 0 otherwise. */
    int argc;                   /* Number of command arguments. */
    ValkeyModuleString **argv;  /* Command arguments array (zero-copy, read-only). */
    int acl_deny_reason;        /* ACL_DENIED_CMD/KEY/CHANNEL/AUTH; 0 for non-ACL events */
    const char *acl_object;     /* Denied resource name (key/channel); NULL for CMD/AUTH */
} ValkeyModuleCommandResultInfoV1;
```

The struct is versioned (`VALKEYMODULE_COMMANDRESULTINFO_VERSION`) for
forward-compatible API evolution.

### Usage Example

```c
/* Callback receives events for whichever event(s) you subscribed to */
void OnCommandResult(ValkeyModuleCtx *ctx, ValkeyModuleEvent eid,
                     uint64_t subevent, void *data) {
    VALKEYMODULE_NOT_USED(ctx);
    VALKEYMODULE_NOT_USED(subevent);

    ValkeyModuleCommandResultInfo *info = (ValkeyModuleCommandResultInfo *)data;
    if (info->version != VALKEYMODULE_COMMANDRESULTINFO_VERSION) return;

    int failed = (eid.id == VALKEYMODULE_EVENT_COMMAND_RESULT_FAILURE);

    /* Access fields directly */
    printf("command=%s status=%s duration=%lldus dirty=%lld client=%llu\n",
           info->command_name,
           failed ? "FAIL" : "OK",
           info->duration_us,
           info->dirty,
           info->client_id);

    /* Access argv (read-only, zero-copy) */
    for (int i = 0; i < info->argc; i++) {
        size_t len;
        const char *arg = ValkeyModule_StringPtrLen(info->argv[i], &len);
        printf("  argv[%d] = %.*s\n", i, (int)len, arg);
    }
}

/* Subscribe in ValkeyModule_OnLoad or at runtime */

/* Option A: command failures only (recommended for audit logging) */
ValkeyModule_SubscribeToServerEvent(ctx,
    ValkeyModuleEvent_CommandResultFailure, OnCommandResult);

/* Option B: command successes only */
ValkeyModule_SubscribeToServerEvent(ctx,
    ValkeyModuleEvent_CommandResultSuccess, OnCommandResult);

/* Option C: both command outcomes */
ValkeyModule_SubscribeToServerEvent(ctx,
    ValkeyModuleEvent_CommandResultSuccess, OnCommandResult);
ValkeyModule_SubscribeToServerEvent(ctx,
    ValkeyModuleEvent_CommandResultFailure, OnCommandResult);

/* Subscribe to ACL Denied */
ValkeyModule_SubscribeToServerEvent(ctx,
        ValkeyModuleEvent_CommandResultACLDenied, OnCommandResult);

/* Unsubscribe: pass a NULL callback */
ValkeyModule_SubscribeToServerEvent(ctx,
    ValkeyModuleEvent_CommandResultFailure, NULL);
```

### Design Decisions

- **Separate events instead of sub-events**: Modules subscribing only to
failures have zero overhead for successful commands (~2ns listener-list
check vs ~30ns callback invocation per command). This is critical since
success events fire on the hot path of every command.
- **Stack-allocated info struct**: The `ValkeyModuleCommandResultInfoV1`
is built on the stack; no heap allocation per event.
- **Zero-copy argv**: Arguments are passed directly from the client's
argv array. Any integer-encoded arguments (from `tryObjectEncoding()`
during command execution) are decoded to string-encoded objects before
being passed to the callback, ensuring compatibility with
`ValkeyModule_StringPtrLen()`.
- **Early exit**: If no modules are subscribed to any server events, the
event firing function returns immediately before building the info
struct.
- **Uses existing server event infrastructure**: Follows the
`ValkeyModule_SubscribeToServerEvent()` pattern used by all other server
events, rather than introducing a new callback mechanism.

### Files Changed

| File | Change |
|---|---|
| `src/valkeymodule.h` | Event IDs, event constants,
`ValkeyModuleCommandResultInfoV1` struct |
| `src/module.c` | `moduleFireCommandResultEvent()`, event
documentation, event version entries |
| `src/module.h` | Function declaration |
| `src/server.c` | Call `moduleFireCommandResultEvent()` from `call()`
after command execution |
| `src/server.c` | Call to `moduleFireCommandACLDeniedEvent` in
`processCommand` after ACL rejection |
| `tests/modules/commandresult.c` | Test module exercising the full API |
| `tests/unit/moduleapi/commandresult.tcl` | Integration tests |

---------

Signed-off-by: martinrvisser <mvisser@hotmail.com>
Signed-off-by: martinrvisser <martinrvisser@users.noreply.github.com>
Co-authored-by: Ricardo Dias <rjd15372@gmail.com>
2026-04-21 09:14:14 -04:00
Dietrich Daroch 9d51f5ff8a Document VALKEYCLI_HOST/PORT variables in help (#3520)
Follow-up to #3402 as we missed documenting this.

---

Signed-off-by: Dietrich Daroch <Dietrich@Daroch.me>
2026-04-21 11:52:54 +02:00
eifrah-aws 0327c27131 Add Static Module Support (#3392)
Add a build option to compile the Lua scripting engine as a static
module and wire the server to load it directly at startup when enabled.
The module load path now resolves on-load and on-unload entry points
from the main binary, and the module lifecycle keeps those callbacks so
unload works without a shared library handle.

The Lua module build was updated to support both static and shared
variants, with the static path exporting visible wrapper symbols and
linking the server with the module archive. While touching the Lua code,
a few internal symbols were renamed for consistency and the monotonic
time helper was clarified.

Note that this PR addresses the LUA module, but it can be applied to
other "core" modules (like: Bloom, Json, Search and others). With this
change, it will be easier to ship Valkey bundle with modules.

Areas touched:

* CMake
* Makefile
* Lua scripting module
* Core module loading

**Generated by CodeLite**

---------

Signed-off-by: Eran Ifrah <eifrah@amazon.com>
2026-04-20 14:45:57 +03:00
Daniil Kashapov 269b1c5eda Improve COB memory tracking with copy avoidance (#3306)
This improves COB memory tracking when using copy avoidance for bulk
string replies. This fix addresses underestimation of client memory
usage that occurred when reply buffers stored pointers to shared `robj`
instead of copying data.
IO threads calculate actual reply sizes by calling `sdslen()` on strings
before writing, for that we need atomic `tracked_for_cob` flag in
payload headers to prevent race conditions and double accounting.

See #2396

---------

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
2026-04-20 14:45:51 +03:00
Madelyn Olson 4a42c95853 Fix HPERSIST RESP protocol violation on wrong-type key (#3516)
`hpersistCommand` calls `addReplyArrayLen` before `lookupKeyWrite` +
`checkType`. When HPERSIST targets a non-hash key, the server writes a
RESP array header followed by a WRONGTYPE error — a malformed response
that permanently desynchronizes the client connection.

This moves `lookupKeyWrite` + `checkType` before `addReplyArrayLen`,
matching the pattern used by every other HFE command (e.g.
`hgetdelCommand`, `hexpireGenericCommand`).

Added a test for HPERSIST on a wrong-type key.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
2026-04-17 17:02:04 -07:00
Sarthak Aggarwal 109ef346f3 [Flaky Tests] Avoid re-triggering io-thread activation (#3509)
The test was accidentally waking the IO threads while trying to check
that they had gone idle.

After the recent IO-thread refactor in #3324, the
[test](https://github.com/valkey-io/valkey/pull/3324/changes#diff-21314ec3a338f739eab1536f91f528d1efe7c6a93935a71b9c02f77a3858f121R112)
started forcing `io-threads-always-active`, and its repeated `INFO`
polling counted as fresh activity. So instead of just observing the
worker threads, the test kept reactivating them and then flaked.

---------

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
Co-authored-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>
2026-04-17 17:00:43 -07:00
Roshan Khatri b2d08c9ef9 Fix use-after-unload crash in test auth module's blocking thread (#3464)
## Problem

The test `Test module aof save on server start from empty` in
`tests/unit/moduleapi/hooks.tcl` sporadically crashes with `I/O error
reading reply`.

**Frequency:** 2 out of 15 days (March 26 on
`centosstream9-tls-module-no-tls`, April 8 on `fedorarawhide-jemalloc`).

**Example failing run:**
https://github.com/valkey-io/valkey/actions/runs/24110987718/job/70345236353

## Root Cause

The crash is a **use-after-unload** in the auth test module's blocking
authentication thread, NOT a timing issue in the AOF test.

The crash log from April 8 shows:
```
71112:M 00:42:59.710 * Module testacl unloaded
71112:M 00:42:59.711 # crashed by signal: 11, si_code: 1
71112:M 00:42:59.711 # Crashed running the instruction at: 0x7f9dc717384b
```

The sequence:
1. `blocking_auth_cb` spawns a background thread
(`AuthBlock_ThreadMain`) that sleeps 500ms
2. Thread wakes, calls `ValkeyModule_UnblockClient()` → main thread
processes unblock, decrements `module->blocked_clients`
3. Auth command completes, test calls `r module unload testacl`
4. `moduleUnloadInternal` checks `blocked_clients == 0`; if true, it
proceeds with `dlclose()`
5. **But the background thread is still executing cleanup code**
(freeing strings, returning from function)
6. Thread returns into unmapped memory → **SIGSEGV**

The `invalidFunctionWasCalled` in the stack trace is the crash handler's
safety stub, and the crashing address `0x7f9dc717384b` is in the
unmapped auth.so address space.

## Fix

Track the background thread ID and `pthread_join()` it in
`ValkeyModule_OnUnload` before the module is dlclose'd. This ensures the
thread has fully exited before the code is unmapped.

The key insight is that `ValkeyModule_UnblockClient()` signals "auth is
done" but not "thread is done" — the thread still has cleanup code to
execute after that call. `pthread_join()` is the correct synchronization
point because it only returns after the thread has fully exited.

No mutex is needed since both `blocking_auth_cb` (which creates the
thread) and `OnUnload` (which joins it) run on the main event loop
thread.

Changes to `tests/modules/auth.c`:
- Add global `blocking_auth_tid` and `blocking_auth_tid_valid` flag
- Set `blocking_auth_tid_valid = 1` after successful `pthread_create`
- In `OnUnload`, `pthread_join` the thread if one was created

## Testing

Ran `unit/moduleapi/hooks` 100 loops on rpm-distros and ubuntu runners —
**all passed**:
- **Workflow run:**
https://github.com/roshkhatri/valkey/actions/runs/24164276124
- **Config:** `--loops 100 --single unit/moduleapi/hooks` on
`almalinux8`, `almalinux9`, `fedoralatest`, `fedorarawhide`,
`centosstream9`, `ubuntu-jemalloc`, `ubuntu-arm`
- **Result:** 7/7 jobs passed, zero failures across 700 total test iterations

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
2026-04-17 10:05:03 -07:00
Viktor Söderqvist 8a91a12398 Unique samples in hashtableSampleEntries (#3460)
Instead of "scanning" random bucket chains using a random cursor for
each scan call, start at a random cursor and then continue sampling
buckets in scan order. The scan stops when we have sampled all elements,
so the cursor never wraps around to sample the same buckets again. This
ensures that we don't get any duplicate samples.

The functions hashtableRandomEntry and hashtableFairRandomEntry keep
their old behavior via a separate sampling function; the fairness tests
fail if they use the modified hashtableSampleEntries.

This restores the behavior of dictGetSomeKeys (which is now an alias of
hashtableSampleEntries) and deflakes the test case:

Gossip count scales with higher percentage of
`cluster-message-gossip-perc`
    in tests/unit/cluster/packet.tcl

Fixes #3454

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2026-04-16 16:53:47 +02:00