It's important to enable ASAN for the run-extra-tests label so we can
catch some of the bugs in the PRs before they are merged into unstable.
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Big endian support on Valkey is "best effort" and not guaranteed, but we
haven't been doing any regular testing at all afaik. This PR adds a job
to the daily workflow to run UTs on an emulated big endian platform.
Integration tests failed excessively because of how slow emulation is.
I fixed several problems with tests and improved UT coverage of key
points where endian byte order matters - and fwiw I didn't find any
bugs. I think the main coverage gap remaining after this is RDB
serialization (maybe little endian <-> big endian round trips?)
There are a couple of lines of endian-specific code for #3166, and this change
can test it.
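As an illustration of the kind of byte-order round trip such tests exercise, here is a minimal sketch. It is not taken from the Valkey test suite; the helper names and the 64-bit example value are assumptions made for illustration.

```python
import struct


def to_little_endian(value):
    """Serialize a 64-bit unsigned integer in little endian byte order."""
    return struct.pack("<Q", value)


def to_big_endian(value):
    """Serialize a 64-bit unsigned integer in big endian byte order."""
    return struct.pack(">Q", value)


def round_trip(value):
    """Write as little endian, byte-swap, then read back as big endian."""
    little = to_little_endian(value)
    swapped = little[::-1]  # reversing the bytes models a byte-swap helper
    return struct.unpack(">Q", swapped)[0]


value = 0x0102030405060708
print(to_little_endian(value).hex())  # 0807060504030201
print(to_big_endian(value).hex())     # 0102030405060708
print(round_trip(value) == value)     # True
```

A value that survives the swap unchanged shows that the writer and reader agree on the layout, which is the property an RDB round-trip test between platforms would need to check.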
Signed-off-by: Rain Valentine <rsg000@gmail.com>
Migrated the remaining cluster tests to tests/unit/cluster/ to use the same
framework for all cluster tests. Cleaned up the obsolete cluster test framework
files and updated the CI workflows to use the new unified test runner.
Changes:
Moved and mapped 6 test files:
- 03-failover-loop.tcl → Merged into existing failover.tcl
- 04-resharding.tcl → resharding.tcl
- 12-replica-migration-2.tcl + 12.1-replica-migration-3.tcl →
replica-migration-slow.tcl
- 07-replica-migration.tcl → Merged into existing replica-migration.tcl
- 28-cluster-shards.tcl → Merged into existing cluster-shards.tcl
Other changes:
- Converted old framework APIs (e.g., K, RI) to new framework APIs (e.g., R, srv)
- Added process_is_alive check in cluster_util.tcl to fix an exception in
failover tests caused by executing ps on dead processes
- Heavy tests (resharding, replica-migration-slow) marked with slow tag and
wrapped in run_solo to prevent resource contention in sanitizer environments
- replica-migration-slow marked with valgrind:skip tag since it is very slow
- Removed the entire tests/cluster/ directory including run.tcl, cluster.tcl,
includes/, and helpers/
- Kept runtest-cluster as a wrapper script (exec ./runtest --cluster "$@")
- Removed ./runtest-cluster calls from .github/workflows/daily.yml as cluster
tests are now included in ./runtest
Closes #2297.
Signed-off-by: Jun Yeong Kim <junyeonggim5@gmail.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
This PR bootstraps Valkey's provenance guard integration.
The provenance guard is a content-based similarity detection system that helps maintain proper code provenance by comparing incoming PR changes against fingerprint databases built from Redis commits and PRs. The matching logic now lives in the external `valkey-io/verify-provenance` action repository; this PR wires Valkey to that action and seeds the required database branch.
Key features:
* Content-based detection: Uses normalized diff fingerprints and fuzzy matching to detect similar changes, including cases where files have moved or been refactored.
* Externalized action logic: The check and refresh implementation is maintained in `valkey-io/verify-provenance` and is pinned by exact commit SHA from Valkey workflows.
* Provenance Guard workflow: Runs on PR activity to check incoming changes against the provenance databases and report potential matches.
* Daily Refresh workflow: Runs daily to refresh PR fingerprints and commits updated data back to `verify-provenance-db`.
* Dedicated DB branch: Stores provenance databases on the orphan `verify-provenance-db` branch, separate from Valkey source code.
* Privacy-first storage: Stores compressed non-reversible fingerprints, not source code.
The initial `verify-provenance-db` branch has been bootstrapped with fingerprints of Redis commits and PRs.
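To illustrate the general idea of normalized fingerprints with fuzzy matching, here is a minimal sketch. It is not the `valkey-io/verify-provenance` implementation: the normalization rules, the shingle size, the hash truncation, and the names `normalize`, `fingerprint`, and `similarity` are all assumptions chosen for illustration.

```python
import hashlib
import re


def normalize(diff_text):
    """Keep only added/removed lines, without markers or whitespace noise."""
    lines = []
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            # File headers carry paths, which change when files move.
            # (Simplification: a changed line starting with "--" is also dropped.)
            continue
        if line.startswith(("+", "-")):
            body = re.sub(r"\s+", " ", line[1:]).strip()
            if body:
                lines.append(body)
    return lines


def fingerprint(diff_text, k=3):
    """Hash every window of k normalized lines; only the hashes are stored."""
    lines = normalize(diff_text)
    if not lines:
        return set()
    if len(lines) < k:
        windows = ["\n".join(lines)]
    else:
        windows = ["\n".join(lines[i:i + k]) for i in range(len(lines) - k + 1)]
    return {hashlib.sha256(w.encode("utf-8")).hexdigest()[:16] for w in windows}


def similarity(a, b):
    """Jaccard similarity between two fingerprint sets, from 0.0 to 1.0."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Because paths and whitespace are discarded before hashing, a change that was moved to another file or re-indented still produces the same fingerprint set in this sketch, and the stored hashes cannot be turned back into source text.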
---------
Signed-off-by: Ping Xie <pingxie@outlook.com>
### Analysis
The daily CI sanitizer jobs with clang are failing during the build
step.
The `ubuntu-latest` runner now has clang 18, but the LLVM gold plugin
is still version 17. When the static Lua module is built with `-flto`,
the `.o` files contain LLVM 18 bitcode that the gold plugin (v17) cannot
read:
`bfd plugin: LLVM gold plugin has failed to create LTO module: Unknown
attribute kind (91) (Producer: 'LLVM18.1.3' Reader: 'LLVM 17.0.6')
`
Example failure:
https://github.com/valkey-io/valkey/actions/runs/24753491944/job/72421581512
### Fix
Pin the sanitizer jobs to `clang-17` so the compiler and gold plugin
versions match.
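The mismatch can be read directly from the error message. The following is a small sketch, not part of the CI change, that extracts the producer and reader major versions from the log line quoted above; the function name and the regular expression are assumptions.

```python
import re

LOG_LINE = (
    "bfd plugin: LLVM gold plugin has failed to create LTO module: Unknown "
    "attribute kind (91) (Producer: 'LLVM18.1.3' Reader: 'LLVM 17.0.6')"
)


def lto_major_versions(message):
    """Return (producer_major, reader_major), or None if the line does not match."""
    match = re.search(
        r"Producer: 'LLVM\s*(\d+)[\d.]*' Reader: 'LLVM\s*(\d+)[\d.]*'", message
    )
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


producer, reader = lto_major_versions(LOG_LINE)
print(producer, reader, "mismatch" if producer != reader else "match")
```

Pinning the compiler to the reader's major version makes both numbers equal, which is the condition the fix restores.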
Tested (successfully built):
https://github.com/hanxizh9910/valkey/actions/runs/24859845008
### Note
If `clang-17` is removed from the `ubuntu-latest` image in the future,
we may need to add an explicit install step.
Signed-off-by: Hanxi Zhang <hanxizh@amazon.com>
The RXE project should match the version of the CI machine. Show the
`uname` output in the RDMA CI job to find out why the kmod installation
fails.
Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>
Upload the entire results directory instead of only metrics JSON files.
This includes server logs which are useful for debugging benchmark
failures.
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Pin package manager dependencies in CI workflows to improve the Pinned-Dependencies
score in OpenSSF Scorecard.
Changes:
- benchmark-on-label.yml, benchmark-release.yml: add `--require-hashes`
to `pip install`, building on the change in the valkey-perf-benchmark repo:
https://github.com/valkey-io/valkey-perf-benchmark/pull/44
- ci.yml: pin `yamlfmt` to `v0.21.0` instead of `@latest`
- reply-schemas-linter.yml: use npm ci with `package-lock.json` instead
of unpinned npm install, package files in `utils/reply-schema-linter/`
Signed-off-by: Roshaan Khatri <rvkhatri@amazon.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Previously, our workflow used a global concurrency group, which
effectively limited execution to one running job and one pending job.
Any additional requests were automatically canceled, preventing a true
queue from forming.
We are now shifting to a model where we remove the concurrency
restriction and allow jobs to queue directly on the self-hosted runner.
This enables multiple workflow runs to be accepted and queued instead of
being dropped.
While GitHub can accept workflow triggers at a high rate (e.g., hundreds
per minute), the actual execution is still constrained by runner
capacity, in our case, a single runner processing one job at a time.
However, queued jobs are subject to GitHub’s 24-hour timeout policy.
This means any job that waits in the queue for more than 24 hours before
starting will be automatically canceled (timed out).
In practical terms, this approach improves reliability by eliminating
premature cancellations, but the effective queue size is still bounded
by how many jobs the runner can process within a 24-hour window. We
could increase the number of runners to run these in parallel.
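The bound described above is simple arithmetic. As a rough sketch, assuming every job takes about the same time (the 90-minute duration below is a made-up figure, not a measured one):

```python
def jobs_processed_in_window(job_minutes, runners=1, window_hours=24.0):
    """Number of whole jobs the runners can finish inside the queue window."""
    per_runner = int((window_hours * 60) // job_minutes)
    return per_runner * runners


# Hypothetical 90-minute benchmark job.
print(jobs_processed_in_window(90))             # 16
print(jobs_processed_in_window(90, runners=2))  # 32
```

Jobs queued beyond that depth would still be waiting when the 24-hour limit is reached, so adding runners raises the usable queue depth proportionally.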
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
The daily workflow was directly invoking the `valkey-unit-gtests` executable.
The intended invocation is to use `gtest-parallel` to ensure that the tests are executed in isolation.
Signed-off-by: harrylin98 <harrylin980107@gmail.com>
`weekly.yml` calls `daily.yml` with `use_git_ref` set to each release
branch (for example 7.2). But the checkout logic in `daily.yml` only
used `inputs.use_git_ref` when `github.event_name` was
`workflow_dispatch` or `workflow_call`; otherwise it fell back to
`github.ref`.
For reusable workflows, GitHub keeps the caller workflow’s github
context. That means when `weekly.yml` is triggered by schedule, the
called `daily.yml` still sees `github.event_name == 'schedule'` and
`github.ref` for the caller branch (unstable). As a result, jobs labeled
as release-branch runs could still check out unstable.
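The behavior can be modeled in a few lines. This sketch mirrors the old condition as described above; the replacement function shows one possible rule (prefer the input whenever it is set), which is an assumption about the shape of the fix rather than a copy of the workflow expression.

```python
def old_checkout_ref(event_name, github_ref, use_git_ref):
    """Old rule: trust the input only for dispatch/call events."""
    if event_name in ("workflow_dispatch", "workflow_call"):
        return use_git_ref
    return github_ref


def input_first_checkout_ref(github_ref, use_git_ref):
    """Assumed fix: use the input when it is non-empty, else fall back."""
    return use_git_ref or github_ref


# weekly.yml runs on a schedule and calls daily.yml for the 7.2 branch.
# The called workflow still sees the caller's event name and ref.
event_name = "schedule"
github_ref = "refs/heads/unstable"

print(old_checkout_ref(event_name, github_ref, "7.2"))  # refs/heads/unstable
print(input_first_checkout_ref(github_ref, "7.2"))      # 7.2
```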
Added a guard for Gtest Unit Tests. It will skip the job if gtest is not
available / supported.
Run with CI Issue:
https://github.com/valkey-io/valkey/actions/runs/22815380713
---------
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Carries on from where #3161 left off. The test-sanitizer-address-large-memory
jobs were being OOM-killed on GitHub-hosted runners (15.6GB RAM) due to
ASAN's 2-3x memory overhead.
Changes:
- Skip 4GB quicklist compression test under ASAN (requires ~16-24GB with
dual buffers + ASAN overhead)
- Reduce integration test sizes from 5GB to 4.1GB (preserves >4GB 32-bit
boundary coverage)
- Reduce XADD iterations from 10 to 3
- Add memory monitoring to track minimum free memory during CI runs
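The figures in the list can be reproduced with a short calculation. This is one reading of the numbers, assuming the sizes are binary gigabytes and that "dual buffers" means two copies of the 4GB payload:

```python
GIB = 1024 ** 3

payload_bytes = 4 * GIB
buffers = 2                      # assumption: two copies held at once
asan_low, asan_high = 2, 3       # ASAN overhead factor from the description
runner_ram_gib = 15.6

low_gib = payload_bytes * buffers * asan_low / GIB
high_gib = payload_bytes * buffers * asan_high / GIB
print(low_gib, high_gib)             # 16.0 24.0
print(low_gib > runner_ram_gib)      # True: even the low estimate exceeds RAM

# The reduced 4.1GB size still crosses the 32-bit length boundary.
boundary = 2 ** 32
reduced_bytes = int(4.1 * GIB)
print(reduced_bytes > boundary)      # True
```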
Signed-off-by: Rain Valentine <rsg000@gmail.com>
There is now a port of fast_float in C. So instead of having an optional
fast_float dependency, we can just use ffc instead, unconditionally.
https://github.com/kolemannix/ffc.h
It is a high quality port. The performance should be the same or
improved.
Note : I am the maintainer and main author of fast_float.
---------
Signed-off-by: Daniel Lemire <daniel@lemire.me>
Now we will be able to add a `run-cluster-benchmark` label to run a
benchmark with a cluster-mode-enabled valkey-server.
It will use the config
https://github.com/valkey-io/valkey/blob/unstable/.github/benchmark_configs/benchmark-config-arm.json
modified for cluster mode with a single cluster-mode-enabled instance of
valkey.
It uses the same single instance for the benchmark as for run-benchmark.
If both labels are used, they are sequential in the same concurrency group `group:
ec2-al-2023-pr-benchmarking-arm64`.
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Since there is a mismatch between the `ar` tool preinstalled on the
macOS runner and Clang 22 installed by brew, let's use the brew-installed
`llvm-ar`.
Expected to fix the issue in CI job `build-macos-latest`.
---------
Signed-off-by: Björn Svensson <bjorn.a.svensson@est.tech>
This PR fixes a Codecov workflow misconfiguration introduced when
upgrading codecov/codecov-action from v4 to v5 (in #3185).
In v5, the action expects files (plural), but the workflow still used
file.
The coverage shown is 0 right now:
https://app.codecov.io/gh/valkey-io/valkey
Documentation from -
https://github.com/codecov/codecov-action/tree/v5?tab=readme-ov-file#arguments
```
The following arguments have been changed
file (this has been deprecated in favor of files)
plugin (this has been deprecated in favor of plugins)
```
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
`SSL_get0_peer_certificate()` was introduced in OpenSSL 3.0. The recent
commit 7e110ae2b (Support TLS authentication using SAN URI) used it in
`tlsGetPeerUser()` without a version guard, breaking builds against
OpenSSL 1.1.x.
Use `SSL_get_peer_certificate()` on OpenSSL < 3.0 with the corresponding
`X509_free()` since the older API increments the reference count.
Fixes build failure: implicit declaration of function
`SSL_get0_peer_certificate` [-Werror=implicit-function-declaration]
Also fixes the version mismatch for almalinux 9 daily tests.
Closes #3304.
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
The CodeQL workflow is currently throwing a deprecation warning
regarding use of v3.
> CodeQL Action v3 will be deprecated in December 2026. Please update
all occurrences of the CodeQL Action in your workflow files to v4.
This PR introduces the following changes:
* References to CodeQL v3 have been updated to the SHA of the latest
CodeQL release, [v4.32.5].
Signed-off-by: Kurt McKee <contactme@kurtmckee.org>
Honors `workflow_call` inputs rather than checking out the
`GITHUB_HEAD_REF` always.
```
Determining the checkout info
/usr/bin/git branch --list --remote origin/8.0
origin/8.0
/usr/bin/git sparse-checkout disable
/usr/bin/git config --local --unset-all extensions.worktreeConfig
Checking out the ref
/usr/bin/git checkout --progress --force -B 8.0 refs/remotes/origin/8.0
Switched to a new branch '8.0'
branch '8.0' set up to track 'origin/8.0'.
```
Now the workflow checks out the right branch.
Link:
https://github.com/sarthakaggarwal97/valkey/actions/runs/22599943936/job/65479450708#step:3:83
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Publish OpenSSF Scorecard results, which means users and downstream
consumers can easily discover the project’s security best-practice
signals via Scorecard API.
Publishing Scorecard results:
- Improves transparency for users and integrators
- Provides early visibility into missing or improvable security
practices
Fixes #3162
---------
Signed-off-by: Gagan H R <hrgagan4@gmail.com>
This PR enables `USE_LIBBACKTRACE=yes` across all CI builds and builds
upon the changes introduced in #3034. Alpine-based jobs previously
attempted to install `libbacktrace-dev`, which does not exist in
Alpine’s apk repositories.
This caused these two errors in the daily tests below:
-
https://github.com/valkey-io/valkey/actions/runs/22045858351/job/63694456995
-
https://github.com/valkey-io/valkey/actions/runs/22045858351/job/63694457018
To resolve this, Alpine jobs now build GNU libbacktrace from source
inside the container before compiling Valkey. This aligns Alpine
behavior with other environments (Ubuntu jobs) and now avoids utilizing
non-existent Alpine packages.
An alternative approach we can consider is to disable `USE_LIBBACKTRACE`
for Alpine-based tests.
Signed-off-by: Nikhil Manglore <nmanglor@amazon.com>
We made some changes to the workflow where the label was getting removed
every time we ran it. This change handles it, removes
`pull_request_target` and doesn't re-trigger on adding another label.
I tried it here: https://github.com/sarthakaggarwal97/valkey/pull/65
The code is already merged in my unstable.
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
With this we get more detailed backtrace information, including
information about static functions. Off by default - to enable you must
enable at compile time:
make USE_LIBBACKTRACE=yes
Signed-off-by: Rain Valentine <rsg000@gmail.com>
Updates to latest versions for each of the github actions used.
Pinning prevents an attack where the upstream action dependency is
compromised and the "v4" tag for example gets edited to point to a
malicious version. We already do this for most checkout actions in our
workflows.
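A simple check can tell a mutable tag reference from a pinned commit. The sketch below is illustrative only and is not part of this change; the function name and regular expression are assumptions, and the 40-character value is a placeholder, not a real commit SHA.

```python
import re

# owner/repo (optionally with a sub-path) followed by a full 40-hex commit SHA.
SHA_PIN = re.compile(r"^[\w.-]+/[\w./-]+@[0-9a-f]{40}$")


def is_sha_pinned(uses):
    """Return True if a `uses:` value references an exact commit SHA."""
    return SHA_PIN.match(uses) is not None


placeholder_sha = "0123456789abcdef0123456789abcdef01234567"  # not a real commit

print(is_sha_pinned("actions/checkout@v4"))                # False
print(is_sha_pinned("actions/checkout@" + placeholder_sha))  # True
```

A tag such as `v4` can be moved by whoever controls the upstream repository, while a full SHA always resolves to the same content. Local actions and `docker://` references are outside the scope of this sketch.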
---------
Signed-off-by: Rain Valentine <rsg000@gmail.com>
We have been seeing GitHub Actions runners run out of memory (OOM) when
large memory tests are run with ASan. The operation is eventually
canceled during the test.
This change moves the large-memory tests with ASan and UBSan to separate
jobs, so we get a dedicated runner with its own timeout. We can tweak
the number of simultaneous test clients for these tests without
affecting the other test jobs.
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Follow-up to my previous CMake PR
https://github.com/valkey-io/valkey/pull/2816.
**Changes:**
1. **`.github/workflows/ci.yml`** - Removed symlinks, use
`./build-release/runtest` instead of `./runtest`
2. **`tests/support/set_executable_path.tcl`** - Added
`::VALKEY_TLS_MODULE` variable
3. **Fixed hardcoded paths in 5 test files:**
- `tests/unit/tls.tcl` - server and TLS module paths
- `tests/unit/fuzzer.tcl` - benchmark path
- `tests/unit/cluster/cli.tcl` - CLI path
- `tests/support/server.tcl` - TLS module path
- `tests/instances.tcl` - TLS module path
**Result:** All tests passed. The only failure was an unrelated flaky
test (`client-eviction.tcl`) that's been failing since TLS was added to
the cmake job - tracked in issue #3146.
---------
Signed-off-by: Zhijun <dszhijun@gmail.com>
After #3103 time sensitive `test-ubuntu-reclaim-cache` started to fail
because now startup always includes 30ms of calibration of HW clock,
that's why we get this output:
```
Run echo "test SAVE doesn't increase cache"
test SAVE doesn't increase cache
2460491776
Could not connect to Valkey at 127.0.0.1:8080: Connection refused
```
Added waits for the server to start. In local runs, this helps.
---------
Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
We are already double running the tests with CMake, and we are building
CMake with TLS, so this just makes the tests run with TLS. This
seems like a simple update so that we are always running the TLS tests.
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Currently, the weekly runs do not progress if there is a failed workflow
as github CI treats `fail-fast` to be true by default. With this change,
we continue to test all the branches even after failure.
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
The patch fixes the error by adding `pull-requests: write` to the
permissions block of `weekly.yml`. GitHub rejects the run if the caller
does not grant the permissions that the reusable workflow (here
`daily.yml`) requests.
We recently changed the permissions in `daily.yml` where we gave write
permissions in https://github.com/valkey-io/valkey/pull/2907
With this change, we bring parity to the permissions, since `weekly.yml`
calls `daily.yml` through `workflow_call`.
Fixes: https://github.com/valkey-io/valkey/actions/runs/20890545320
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Resolves: https://github.com/valkey-io/valkey/issues/2228
Visualization:
https://github.com/sarthakaggarwal97/valkey/actions/runs/19113712295
Currently, there are no tests running on the already released branches.
We often do backport for bug fixes and CVEs in these older versions, and
end up with multiple CI tests failures on these branches.
The PR adds support for running weekly tests on already released
versions `>= 7.2`. The workflow will execute the "daily" test workflow
for each of these branches on `Sunday 06:00 UTC`.
The idea is to continuously monitor our released versions through weekly
test runs (at a time when there is less activity on GitHub runners).
---------
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Change the behaviour of the CI job triggered by the run-extra-tests
label.
Run the tests immediately when applying the run-extra-tests label to a
PR, without requiring an extra commit to be pushed to trigger the test
run.
When the extra tests have run, the job removes the label.
---------
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
GitHub has deprecated older macOS runners, and macos-13 is no longer supported.
1. The latest version of cross-platform-actions/action does allow
running on ubuntu-latest (Linux runner) and does not strictly require macOS.
2. Previously, cross-platform-actions/action@v0.22.0 used runs-on:
macos-13. I checked the latest version of cross-platform-actions, and
the official examples now use runs-on: ubuntu. I think we can switch from macOS to Ubuntu.
---------
Signed-off-by: Vitah Lin <vitahlin@gmail.com>
This adds the workflow improvements for PR and Release benchmark where
it runs on `c8g.metal-48xl` for `ARM64` and `c7i.metal-48xl` for `X86`.
```
Cluster mode: disabled
TLS: disabled
io-threads: 1, 9
Pipelining: 1, 10
Clients: 1600
Benchmark Threads: 90
Data size: 16, 96
Commands: SET, GET
```
c8g.metal-48xl Spec: https://aws.amazon.com/ec2/instance-types/c8g/
c7i.metal-48xl Spec: https://aws.amazon.com/ec2/instance-types/c7i/
```
vCPU: 192
NUMA nodes: 2
Memory (GiB): 384
Network Bandwidth (Gbps): 50
```
PR benchmarking will be executed on **ARM64** machine as it has been
seen to be more consistent.
Additionally, it runs 5 iterations for each test and posts the average
and other statistical metrics like
- CI99%: 99% Confidence Interval - range where the true population mean
is likely to fall
- PI99%: 99% Prediction Interval - range where a single future
observation is likely to fall
- CV: Coefficient of Variation - relative variability (σ/μ × 100%)
_Note: Values with (n=X, σ=Y, CV=Z%, CI99%=±W%, PI99%=±V%) indicate
averages from X runs with standard deviation Y, coefficient of variation
Z%, 99% confidence interval margin of error ±W% of the mean, and 99%
prediction interval margin of error ±V% of the mean. CI bounds [A, B]
and PI bounds [C, D] show the actual interval ranges._
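For readers who want to reproduce these metrics, here is a minimal sketch for five runs. It assumes the intervals are based on the Student t distribution, which may differ from what the benchmark tooling actually does; the function name and the sample values are made up for illustration.

```python
import math
import statistics

# Two-sided 99% Student t critical value for 4 degrees of freedom (n = 5 runs).
T_CRIT_99_DF4 = 4.604


def summarize(samples):
    """Return mean, sigma, CV%, and 99% CI/PI margins as a percentage of the mean."""
    n = len(samples)
    if n != 5:
        raise ValueError("the hardcoded t value only applies to n = 5")
    mean = statistics.mean(samples)
    sigma = statistics.stdev(samples)  # sample standard deviation
    ci_margin = T_CRIT_99_DF4 * sigma / math.sqrt(n)
    pi_margin = T_CRIT_99_DF4 * sigma * math.sqrt(1 + 1 / n)
    return {
        "n": n,
        "mean": mean,
        "sigma": sigma,
        "cv_pct": sigma / mean * 100,
        "ci99_pct": ci_margin / mean * 100,
        "pi99_pct": pi_margin / mean * 100,
        "ci_bounds": (mean - ci_margin, mean + ci_margin),
        "pi_bounds": (mean - pi_margin, mean + pi_margin),
    }


requests_per_second = [100.0, 102.0, 98.0, 101.0, 99.0]  # made-up sample values
print(summarize(requests_per_second))
```

The prediction interval is always wider than the confidence interval because it covers the spread of a single future run in addition to the uncertainty of the mean.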
For comparing between versions, it adds a workflow which runs on both
**ARM64** and **X86** machine. It will also post the comparison between
the versions like this:
https://github.com/valkey-io/valkey/issues/2580#issuecomment-3399539615
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Signed-off-by: Roshan Khatri <117414976+roshkhatri@users.noreply.github.com>
This PR fixes the freebsd daily job that has been failing consistently
for the last few days with the error "pkg: No packages available to install
matching 'lang/tclx' have been found in the repositories".
The package name is corrected from `lang/tclx` to `lang/tclX`. The
lowercase version worked previously but appears to have stopped working
in an update of freebsd's pkg tool to 2.4.x.
Example of failed job:
https://github.com/valkey-io/valkey/actions/runs/19282092345/job/55135193499
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
* Add cross version compatibility test to run with Valkey 7.2 and 8.0
* Add mechanism in TCL test to skip tests dynamically - #2711
---------
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Makes our tests possible to run with TCL 9.
The latest Fedora now has TCL 9.0 and it's working now, including the
TCL TLS package. (This wasn't working earlier due to some packaging
errors for TCL packages in Fedora, which have been fixed now.)
This PR also removes the custom compilation of TCL 8 used in our Daily
jobs and uses the system default TCL version instead. The TCL version
depends on the OS. For the latest Fedora, you get 9.0, for macOS you get
8.5 and for most other OSes you get 8.6.
The checks for TCL 8.7 are removed, because 8.7 doesn't exist. It was
never released.
---------
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Reduce the request count and warmup time so the run finishes within 6
hours, since the GitHub workflow times out after 6 hours.
---------
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>