Commit Graph

3103 Commits

Author SHA1 Message Date
Raja Subramanian 0cf53e2f0d Add option to force drain rtcService/agentService connections. (#4618)
When force: true, drain as fast as possible.
2026-06-23 16:10:50 +05:30
laosun 6658dd5454 Echo offered audio payload types in single-PC subscriber answer (#4614)
In single peer connection mode, when the server answers a subscriber's
offer, configureSenderAudio set the sender codec preferences from the
server MediaEngine's payload types. The answer could therefore advertise
Opus on a payload type the offerer never offered (server PT 111 vs
offered PT 109). Chrome tolerates this; Firefox decodes 0 samples
(silence) -- packets are received but never decoded. The forwarded RTP
already uses the offered PT, so only the answer SDP was inconsistent.
This regressed in v1.12.0 once the single-PC MediaEngine became a union
of publish+subscribe codecs.

Parse the remote offer's audio rtpmap and remap the sender audio codec
preferences to echo the offered payload types (RFC 3264 6.1) before
SetCodecPreferences.

Fixes #4599

Co-authored-by: laosun <14806343+cnvipstar@users.noreply.github.com>
2026-06-23 10:36:41 +05:30
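
The remapping described in 6658dd5454 can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `offeredAudioPayloadTypes`, `remapPayloadType`, and `sampleOffer` are hypothetical names, and the SDP handling is a plain line scan rather than a full parser.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sampleOffer is a made-up offer fragment in which Opus is offered on PT 109.
const sampleOffer = "v=0\r\n" +
	"m=audio 9 UDP/TLS/RTP/SAVPF 109 9\r\n" +
	"a=rtpmap:109 opus/48000/2\r\n" +
	"a=rtpmap:9 G722/8000\r\n" +
	"m=video 9 UDP/TLS/RTP/SAVPF 120\r\n" +
	"a=rtpmap:120 VP8/90000\r\n"

// offeredAudioPayloadTypes scans an SDP offer and returns a map from
// lower-cased codec name to the payload type offered in audio m-sections.
// When a codec appears more than once, the first payload type wins.
func offeredAudioPayloadTypes(sdp string) map[string]uint8 {
	pts := make(map[string]uint8)
	inAudio := false
	for _, line := range strings.Split(sdp, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "m=") {
			inAudio = strings.HasPrefix(line, "m=audio")
			continue
		}
		if !inAudio || !strings.HasPrefix(line, "a=rtpmap:") {
			continue
		}
		// Format: a=rtpmap:<pt> <codec>/<clock>[/<channels>]
		fields := strings.Fields(strings.TrimPrefix(line, "a=rtpmap:"))
		if len(fields) != 2 {
			continue
		}
		pt, err := strconv.ParseUint(fields[0], 10, 8)
		if err != nil {
			continue
		}
		name := strings.ToLower(strings.SplitN(fields[1], "/", 2)[0])
		if _, seen := pts[name]; !seen {
			pts[name] = uint8(pt)
		}
	}
	return pts
}

// remapPayloadType returns the offered payload type for codec, falling back
// to the server's own payload type when the offer did not list that codec.
func remapPayloadType(offered map[string]uint8, codec string, serverPT uint8) uint8 {
	if pt, ok := offered[strings.ToLower(codec)]; ok {
		return pt
	}
	return serverPT
}

func main() {
	offered := offeredAudioPayloadTypes(sampleOffer)
	// The server would have used PT 111; the answer echoes the offered 109.
	fmt.Println(remapPayloadType(offered, "opus", 111)) // prints 109
}
```

In the real change the remapped values feed the sender codec preferences before SetCodecPreferences; the sketch only shows the lookup step.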
Raja Subramanian 4facbc582a Move lock to addPendingTrack function. (#4617)
Wrapping the function with the lock outside in the only invocation was
not needed.
2026-06-23 10:30:59 +05:30
Raja Subramanian 1b69630a28 Prometheus metric for join latency. (#4616)
* Prometheus metric for join latency.

Also including a couple of other failures in the signal connection path
and moving the signal connected to after all that.

Not doing counters for the new signal failure paths. I should not have
added them for the other two I introduced a little while ago either
(validation failure and start participant failure), as it does not
scale to keep adding to node stats. Will probably remove those two
from node stats later. Can add those counters if they are useful.

* deprecate signal failed counters
2026-06-22 22:07:32 +05:30
Ryan Gaus 86a79f83fc fix: report participant capabilities in ParticipantInfo (#4606) 2026-06-22 09:23:33 -04:00
Raja Subramanian f7085535da Tighten up publish latency stat. (#4615)
Previously it was anchored to the participant transitioning to `ACTIVE` if
the add track request happened before that. But that has a few issues:
1. `ACTIVE` is for the primary peer connection, which could be the
subscriber peer connection.
2. `ACTIVE` also includes data channel establishment.

Switch to first connected time of publisher peer connection for that to
get a more accurate measure of track publish time.
2026-06-22 17:36:06 +05:30
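
One way to read the anchoring change in f7085535da is sketched below. The function and helper names (`publishLatency`, `at`) are hypothetical, and the assumption is that the measurement starts at whichever is later: the add track request or the first connected time of the publisher peer connection.

```go
package main

import (
	"fmt"
	"time"
)

// base is an arbitrary reference instant used only by this example.
var base = time.Date(2026, 6, 22, 0, 0, 0, 0, time.UTC)

// at returns a timestamp ms milliseconds after base.
func at(ms int) time.Time {
	return base.Add(time.Duration(ms) * time.Millisecond)
}

// publishLatency measures from the later of the add track request and the
// publisher peer connection's first connected time until the track is
// published. Time spent waiting for the transport is not counted.
func publishLatency(addTrackAt, publisherConnectedAt, publishedAt time.Time) time.Duration {
	start := addTrackAt
	if publisherConnectedAt.After(start) {
		start = publisherConnectedAt
	}
	return publishedAt.Sub(start)
}

func main() {
	// Request arrives before the publisher PC connects: anchor at connect time.
	fmt.Println(publishLatency(at(0), at(300), at(500))) // prints 200ms
	// Request arrives after the publisher PC connects: anchor at request time.
	fmt.Println(publishLatency(at(400), at(300), at(500))) // prints 100ms
}
```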
CloudWebRTC 13ce35fc87 fix: Clear the enableStartAtDesiredQuality flags in MaybeExpireAcquireGrace. (#4613) 2026-06-22 14:03:47 +08:00
Raja Subramanian a3a6b6de96 trivial: remove unused config. (#4611)
Noticed a config in deploy config while cleaning up some other unused
config. Small clean up. There is probably a bunch more that can be
cleaned up, but doing a quick one as I noticed this.
2026-06-21 16:37:21 +05:30
CloudWebRTC cfedcc71d0 feat: acquire requested video layer directly at HIGH quality by default (#4595)
* feat: acquire requested video layer directly at HIGH quality by default

Two changes that together remove the visible low->high quality ramp for a new
subscriber (both publisher-first and subscriber-first join orders):

1. Default a subscriber's initial video quality to HIGH on bind instead of LOW
   for adaptive stream, so the subscribed max layer is the top layer. Adaptive
   stream clients can still scale down afterwards based on viewport.

2. On initial layer acquisition the forwarder/selector latch directly onto the
   allocator's target (the requested top layer) instead of opportunistically
   latching onto the first lower key frame that arrives. A short
   initial-acquisition grace aims the target at the requested layer; if it does
   not show up in time, the target falls back to the highest layer seen so
   acquisition never stalls.

Always on - no configuration flag.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat: gate start-at-desired-quality behind EnableStartAtDesiredQuality flag

Put the "acquire requested video layer directly at HIGH quality" behavior
behind a per-subscriber EnableStartAtDesiredQuality flag (default off, so
the original low->high ramp-up is restored unless enabled).

Plumbed from config.RTC.EnableStartAtDesiredQuality through ParticipantParams
-> SubscribedTrack/DownTrack -> Forwarder -> simulcast selector, gating all
three behavior changes: the HIGH default on bind, the forwarder's
initial-acquisition grace, and the selector's direct-latch-onto-target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* remove config.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-06-19 20:12:54 +08:00
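
The initial-acquisition grace from cfedcc71d0 can be pictured with the sketch below. It is an illustration under stated assumptions, not the forwarder code: `acquisitionTarget` and `invalidLayer` are hypothetical names, and the 500 ms grace used in `main` is a made-up value since the commit message does not give one.

```go
package main

import (
	"fmt"
	"time"
)

// invalidLayer marks "no layer seen yet" in this sketch.
const invalidLayer = -1

// acquisitionTarget picks the spatial layer to aim at during initial
// acquisition. While the grace period is running it keeps aiming at the
// requested layer; once the grace expires without that layer showing up,
// it falls back to the highest layer seen so acquisition does not stall.
func acquisitionTarget(requested, highestSeen int, elapsed, grace time.Duration) int {
	if highestSeen >= requested {
		return requested // requested layer is available, latch directly
	}
	if elapsed < grace {
		return requested // still within grace, keep waiting for it
	}
	return highestSeen // may be invalidLayer if nothing has arrived yet
}

func main() {
	grace := 500 * time.Millisecond // hypothetical value
	fmt.Println(acquisitionTarget(2, 0, 100*time.Millisecond, grace)) // prints 2
	fmt.Println(acquisitionTarget(2, 1, 600*time.Millisecond, grace)) // prints 1
}
```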
Raja Subramanian a011d995da Do not call nil callback (#4607) 2026-06-18 23:24:33 +05:30
Raja Subramanian 9a7fe3cc68 Do not log due to negative getting interpreted as large unsigned positive (#4605) 2026-06-18 20:08:52 +05:30
Raja Subramanian c6303bb15a Fix skipped packets accounting. (#4604)
* Fix skipped packets accounting.

No need to copy unskipped packet RTP header to skipped packet.
That was causing padding bytes to be counted.

Also use Header.PaddingSize as base PaddingSize is deprecated.

* PaddingSize in header in utils
2026-06-18 11:58:39 +05:30
Paul Wells b882ccc86d service: cap all metadata at 512 KiB; enforce on join, agent dispatch, and embedded agents (#4602)
* service: enforce metadata size limit in CreateRoom, bump default to 512 KiB

CreateRoom previously accepted any metadata size; only UpdateRoomMetadata
rejected oversized payloads. Mirror the same CheckMetadataSize check at
the CreateRoom API boundary so both entrypoints are bounded.

Default MaxMetadataSize moves from 64000 to 512 * 1024 to match the
practical needs of customers using room metadata for richer state. The
limit remains configurable via the existing limits.max_metadata_size knob.

* service: split room vs. participant metadata limit, enforce on join + agent dispatch

LimitConfig.MaxMetadataSize was shared between room metadata and
participant metadata. Last commit's bump to 512 KiB lifted both ceilings;
this restores the participant ceiling to 64 KB and introduces a separate
MaxRoomMetadataSize (default 512 KiB) for room metadata.

Additional enforcement:

- RoomManager.StartSession rejects joins whose JWT-grants metadata or
  attributes exceed the participant/attributes limits. The check was
  missing entirely from this path.
- AgentDispatchService.CreateDispatch and the embedded
  CreateRoomRequest.Agents path now validate metadata and attributes
  against the common 64 KB ceilings (previously unbounded).

NewAgentDispatchService gains a LimitConfig parameter; the two wire_gen
callsites are updated.

* service: collapse metadata size limit to single 512 KiB knob

Reverts the LimitConfig split introduced in the previous commit:
MaxRoomMetadataSize, CheckRoomMetadataSize, and the max_room_metadata_size
yaml key are removed. MaxMetadataSize moves back to 512 * 1024 and gates
all metadata uniformly — room (CreateRoom, UpdateRoomMetadata), participant
(UpdateParticipant, signal UpdateMetadata, JWT grants on join), and agent
dispatch (CreateDispatch + embedded RoomAgentDispatch).

MaxAttributesSize stays at 64 KB and continues to gate participant and
agent-dispatch attributes separately.

Test cases consolidated under the single knob.

* kb -> kib
2026-06-17 12:35:59 -07:00
Raja Subramanian e7c63aa537 Log subscription limit breaches (#4603) 2026-06-18 00:08:39 +05:30
Raja Subramanian 67ca7a12cf Record more RTC cancellation points. (#4600)
There are several places the participant can drop off after initiating a
connection attempt. Count those places as cancellation including when
participant is closed due to specific reasons.

Cancels should be discounted when determining RTC/ICE connectivity
success/failure percentage.
2026-06-17 20:43:29 +05:30
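
Discounting cancels, as 67ca7a12cf suggests, amounts to removing them from the denominator. The sketch below assumes three plain counters (attempts, connected, canceled); the function name and the counter names are hypothetical and do not correspond to specific metric names in the server.

```go
package main

import "fmt"

// connectSuccessPercent returns the connection success percentage with
// canceled attempts removed from the denominator. The boolean is false
// when no non-canceled attempts remain, in which case no rate is defined.
func connectSuccessPercent(attempts, connected, canceled uint64) (float64, bool) {
	if canceled >= attempts {
		return 0, false
	}
	effective := attempts - canceled
	return float64(connected) / float64(effective) * 100, true
}

func main() {
	// 100 attempts, 20 canceled by the participant, 60 connected:
	// 60 of the 80 attempts that ran to completion succeeded.
	fmt.Println(connectSuccessPercent(100, 60, 20)) // prints 75 true
}
```

Without the discount the same numbers would read as 60%, which understates connectivity when participants drop off on their own.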
Paul Wells 12a023ae45 agent: thread attributes map from dispatch to job (#4598)
* agent: thread simulation flag from dispatch to job

Reads simulation from AgentDispatch / RoomAgentDispatch and copies it
onto Job in agent.LaunchJob and the inline room-agent path so workers
see the flag.

Stacked on top of livekit/protocol#1629.

* agent: replace simulation bool with attributes map

Threads the renamed attributes map (was bool simulation) from dispatch
to job and bumps the protocol pseudo-version.

* deps
2026-06-16 01:53:01 -07:00
David Colburn 1f3e06107b egress v2 api (#4592)
* egress v2

* reorganize
2026-06-12 15:17:02 -04:00
cnderrauber 9746c9a9d6 Enforce subscription permission to data track (#4588)
* Enforce subscription permission to data track

* use the same revoke path as media track

* nil check
2026-06-12 16:02:12 +08:00
shishirng 08ab361e8e [WIP] rtc: add RestartSessionTimer to re-anchor participant session duration (#4566)
* rtc: add RestartSessionTimer to re-anchor participant session duration

Exposes ParticipantImpl.RestartSessionTimer so the session timer can be
re-anchored to the actual join time. Duration is only ever emitted once
the participant becomes active, so re-anchoring at join keeps pre-join
wall-clock out of the reported/billed duration. Adds the method to the
LocalParticipant interface (fake regenerated) and a local protocol
replace to pick up SessionTimer.Reset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* tidy

* update protocol

* report ended at for inactive sessions

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Paul Wells <paulwe@gmail.com>
2026-06-11 10:02:25 -07:00
Raja Subramanian 688cc66ed8 Add API to get latest node stats. (#4589) 2026-06-11 19:31:39 +05:30
Trey Hakanson 233a226438 Add ability to run pprof on dedicated HTTP server (#4584)
This allows exposing the pprof/debug endpoints in a production
environment more easily, where it shouldn't be exposed publicly.
2026-06-10 21:23:39 -07:00
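
For readers unfamiliar with the pattern behind 233a226438, a dedicated pprof server in Go generally means registering the `net/http/pprof` handlers on their own mux instead of the public one. The sketch below uses only standard library calls; `newPprofMux`, `probe`, and the `127.0.0.1:6060` address in the comment are illustrative choices, not the server's actual configuration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newPprofMux returns a mux that serves only the pprof endpoints, so it can
// be bound to a private address separately from the public API listener.
func newPprofMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// probe issues an in-memory GET against h and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newPprofMux()
	fmt.Println(probe(mux, "/debug/pprof/")) // prints 200
	fmt.Println(probe(mux, "/rtc"))          // prints 404
	// In a deployment the mux would be served on a non-public address, e.g.:
	//   http.ListenAndServe("127.0.0.1:6060", mux)
}
```

Note that importing `net/http/pprof` also registers the handlers on `http.DefaultServeMux`, so the public listener should use its own mux as well.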
cnderrauber 816d37281d Add grants expiry to Auth context (#4581) 2026-06-10 17:44:58 +08:00
cnderrauber 7dc6877738 Preserve original expiry when refreshing token (#4580)
Avoid shortening the token expiration time on refresh, which caused
client reconnects to fail after the network had been down for a long
time (>5min).
2026-06-10 14:51:10 +08:00
Raja Subramanian 8d2b827f44 Add prom metrics for peer connection state. (#4574)
* Add prom metrics for peer connection state.

By direction (PUBLISHER vs SUBSCRIBER) and state ("started" ->
"connected"). This gives a way to track peer connections failing to
finish establishment.

The RTC active count can be useful for primary peer connection, but not
for non-primary. This counter can be used to track any and can generally
be used to understand success/failure rate of peer connection
establishment.

* add a couple more states

* clean up and avoid duplicate reporting fully established

* staticcheck
2026-06-09 16:11:03 +05:30
David Zhao e0815be27d chore: improve docker test shutdown reliability (#4576) 2026-06-08 08:27:15 -07:00
Dan Root bfd9deffd7 expose TCPFallbackRTTThreshold and AllowUDPUnstableFallback via config (#4556) 2026-06-08 22:07:08 +08:00
renovate[bot] dc8e0310ad Update go deps to v4 (#4482)
* Update go deps to v4

Generated by renovateBot

* update dockertest to v4

* fix

---------

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: David Zhao <dz@livekit.io>
2026-06-07 23:07:40 -07:00
Ben Mayer 20fd1ad2c1 turn: allow for providing secret via file (#4564)
* turn: allow for providing secret via file

* turn: improve secret_file changes
2026-06-08 11:18:14 +08:00
Paul Wells 77ecf920ff rtc: report participant session end time on room move (#4561)
MoveToRoom resets the participant reporter resolver to receive new
(room, participant_session) keys for the destination, but the source
room's participant_session row never gets an end_time — the periodic
duration scrape only emits one once disconnectedAt is set, and a move
doesn't transition the participant to DISCONNECTED. Report end_time
immediately before the reset so the row is closed out cleanly.
2026-06-03 21:35:39 -07:00
cnderrauber 63be96f631 Prevent panic from nil(illegal) syncState.Subscriptions message (#4560) 2026-06-04 10:32:24 +08:00
Raja Subramanian 835ef1b353 Metrics for participant active, i. e. fully established. (#4557)
* Metrics for participant active, i. e. fully established.

- Egress stub for v2 API
- Fix the participant canceled counter 🤦
- Add active counter -> this is incremented when a participant becomes
  active, i.e. primary peer connection established. Can be used to
  monitor node-wise connection establishment issues.
- Add signalling validation fail counter.

With this, we have
- signalling validation fail
- signalling failed -> this is when the `startSession` fails
- signalling connected -> signalling is successful and can send back
  joinResponse to client

on media connection side
- rtc_init -> start
- rtc_connected -> participant session created (joined)
- rtc_active -> primary peer connection established
- rtc_canceled -> could not proceed with RTC connection due to not being
  able to resume.

* signalling counters deps

* revert pion/webrtc to 4.2.12 to get SCTP without interleaving

* go back to pion/webrtc 4.2.11 and sctp 1.9.5
2026-06-03 19:50:19 +05:30
cnderrauber 356ae211a3 Config documentation for advertise_internal_ip and skip_external_ip_validation (#4552)
See https://github.com/livekit/mediatransportutil/pull/88
2026-06-01 14:37:08 +08:00
shishirng 7c319a67d4 rtc: prevent duration reporting for inactive participants (#4550)
Added a check to ensure that duration is not published for participants
that never became active.
2026-05-27 14:39:04 -04:00
Paul Wells 2dd5e63207 telemetry: split webhook-processed hook out of NewTelemetryService (#4548)
* telemetry: split webhook-processed hook registration out of NewTelemetryService

NewTelemetryService used to register a notifier processed-hook on the inner
*telemetryService directly. That made it impossible for downstream wrappers
(e.g. cloud's TelemetryService that overrides Webhook to fan out to a v3
observability pipeline) to intercept webhook events without double-firing
the legacy emission.

Lift the registration into a new exported helper RegisterWebhookHook, and
have the standalone server's wire provider createTelemetryService call it
right after construction so behavior is unchanged for callers that don't
wrap the service.
2026-05-27 09:40:55 -07:00
Paul Wells 222177a9e4 service: prevent nil deref in validate with wrapped join request (#4547)
When a client hits /rtc/v[01]/validate with a base64 WrappedJoinRequest
whose embedded JoinRequest.ClientInfo is unset, validateInternal called
AugmentClientInfo with a nil *ClientInfo and panicked at ci.Address =
GetClientIP(req). The non-wrapped branch already allocates via
ParseClientInfo; do the same here so pi.Client always gets at least the
resolved client Address.
2026-05-26 08:34:15 -07:00
Raja Subramanian dd7580b454 Protect against nil clientInfo (#4546) 2026-05-26 20:32:11 +05:30
Ninad Pundalik 145689e627 Start tracking Twirp method request latency in prometheus too, not just in logs (#4545)
* Start tracking Twirp method request latency in prometheus too, not just datadog
* Simplify latency tracking, do it in the logger itself
2026-05-26 14:53:16 +05:30
Paul Wells cde8962709 rtc: emit per-data-track bytes via BytesTrackStats (#4540)
Data tracks (the new _data_track datachannel) previously only updated a
private dataTrackStats that logged a single summary at Close. Bytes never
reached the OnTrackStats -> TelemetryService.TrackStats pipeline that
media tracks and signal channels feed.

Wire DataTrack (UPSTREAM, publisher-home) and DataDownTrack (DOWNSTREAM,
per-subscriber) into BytesTrackStats on the same 5s cadence, mirroring
the media-track convention: subscriber's country and ID with publisher's
track ID for DOWNSTREAM. Cross-region proxy DataTracks leave the stats
pointer nil (no publisher reporter on that node, and relayed bytes would
double-count). Legacy dataTrackStats packet-loss/frame counters are
preserved.
2026-05-23 17:42:55 -07:00
Raja Subramanian 2e22911dcd Remove backwards compatibility support for TURN auth. (#4539)
This was indicated in release v1.12.0 - https://github.com/livekit/livekit/releases/tag/v1.12.0
2026-05-22 17:00:42 +05:30
Raja Subramanian 062d12197f Use NACKQuueInterface type. (#4538)
And some extra logging for subscription permission when it fails.
2026-05-21 23:00:51 +05:30
Paul Wells 7f08b04c1e Add IsIntentionalDisconnect helper (#4537)
Shared helper for callers that need to distinguish intentional/expected
participant closures (client leave, admin action, room teardown, migration)
from connection failures. Extracted from cloud's IsClosedIntentionally
switch so cloud-side code paths can share a single source of truth.
2026-05-20 11:42:51 -07:00
Raja Subramanian 1ab2bf043b Clean up packet size logging (#4536)
Reverting
- https://github.com/livekit/livekit/pull/4521
- https://github.com/livekit/livekit/pull/4525

There are TWCC feedback packets that are larger than MTU. Seems to
happen under a couple of conditions
1. Bad client data, i. e. severely out-of-order packets, bad sequence
   numbers, etc.
2. On an ICE restart - this is rare, but it seemed to be flaky network
   with some packets arriving and some not and causing a lot of gaps.

Either case, not much to do. If fragmentation/re-assembly back to the
publisher works, the feedback will make it through. If not, feedback
will be missed and clients have to work with some missing data, which is
not unexpected and which the protocol is designed to handle.

However, filed pion/interceptor issue just in case - https://github.com/pion/interceptor/issues/416
2026-05-20 23:58:05 +05:30
cnderrauber 8ab92a80f6 Don't require media sections when joining (#4535)
* Don't require media sections when joining

Clients other than browsers (rust/libwebrtc is a known case) can fail
to fire the ontrack event when an extra media section is reused to
subscribe to a track, so disable this feature on the server side and let
the client determine if extra media sections are needed.

* lint
2026-05-20 13:28:51 +08:00
Paul Wells 019a6640ae rtc: report participant kind code and details (#4534)
* rtc: report participant kind code and details

Plumb ParticipantKind and KindDetails through MediaTrack and
BytesTrackStats so track-level reporting can record the numeric kind
code plus details codes on every participant_session aggregation,
alongside the existing Kind string. Also picks up the new kind fields
on resolved BytesSignalStats participants.

Adds deployment/agentID/version to the agent worker logger.
2026-05-18 23:20:52 -07:00
He Chen 77595d387a TEL-336: fix sip error categorization (#4528) 2026-05-18 15:44:44 -07:00
cnderrauber f303f499ef Always enable rtx codec (#4533)
The SFU falls back to retransmitting packets on the media stream SSRC if
RTX is not negotiated (the client doesn't have it), so we should not
disable RTX explicitly (by codec config).

Fix #4519
2026-05-18 15:51:10 +08:00
Raja Subramanian e4a8a55c4b Check Less and LessEq in version compare. (#4532)
* Check Less and LessEq in version compare.

Thank you @cnderrauber for catching this.

* add test
2026-05-18 12:38:49 +05:30
Raja Subramanian 4a7b1e8587 Create NACK tracker only once. (#4527)
Not a major issue, but just avoiding duplicate creation of the NACK module.
RTCP feedback of `nack` and `nack pli` both end up getting treated as `nack`,
which was creating it twice.
2026-05-15 12:45:51 +05:30
cnderrauber 89faaeba82 Apply ttl check only when authenticate allocation creating (#4526)
* Apply ttl check only when authenticate allocation creating

The TTL check added in security enhancement #4505 could reject
allocation/permission refreshes, causing long-lived sessions to
disconnect when the TURN credential expired.
Only check the TTL on allocation creation to prevent abuse of leaked
credentials while keeping long-lived sessions working.
2026-05-15 14:55:05 +08:00
Raja Subramanian b32933b0d4 Log details of RTCP packets. (#4525)
* Log details of RTCP packets.

Seeing large (> MTU) packets on publisher peer connection RTCP. The
four types there are
- RTCP Receiver Reports
- NACK
- TWCC
- PLI

Can't think of what would be blowing up in size.

RTCP Receiver Report and PLI are fixed in size

NACKs vary, but the limit is 100 NACKs which should fit in 400 bytes
even if all of them are spread apart in the sequence number space.

TWCC varies, but a feedback packet is sent every 100ms or when it holds
100 packets. So, that also should not be too big.

Logging packet details to understand this better.

* revert debug
2026-05-14 18:55:00 +05:30
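
The NACK size estimate in b32933b0d4 can be checked with a little arithmetic. Per RFC 4585, a generic NACK carries a 12-byte header (4-byte common header plus sender and media SSRCs) followed by 4-byte FCI entries, each holding a 16-bit packet ID and a 16-bit bitmask covering the next 16 sequence numbers. The sketch below counts entries for a sorted list of lost sequence numbers; `nackPairs`, `nackPacketSize`, and `spread` are hypothetical helpers, and the 100-NACK limit is taken from the commit message.

```go
package main

import "fmt"

const (
	nackHeaderSize = 12 // RTCP common header (4) + sender SSRC (4) + media SSRC (4)
	nackPairSize   = 4  // 16-bit PID + 16-bit BLP
)

// nackPairs counts the FCI entries needed for lost sequence numbers given
// in ascending order. One entry covers its PID plus the 16 that follow.
func nackPairs(lost []uint16) int {
	pairs := 0
	var pid uint16
	for i, sn := range lost {
		// uint16 subtraction wraps, so this also works across rollover.
		if i == 0 || sn-pid > 16 {
			pairs++
			pid = sn
		}
	}
	return pairs
}

// nackPacketSize returns the size in bytes of one generic NACK packet.
func nackPacketSize(pairs int) int {
	return nackHeaderSize + pairs*nackPairSize
}

// spread builds n sequence numbers starting at 0 and spaced gap apart.
func spread(n, gap int) []uint16 {
	out := make([]uint16, n)
	for i := range out {
		out[i] = uint16(i * gap)
	}
	return out
}

func main() {
	// Worst case: 100 losses, each too far apart to share a bitmask.
	fmt.Println(nackPacketSize(nackPairs(spread(100, 17)))) // prints 412
	// Best case: 100 consecutive losses pack into 6 entries.
	fmt.Println(nackPacketSize(nackPairs(spread(100, 1)))) // prints 36
}
```

So the 400-byte figure corresponds to the FCI payload in the worst case; with the header the packet is 412 bytes, still well under a typical MTU.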