1537 Commits

Author SHA1 Message Date
Raja Subramanian 23090163ce Configurable migration wait duration for longer waits in simulation. (#4624)
Only applies if it is more than the default 3 seconds.
2026-06-26 20:01:32 +05:30
Raja Subramanian 1faab0c48e Add support for data blob (a.k.a. async participant attributes) (#4619)
* Async attributes on participant.

How is it different from existing participant attributes?
1. Async attributes can be added one at a time.
2. These are not included in `ParticipantInfo`.
3. Get an attribute by participant identity and async attribute ID as
   and when needed.

* clean up

* get full definitions, not just ids

* listener OnDataTrackSchema

* name length config

* data blob

* deps

* static check

* Add missing request ID

* Update protocol commit

* Wire up StoreDataBlobResponse

* Pass request ID through in GetDataBlobResponse

* deps

* atomic

* sctp at 1.9.5

* remove proto clone

---------

Co-authored-by: Jacob Gelman <3182119+ladvoc@users.noreply.github.com>
2026-06-24 14:42:37 +05:30
laosun 6658dd5454 Echo offered audio payload types in single-PC subscriber answer (#4614)
In single peer connection mode, when the server answers a subscriber's
offer, configureSenderAudio set the sender codec preferences from the
server MediaEngine's payload types. The answer could therefore advertise
Opus on a payload type the offerer never offered (server PT 111 vs
offered PT 109). Chrome tolerates this; Firefox decodes 0 samples
(silence) -- packets are received but never decoded. The forwarded RTP
already uses the offered PT, so only the answer SDP was inconsistent.
This regressed in v1.12.0 once the single-PC MediaEngine became a union
of publish+subscribe codecs.

Parse the remote offer's audio rtpmap and remap the sender audio codec
preferences to echo the offered payload types (RFC 3264 6.1) before
SetCodecPreferences.

Fixes #4599

Co-authored-by: laosun <14806343+cnvipstar@users.noreply.github.com>
2026-06-23 10:36:41 +05:30
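
The payload-type echo described in #4614 above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from the PR: `offeredAudioPayloadTypes` and `echoPayloadType` are hypothetical names, and the real change works on sender codec preferences before SetCodecPreferences rather than on bare payload-type numbers.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// offeredAudioPayloadTypes scans an SDP offer and returns, for each rtpmap
// entry found in an audio media section, the lower-cased codec name mapped to
// the payload type the offerer assigned to it. (Hypothetical helper.)
func offeredAudioPayloadTypes(sdp string) map[string]uint8 {
	pts := make(map[string]uint8)
	inAudio := false
	for _, line := range strings.Split(sdp, "\n") {
		line = strings.TrimSpace(line) // also drops the trailing "\r"
		if strings.HasPrefix(line, "m=") {
			inAudio = strings.HasPrefix(line, "m=audio ")
			continue
		}
		if !inAudio {
			continue
		}
		rest, ok := strings.CutPrefix(line, "a=rtpmap:")
		if !ok {
			continue
		}
		// rest looks like "109 opus/48000/2".
		ptStr, enc, ok := strings.Cut(rest, " ")
		if !ok {
			continue
		}
		pt, err := strconv.ParseUint(ptStr, 10, 8)
		if err != nil {
			continue
		}
		name, _, _ := strings.Cut(enc, "/")
		name = strings.ToLower(name)
		if _, seen := pts[name]; !seen {
			pts[name] = uint8(pt) // keep the first payload type offered for a codec
		}
	}
	return pts
}

// echoPayloadType returns the payload type the offerer used for codec, falling
// back to the server's own payload type when the codec was not offered.
func echoPayloadType(offered map[string]uint8, codec string, serverPT uint8) uint8 {
	if pt, ok := offered[strings.ToLower(codec)]; ok {
		return pt
	}
	return serverPT
}

func main() {
	offer := "v=0\r\n" +
		"m=audio 9 UDP/TLS/RTP/SAVPF 109\r\n" +
		"a=rtpmap:109 opus/48000/2\r\n" +
		"m=video 9 UDP/TLS/RTP/SAVPF 120\r\n" +
		"a=rtpmap:120 VP8/90000\r\n"
	offered := offeredAudioPayloadTypes(offer)
	// The server's default for Opus is 111, but the offer used 109.
	fmt.Println(echoPayloadType(offered, "opus", 111)) // 109
}
```

With the offered payload type echoed, the answer SDP agrees with the forwarded RTP, which the commit message says already uses the offered PT.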
Raja Subramanian 4facbc582a Move lock to addPendingTrack function. (#4617)
Wrapping the function with the lock outside in the only invocation was
not needed.
2026-06-23 10:30:59 +05:30
Ryan Gaus 86a79f83fc fix: report participant capabilities in ParticipantInfo (#4606) 2026-06-22 09:23:33 -04:00
Raja Subramanian f7085535da Tighten up publish latency stat. (#4615)
Previously it was anchored to the participant transitioning to `ACTIVE` if
the add track request happened before that. But that has a few issues:
1. `ACTIVE` is for the primary peer connection, which could be the subscriber
   peer connection.
2. `ACTIVE` also includes data channel establishment.

Switch to first connected time of publisher peer connection for that to
get a more accurate measure of track publish time.
2026-06-22 17:36:06 +05:30
CloudWebRTC cfedcc71d0 feat: acquire requested video layer directly at HIGH quality by default (#4595)
* feat: acquire requested video layer directly at HIGH quality by default

Two changes that together remove the visible low->high quality ramp for a new
subscriber (both publisher-first and subscriber-first join orders):

1. Default a subscriber's initial video quality to HIGH on bind instead of LOW
   for adaptive stream, so the subscribed max layer is the top layer. Adaptive
   stream clients can still scale down afterwards based on viewport.

2. On initial layer acquisition the forwarder/selector latch directly onto the
   allocator's target (the requested top layer) instead of opportunistically
   latching onto the first lower key frame that arrives. A short
   initial-acquisition grace aims the target at the requested layer; if it does
   not show up in time, the target falls back to the highest layer seen so
   acquisition never stalls.

Always on - no configuration flag.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat: gate start-at-desired-quality behind EnableStartAtDesiredQuality flag

Put the "acquire requested video layer directly at HIGH quality" behavior
behind a per-subscriber EnableStartAtDesiredQuality flag (default off, so
the original low->high ramp-up is restored unless enabled).

Plumbed from config.RTC.EnableStartAtDesiredQuality through ParticipantParams
-> SubscribedTrack/DownTrack -> Forwarder -> simulcast selector, gating all
three behavior changes: the HIGH default on bind, the forwarder's
initial-acquisition grace, and the selector's direct-latch-onto-target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* remove config.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-06-19 20:12:54 +08:00
Raja Subramanian a011d995da Do not call nil callback (#4607) 2026-06-18 23:24:33 +05:30
Raja Subramanian e7c63aa537 Log subscription limit breaches (#4603) 2026-06-18 00:08:39 +05:30
Raja Subramanian 67ca7a12cf Record more RTC cancellation points. (#4600)
There are several places the participant can drop off after initiating a
connection attempt. Count those places as cancellation including when
participant is closed due to specific reasons.

Cancels should be discounted when determining RTC/ICE connectivity
success/failure percentage.
2026-06-17 20:43:29 +05:30
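
The discounting rule from #4600 above amounts to removing cancelled attempts from the denominator. A minimal sketch of that arithmetic, assuming simple attempt/success/cancel counters (`connectivitySuccessPercent` is a hypothetical name, not something from the PR):

```go
package main

import "fmt"

// connectivitySuccessPercent returns successes as a percentage of the attempts
// that were not cancelled. ok is false when no attempt is left to count.
func connectivitySuccessPercent(attempts, successes, cancels uint64) (pct float64, ok bool) {
	if cancels >= attempts {
		return 0, false
	}
	return 100 * float64(successes) / float64(attempts-cancels), true
}

func main() {
	// 100 attempts: 90 connected, 5 cancelled by the participant, 5 failed.
	pct, ok := connectivitySuccessPercent(100, 90, 5)
	fmt.Printf("%.1f%% %v\n", pct, ok) // 94.7% true
}
```

Without the discount the same counters would report 90%, attributing the five participant-side drop-offs to RTC/ICE failure.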
Paul Wells 12a023ae45 agent: thread attributes map from dispatch to job (#4598)
* agent: thread simulation flag from dispatch to job

Reads simulation from AgentDispatch / RoomAgentDispatch and copies it
onto Job in agent.LaunchJob and the inline room-agent path so workers
see the flag.

Stacked on top of livekit/protocol#1629.

* agent: replace simulation bool with attributes map

Threads the renamed attributes map (was bool simulation) from dispatch
to job and bumps the protocol pseudo-version.

* deps
2026-06-16 01:53:01 -07:00
cnderrauber 9746c9a9d6 Enforce subscription permission to data track (#4588)
* Enforce subscription permission to data track

* use the same revoke path as media track

* nil check
2026-06-12 16:02:12 +08:00
shishirng 08ab361e8e [WIP] rtc: add RestartSessionTimer to re-anchor participant session duration (#4566)
* rtc: add RestartSessionTimer to re-anchor participant session duration

Exposes ParticipantImpl.RestartSessionTimer so the session timer can be
re-anchored to the actual join time. Duration is only ever emitted once
the participant becomes active, so re-anchoring at join keeps pre-join
wall-clock out of the reported/billed duration. Adds the method to the
LocalParticipant interface (fake regenerated) and a local protocol
replace to pick up SessionTimer.Reset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* tidy

* update protocol

* report ended at for inactive sessions

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Paul Wells <paulwe@gmail.com>
2026-06-11 10:02:25 -07:00
cnderrauber 7dc6877738 Preserve original expiry when refreshing token (#4580)
Avoid shortening the token expiration time on refresh, which caused
client reconnects to fail after the network was down for a long
time (>5min).
2026-06-10 14:51:10 +08:00
Raja Subramanian 8d2b827f44 Add prom metrics for peer connection state. (#4574)
* Add prom metrics for peer connection state.

By direction (PUBLISHER vs SUBSCRIBER) and state ("started" ->
"connected"). This gives a way to track peer connections failing to
finish establishment.

The RTC active count can be useful for primary peer connection, but not
for non-primary. This counter can be used to track any and can generally
be used to understand success/failure rate of peer connection
establishment.

* add a couple more states

* clean up and avoid duplicate reporting fully established

* staticcheck
2026-06-09 16:11:03 +05:30
Paul Wells 77ecf920ff rtc: report participant session end time on room move (#4561)
MoveToRoom resets the participant reporter resolver to receive new
(room, participant_session) keys for the destination, but the source
room's participant_session row never gets an end_time — the periodic
duration scrape only emits one once disconnectedAt is set, and a move
doesn't transition the participant to DISCONNECTED. Report end_time
immediately before the reset so the row is closed out cleanly.
2026-06-03 21:35:39 -07:00
cnderrauber 63be96f631 Prevent panic from nil(illegal) syncState.Subscriptions message (#4560) 2026-06-04 10:32:24 +08:00
cnderrauber 356ae211a3 Config documentation for advertise_internal_ip and skip_external_ip_validation (#4552)
See https://github.com/livekit/mediatransportutil/pull/88
2026-06-01 14:37:08 +08:00
shishirng 7c319a67d4 rtc: prevent duration reporting for inactive participants (#4550)
Added a check to ensure that duration is not published for participants
that never became active.
2026-05-27 14:39:04 -04:00
Paul Wells cde8962709 rtc: emit per-data-track bytes via BytesTrackStats (#4540)
Data tracks (the new _data_track datachannel) previously only updated a
private dataTrackStats that logged a single summary at Close. Bytes never
reached the OnTrackStats -> TelemetryService.TrackStats pipeline that
media tracks and signal channels feed.

Wire DataTrack (UPSTREAM, publisher-home) and DataDownTrack (DOWNSTREAM,
per-subscriber) into BytesTrackStats on the same 5s cadence, mirroring
the media-track convention: subscriber's country and ID with publisher's
track ID for DOWNSTREAM. Cross-region proxy DataTracks leave the stats
pointer nil (no publisher reporter on that node, and relayed bytes would
double-count). Legacy dataTrackStats packet-loss/frame counters are
preserved.
2026-05-23 17:42:55 -07:00
Raja Subramanian 062d12197f Use NACKQuueInterface type. (#4538)
And some extra logging for subscription permission when it fails.
2026-05-21 23:00:51 +05:30
Paul Wells 7f08b04c1e Add IsIntentionalDisconnect helper (#4537)
Shared helper for callers that need to distinguish intentional/expected
participant closures (client leave, admin action, room teardown, migration)
from connection failures. Extracted from cloud's IsClosedIntentionally
switch so cloud-side code paths can share a single source of truth.
2026-05-20 11:42:51 -07:00
Raja Subramanian 1ab2bf043b Clean up packet size logging (#4536)
Reverting
- https://github.com/livekit/livekit/pull/4521
- https://github.com/livekit/livekit/pull/4525

There are TWCC feedback packets that are larger than MTU. Seems to
happen under a couple of conditions
1. Bad client data, i.e. severely out-of-order packets, bad sequence
   numbers, etc.
2. On an ICE restart - this is rare, but it seemed to be flaky network
   with some packets arriving and some not and causing a lot of gaps.

In either case, not much to do. If fragmentation/re-assembly back to
publisher works, the feedback will make it through. If not, feedbacks
will be missed and clients have to work with some missing data which is
not unexpected and the protocol is designed to handle.

However, filed pion/interceptor issue just in case - https://github.com/pion/interceptor/issues/416
2026-05-20 23:58:05 +05:30
cnderrauber 8ab92a80f6 Don't require media sections when joining (#4535)
* Don't require media sections when joining

Clients other than browsers (rust/libwebrtc is a known case) can fail
to fire the ontrack event when an extra media section is reused to
subscribe to a track, so disable this feature on the server side and let
the client determine if extra media sections are needed.

* lint
2026-05-20 13:28:51 +08:00
Paul Wells 019a6640ae rtc: report participant kind code and details (#4534)
* rtc: report participant kind code and details

Plumb ParticipantKind and KindDetails through MediaTrack and
BytesTrackStats so track-level reporting can record the numeric kind
code plus details codes on every participant_session aggregation,
alongside the existing Kind string. Also picks up the new kind fields
on resolved BytesSignalStats participants.

Adds deployment/agentID/version to the agent worker logger.
2026-05-18 23:20:52 -07:00
Raja Subramanian b32933b0d4 Log details of RTCP packets. (#4525)
* Log details of RTCP packets.

Seeing large (> MTU) packets on publisher peer connection RTCP. The
four types there are
- RTCP Receiver Reports
- NACK
- TWCC
- PLI

Can't think of what would be blowing up in size.

RTCP Receiver Report and PLI are fixed in size

NACKs vary, but the limit is 100 NACKs which should fit in 400 bytes
even if all of them are spread apart in the sequence number space.

TWCC varies, but a feedback packet is sent every 100ms or when it holds
100 packets. So, that also should not be too big.

Logging packet details to understand this better.

* revert debug
2026-05-14 18:55:00 +05:30
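
The NACK size estimate in #4525 above can be checked with a little arithmetic. In the generic NACK layout of RFC 4585, each feedback entry is 4 bytes (a 16-bit packet ID plus a 16-bit bitmask covering the next 16 sequence numbers) and the packet carries a 12-byte header, so the 400-byte figure corresponds to the entries alone. The sketch below is illustrative only; `nackPacketSize` is a hypothetical helper and assumes the lost sequence numbers are given in ascending order.

```go
package main

import "fmt"

const (
	nackHeaderBytes = 12 // RTCP header (4) + sender SSRC (4) + media SSRC (4)
	nackPairBytes   = 4  // packet ID (2) + bitmask of following lost packets (2)
)

// nackPacketSize returns the size in bytes of one generic NACK packet that
// reports the given lost sequence numbers (ascending, modulo 2^16).
func nackPacketSize(lost []uint16) int {
	if len(lost) == 0 {
		return 0
	}
	pairs := 0
	var pid uint16
	for i, sn := range lost {
		// A pair covers its packet ID and the 16 sequence numbers after it.
		if i == 0 || sn-pid > 16 {
			pid = sn
			pairs++
		}
	}
	return nackHeaderBytes + pairs*nackPairBytes
}

func main() {
	spread := make([]uint16, 100)     // worst case: no two losses share a pair
	contiguous := make([]uint16, 100) // best case: one long burst
	for i := range spread {
		spread[i] = uint16(i * 17)
		contiguous[i] = uint16(i)
	}
	fmt.Println(nackPacketSize(spread))     // 412
	fmt.Println(nackPacketSize(contiguous)) // 36
}
```

Either way the packet stays well under a typical MTU, which is consistent with the commit's conclusion that NACKs are unlikely to be the oversized packets.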
Raja Subramanian ef2e5efe14 Log large packets receive/send. (#4521)
* Log large packets receive/send.

Seeing cases of servers reporting need for segmentation/re-assembly of
packets. So, logging packet receive/send for RTP/RTCP to check if
anything is seeing more than 1400 byte packets.

* log downtrack RTCP too
2026-05-13 16:04:53 +05:30
Raja Subramanian 20d4a3a168 Populate data track loggers with context (#4514) 2026-05-09 10:14:48 +05:30
Paul Wells 803999efad rename agent environment to deployment (#4506)
* rename agent environment to deployment

* deps
2026-05-05 14:19:40 -07:00
Paul Wells 253f977d32 add duration seconds reporting (#4500)
* add duration seconds reporting

* deps

* deps
2026-05-02 06:19:23 -07:00
Paul Wells ffab3bd308 add agent environment (#4498)
* add agent environment

* lint

* psrpc error

* deps
2026-05-01 19:30:06 -07:00
Raja Subramanian ccdf23c8a6 Use mediatransportutil/codec package, no functional change (#4497) 2026-05-01 20:06:29 +05:30
olafal0 f51798bcf6 Fix publish-only limitations being incorrectly applied to receivers (#4495)
* Fix publish-only limitations being incorrectly applied receive-side in a single PC

* `StaticConfigurations` disabled some codecs for publish only, which worked in dual PC
* In single PC, the server incorrectly disabled these codecs in both directions
* Dual PC mode is unchanged; single PC handles per-direction filtering correctly

* Filter recv-side codecs to publish list in single-PC SDP answer

* Confirm H264 is present in offer in test
2026-04-30 18:49:34 +05:30
Raja Subramanian a002337db1 Legacy TrackInfo.Simulcast flag. (#4493)
* Legacy TrackInfo.Simulcast flag.

When AddTrack did not send SimulcastCodecs, the legacy `Simulcast` flag
was not set. Fix it by setting the flag when a second layer is
published.

* staticcheck

* use the existing PrimaryReceiver function
2026-04-29 22:43:33 +05:30
Paul Wells d7c2daf1ac report all simulcast layers (#4491) 2026-04-28 10:45:32 -07:00
Jacob Gelman 19b9e8c00a Additional data tracks logging (#4489)
* Additional data track logging

* Track total bytes published

* Rename field
2026-04-28 21:26:07 +09:00
David Chen 743d9c8b3a add support for client capabilities (#4461)
* update protocol version

* only check for client capability to strip packet trailer
2026-04-27 17:58:36 -07:00
Raja Subramanian fc47e47866 Close peer connection unconditionally to unblock set local/remote (#4485)
* Close peer connection unconditionally to unblock set local/remote
description operations.

Have been chasing a leak where participants have a lot of connectivity
issues and analysed a goref with Claude. Output below.

Jo Turk quickly patched sctp for reported issue -
https://github.com/pion/sctp/pull/465.

This PR moves the peer connection close to before waiting for events
queue to be drained as event queue could be blocked on
`SetLocal/RemoteDescription` hanging.

The scenario is a bit far-fetched as a lot of things have to happen, but
it does point to a scenario where things could hang. Remains to be seen
if this helps. Note that closing the peer connection early could mean
the contained objects (like data channels) could all be closed as part
of the peer connection close. But, still keeping the explicit clean up
path (which should effectively become no-op) to minimise changes.

------------------------------------------------------------------

The wedge is in pion/sctp's blocking-write gate, called synchronously from inside the PC's operations queue. Five things have to be true at the same time, and on this build they all are:

  1. SCTPTransport.Start is synchronous in the SetRemoteDescription op

  The stuck stack:
  PeerConnection.SetRemoteDescription.func2  (peerconnection.go:1363)
    → startRTP → startSCTP
      → SCTPTransport.Start         (sctptransport.go:141)
        → DataChannel.open          (datachannel.go:178)
          → datachannel.Dial → Client → Stream.WriteSCTP
            → Association.sendPayloadData    (association.go:3141)  ← blocks here
  SCTPTransport.Start synchronously sends the DCEP "OPEN" for each pre-negotiated channel. The operations.start goroutine runs SetRemoteDescription's logic; it does not return until Start does.

  2. The wait has no deadline

  Stream.WriteSCTP (stream.go:289) calls sendPayloadData(s.writeDeadline, ...). s.writeDeadline is the default zero-value deadline.Deadline, never armed, because DataChannel.Dial doesn't call Stream.SetWriteDeadline. So the <-ctx.Done() arm of the wait select can never fire.

  3. EnableDataChannelBlockWrite(true) puts SCTP into a serialized-write gate

  At livekit-server/pkg/rtc/transport.go:362 livekit calls se.EnableDataChannelBlockWrite(true). That flips the sendPayloadData path to:
  // association.go:3138-3148
  if a.blockWrite {
      for a.writePending {
          a.lock.Unlock()
          select {
          case <-ctx.Done():        // never (no deadline)
          case <-a.writeNotify:     // only fires when writeLoop fully drains pendingQueue
          }
          a.lock.Lock()
      }
      a.writePending = true
  }

  4. writeNotify only fires after the writeLoop drains everything

  The only place notifyBlockWritable is called is gatherOutbound (association.go:3085-3088), and only when len(chunks) > 0 && a.pendingQueue.size() == 0, i.e., the writeLoop actually managed to move all pending chunks to inflight. If cwnd is full and SACKs stop arriving, the writeLoop wakes up, sees zero room, sends nothing, and writePending stays true.

  5. There is no association-level abort timer for data writes

  At association.go:764:
  assoc.t3RTX = newRTXTimer(timerT3RTX, assoc, noMaxRetrans, rtoMax)
  noMaxRetrans means the retransmission timer never gives up. INIT has maxInitRetrans, but data does not. There is no equivalent of TCP's tcp_retries2 → ETIMEDOUT → ABORT. So once the path is dead post-handshake, t3RTX keeps firing into the void and the association never transitions out of established on its own.

  What it takes to wake it up

  Only an external close: somebody has to terminate the underlying DTLS conn (which makes Association.readLoop's netConn.Read fail, which closes closeWriteLoopCh, which lets timerLoop exit). But, and this is the kicker, readLoop's defer at association.go:976-996 closes everything except it does not call notifyBlockWritable. So even if readLoop unwinds, any goroutine parked on <-a.writeNotify stays parked unless it was watching ctx (which here it isn't).

  So the trigger sequence on this pod was almost certainly:
  1. Peer establishes ICE+DTLS+SCTP, association goes established.
  2. Peer disappears (ICE silently fails, NAT rebinding, OS sleep, kill -9, etc.).
  3. The first DCEP-OPEN for one of livekit's pre-negotiated channels is queued; cwnd never opens because no SACKs return.
  4. writePending is now true for the lifetime of the process, with no deadline, no ctx, no kill.
  5. The PC's operations queue is wedged, SetRemoteDescription never returns, livekit-server's handleRemoteOfferReceived event handler is parked, the participant is never torn down, and the SCTP timerLoop pins the entire participant graph in memory until OOM-kill.

  Realistic fixes (in order of how clean they are)

  1. Upstream: in pion/sctp, broadcast notifyBlockWritable() (or close writeNotify) inside readLoop's defer cleanup, so a closed association unblocks any pending writers. This is the right fix.
  2. livekit-server: wrap pc.SetRemoteDescription(...) with a timeout, and on timeout call pc.Close() — Close ultimately tears down the DTLS conn, which lets readLoop exit (point 1 still needs to be true for the writer goroutine to actually unblock, though).
  3. Workaround: call stream.SetWriteDeadline(...) on the SCTP stream before issuing the DCEP open, so the ctx arm of the select can fire. Requires reaching past webrtc.DataChannel though.
  4. Heaviest hammer: don't pre-negotiate the data channels inline with SetRemoteDescription — open them lazily after PC reaches connected so a stuck open never blocks signaling.

  Without (1), even (2) leaves the writer goroutine itself parked forever — but at least the PC and its participant-side state would be released; only the SCTP goroutine subtree (much smaller) would leak.

* revert probe stop change

* handle nil offer
2026-04-27 21:38:46 +05:30
Raja Subramanian 3a7f2628b0 Turn off transceiver re-use on Safari. (#4474)
There are issues with insertable streams + Safari which causes tracks to
go missing mid-stream sometimes.
2026-04-23 19:04:10 +05:30
Raja Subramanian 701a37c2d1 Convert sort.Slice -> slices.SortFunc (#4472)
* Convert sort.Slice -> slices.SortFunc

* active speaker loudness in descending order
2026-04-23 15:12:24 +05:30
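
The kind of conversion named in #4472 above looks roughly like the following. The `speaker` type and its fields are hypothetical stand-ins rather than the server's types; `slices.SortFunc` and `cmp.Compare` are from the Go 1.21+ standard library.

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
)

// speaker is a hypothetical stand-in for an active-speaker entry.
type speaker struct {
	Identity string
	Level    float64
}

// sortByLoudnessDesc orders speakers from loudest to quietest.
// The sort.Slice equivalent would be:
//
//	sort.Slice(s, func(i, j int) bool { return s[i].Level > s[j].Level })
func sortByLoudnessDesc(s []speaker) {
	slices.SortFunc(s, func(a, b speaker) int {
		// Swapping the arguments to cmp.Compare yields descending order.
		return cmp.Compare(b.Level, a.Level)
	})
}

func main() {
	speakers := []speaker{{"alice", 0.4}, {"bob", 0.9}, {"carol", 0.1}}
	sortByLoudnessDesc(speakers)
	for _, sp := range speakers {
		fmt.Println(sp.Identity, sp.Level)
	}
}
```

Unlike the `less` callback of `sort.Slice`, the comparison function receives elements rather than indices and returns an int, so getting the argument order right is what determines ascending versus descending.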
Raja Subramanian 31083307ec do not log data track stats if not started (#4468) 2026-04-23 10:46:33 +05:30
Anunay Maheshwari 9ee06635d6 feat(pion/ice): replace deprecated NAT1To1 with SetAddressRewriteRules (#4466)
* feat(pion/ice): replace deprecated NAT1To1 with SetAddressRewriteRules

* update deps
2026-04-22 12:49:36 +05:30
Raja Subramanian dbf5cf6196 Store concrete ICE candidate for remote candidates. (#4458) 2026-04-17 13:14:47 +05:30
Raja Subramanian 3cfb71e7ca Use Muted in TrackInfo to propagate published track muted. (#4453)
* Use Muted in TrackInfo to propagate published track muted.

If the track was already muted when a receiver was created, the receiver
potentially did not get the muted property. That would result in the
quality scorer expecting packets.

Use TrackInfo consistently for mute and apply the mute on start up of a
receiver.

* update mute of subscriptions
2026-04-16 01:03:40 +05:30
Raja Subramanian 69aa94797b Some drive-by clean up (#4452) 2026-04-15 12:23:33 +05:30
Raja Subramanian 6c81f67858 Add subscriber stream start event notification (#4449) 2026-04-14 22:08:31 +05:30
cnderrauber ce1bf47b5c Revert "fix: ensure num_participants is accurate in webhook events (#4265) (#…" (#4448)
This reverts commit cdb0769c38.
2026-04-13 22:21:22 +08:00
Onyeka Obi cdb0769c38 fix: ensure num_participants is accurate in webhook events (#4265) (#4422)
* fix: ensure num_participants is accurate in webhook events (#4265)

  Three fixes for stale/incorrect num_participants in webhook payloads:

  1. Move participant map insertion before MarkDirty in join path so
     updateProto() counts the new participant.
  2. Use fresh room.ToProto() for participant_joined webhook instead of
     a stale snapshot captured at session start.
  3. Remove direct NumParticipants-- in leave path (inconsistent with
     updateProto's IsDependent check), force immediate proto update,
     and wait for completion before triggering onClose callbacks.

* fix: use ToProtoConsistent for webhook events instead of forcing immediate updates
2026-04-13 09:26:14 +08:00
Raja Subramanian c91e79af35 Switch to stdlib maps, slices (#4445)
* Switch to stdlib maps, slices

* slices
2026-04-13 00:11:48 +05:30
David Zhao 4b3856125c chore: pin GH commits and switch to golangci-lint (#4444)
* chore: pin GH commits

* switch to golangci-lint-action

* fix lint issues
2026-04-11 13:04:22 -07:00