Engineering

John Nastos, Fernando Barbat

·

Mar 5, 2024

Making Illegible, Slow WebRTC Screenshare Legible and Fast

Pixelated cursors gradually becoming less pixelated

An app with a shared control experience needs to have incredibly sharp video and <100ms latency to feel good. When Multi launched in August of last year, our shared control implementation was good, but it was—waves hands—"noticeably" slow.

(This is Part 2/2 of a series on control latency. Part 1 covers how we measure and analyze latency.)

Legible, slow

We were already using a P2P connection to send control events. Some research and testing revealed that if we could cut out Zoom’s servers entirely by using the P2P connection for video frames as well, we could cut the latency in half or more.

A diagram showing where the server interacts with clients

Rather than reinvent the wheel, the obvious solution was to use WebRTC. After a relatively long build phase (WebRTC is… not trivial to implement), we excitedly started up our test apps and did our first few runs with what we were sure would be our new low-latency P2P solution. Our result:

Illegible, slow

So, how did we miss the mark by so much? It turns out that WebRTC, by default, is not configured for highly legible, low-latency screenshare. It required some tweaks (sometimes deep inside the internals of the WebRTC source) to achieve the results we were looking for.

Let’s take a look at some of the factors that affect the resolution and speed of video in WebRTC, and what we had to manipulate to make our way towards the upper right corner of our legibility/speed matrix.

Sum of smaller changes making screenshare legible and fast

Bandwidth

If we had unlimited bandwidth, we could simply deliver every frame in its full, uncompressed state, at the speed of light. That image would appear on the remote machine just as it did on the local machine. But, in the real world, we always have some constraint on bandwidth, so we’re forced to try to represent as high quality an image as possible within the bandwidth that we have available.

From RGB to YUV: The Initial Resolution Compromise

Usually the first bandwidth/resolution compromise comes in the image format used for screen sharing. Although it is common to deal with images programmatically in RGB or RGBA formats, where each pixel is represented by 3 or 4 bytes respectively, video applications typically use YUV with chroma subsampling instead.

YUV separates the luminance (Y) from the chrominance (U and V) components of the image. In I420, the most common WebRTC format, the Y channel is represented by 1 byte per pixel, just like each channel in RGBA, but the U and V channels are subsampled: together they take just 2 bytes (one U, one V) for every 2×2 block of pixels.

Since the human eye is more sensitive to brightness than to color, this separation allows video systems to transmit and store chrominance at a lower resolution than luminance, reducing the amount of data needed without affecting the perceived resolution of the image much.
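
To put numbers on it, here's a quick back-of-the-envelope sketch in C++ (the 2560×1440 capture size is just an illustrative choice, not a Multi default):

#include <cstddef>

// Bytes per raw frame for an illustrative 2560x1440 capture.
constexpr int kWidth = 2560;
constexpr int kHeight = 1440;
constexpr std::size_t kPixels = std::size_t{kWidth} * kHeight;
constexpr std::size_t kRgbaBytes = kPixels * 4;      // 14,745,600 (~14.7 MB)
constexpr std::size_t kI420Bytes = kPixels * 3 / 2;  // 5,529,600 (~5.5 MB)
// I420 carries 1.5 bytes per pixel vs. RGBA's 4: a 62.5% reduction
// before the encoder has done any work at all.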

Frame Rate Reduction: A Double-Edged Sword

Another intuitive approach to saving bandwidth is reducing the frame rate (FPS). At first glance, this seems straightforward: fewer frames per second means less data to transmit. However, the impact of FPS on video quality is not so simple. Lowering the frame rate can save bandwidth, but it can also introduce undesirable effects, particularly on the encoder's side. These effects range from perceptible choppiness in fast-moving scenes to potential complications in encoding efficiency. The choice of frame rate is a balancing act between bandwidth savings and maintaining a smooth video experience.

The frame rate of the stream also has latency implications. If the remote user is just viewing the stream, a small change in latency might not be significant. But in scenarios where the remote user is interacting with the stream (such as Multi’s shared control feature), it’s important to shave off every bit of latency we can to make the experience feel as good as possible.

For example, let’s consider the case of a relatively low 10 FPS stream. The frames get captured like this, where F represents a frame capture and - represents 10ms.

[F ---------- F ---------- F ---------- F]

When a shared control command is executed and its result is displayed on the screen, latency accrues while we wait for the next frame capture. Here we show two events, E1 and E2, that happen to occur right after and right before a frame capture, respectively. We can see that latency at this step of the capture pipeline varies from ~100ms down to ~0ms.

[F E1 ---------- F ---------- E2 F]

As the frame rate increases, the interframe delay shrinks, reducing the maximum latency at this step. At 30 FPS, E3 has a max delay of ~33ms.

[F E3 --- F]

However, we can’t just crank the FPS up to 100, because we’d run into bandwidth and CPU constraints.
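
The arithmetic behind those diagrams is worth making explicit; a trivial helper (purely illustrative) captures it:

// Worst case, a screen change lands just after a capture, so it waits a
// full frame interval before it is even encoded.
double MaxCaptureDelayMs(double fps) { return 1000.0 / fps; }
// MaxCaptureDelayMs(10) == 100.0; MaxCaptureDelayMs(30) ~= 33.3;
// MaxCaptureDelayMs(100) == 10.0, but see the bandwidth/CPU caveat above.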

The Role of the Encoder

Encoders play a pivotal role in the video compression process. Their job is to transform raw video data (in our case 420i data, as discussed above) into a compressed format that's easier to transmit over networks. Here's a brief overview of what encoders achieve:

  • Compression: Reducing the size of video data while trying to retain as much of the original quality as possible.

  • Adaptation: Adjusting the video stream in real-time to suit the available network conditions and device capabilities.

  • Efficiency: Maximizing the quality of video within the available bandwidth.

In the context of WebRTC, VP8, H.264, and VP9 are the most common codecs. Each has its strengths and trade-offs in terms of compression efficiency, licensing requirements, and computational demand.

Our out-of-the-box H.264 and VP9 experience

The encoder can have dramatic effects on both quality and latency. In our initial experiments with WebRTC at Multi, we tried H.264 as our encoder, hoping that we would find the out-of-the-box settings ideal for our high-resolution/low-latency requirements. It seemed like a good choice for Apple platforms, where H.264 has good software and hardware support. However, with non-customized settings, it didn’t offer the image quality or low latency that we needed for our screen sharing feature.

VP9 fared better out of the box, especially with latency. But, like H.264, it seemed too quick to significantly degrade the image quality, which was not a good compromise to make for our users’ screen sharing experience.

Diving Deeper: Encoder Settings

Our next step was to alter the behavior of the encoder beyond the out-of-the-box experience. When configuring an encoder, several settings directly impact video quality and bandwidth consumption. Let's explore these:

Frame Rate (FPS) Revisited

Increased FPS (to a point) improves legibility and speed

As mentioned, FPS is a critical setting. While reducing it can save bandwidth, the choice of frame rate should consider the content's nature and the application's requirements to avoid negatively impacting the viewer's experience.

In our application, we limit the FPS to 30 and set the degradationPreference to maintainResolution — we always prefer resolution over FPS.
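
In libwebrtc’s native API, that configuration looks roughly like this (a sketch, assuming `sender` is the webrtc::RtpSenderInterface for the screenshare track; degradationPreference/maintainResolution above are the JavaScript-style spellings of the same knobs):

#include "api/rtp_parameters.h"
#include "api/rtp_sender_interface.h"
#include "api/scoped_refptr.h"

void ConfigureScreenshareSender(
    rtc::scoped_refptr<webrtc::RtpSenderInterface> sender) {
  webrtc::RtpParameters params = sender->GetParameters();
  // Under constrained bandwidth/CPU, drop frames before dropping resolution.
  params.degradation_preference =
      webrtc::DegradationPreference::MAINTAIN_RESOLUTION;
  for (webrtc::RtpEncodingParameters& encoding : params.encodings) {
    encoding.max_framerate = 30;  // cap the stream at 30 FPS
  }
  sender->SetParameters(params);
}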

Bitrate

Increased bitrate improves legibility

The bitrate of the stream is what ultimately determines bandwidth. It is directly tied to other properties, such as FPS or QP. For example, a low-FPS stream will have a lower bitrate than a high-FPS stream if all other variables (resolution, compression, etc.) are the same.

Encoders have target, minimum, and maximum bitrate settings that may, in turn, affect those properties. For example, if a stream has a maximum bitrate of 2610 kbps, the encoder may decide to drop frames in order to not exceed this rate. There may also be “overshoot” or “undershoot” parameters that determine how much a stream should temporarily adjust the target bitrate to accommodate additional complexity in the stream.

In Multi’s case, in order to conserve bandwidth, we adjusted the default “undershoot” of the bitrate in order to allow the stream to go under the target bitrate in situations in which the frames did not have much movement.
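
For VP9, these rate-control knobs ultimately land on libvpx’s encoder configuration. A rough sketch of the relevant fields (the values here are illustrative, not the exact ones Multi ships):

#include <vpx/vp8cx.h>
#include <vpx/vpx_encoder.h>

vpx_codec_enc_cfg_t ScreenshareRateControlConfig() {
  vpx_codec_enc_cfg_t cfg;
  vpx_codec_enc_config_default(vpx_codec_vp9_cx(), &cfg, /*usage=*/0);
  cfg.rc_target_bitrate = 2500;  // target bitrate, in kbps
  cfg.rc_undershoot_pct = 25;    // how far below target the rate may drift
  cfg.rc_overshoot_pct = 25;     // how much headroom complex frames get
  return cfg;
}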

Layers

Some encoders, including VP9, support layering techniques that allow for adaptive streaming. By encoding a video into multiple layers of quality and resolution, a stream can dynamically adjust to varying network conditions, delivering the best possible quality at any given moment. For example, a client with constrained bandwidth can use the lower quality layer while a client with a great connection can get the high-quality data, all from the same stream.

In Multi’s screen sharing use case, encoding in multiple layers was not something we wanted. If a user’s bandwidth is constrained, we’d rather show fewer frames (lower FPS) at high quality than maintain FPS and have each frame arrive at a lower perceived resolution. This is because a Multi user is often sharing text, which quickly becomes unreadable as image quality degrades.
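
For reference, one way to ask libwebrtc for a single layer on recent builds is the scalability mode on the encoding parameters (a sketch; the post doesn’t specify exactly how Multi disables layering, and the field only exists on newer libwebrtc checkouts):

#include "api/rtp_parameters.h"
#include "api/rtp_sender_interface.h"
#include "api/scoped_refptr.h"

void DisableLayering(rtc::scoped_refptr<webrtc::RtpSenderInterface> sender) {
  webrtc::RtpParameters params = sender->GetParameters();
  for (webrtc::RtpEncodingParameters& encoding : params.encodings) {
    encoding.scalability_mode = "L1T1";  // one spatial layer, one temporal layer
  }
  sender->SetParameters(params);
}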

Jitter

Lowered jitter buffer sped up screenshare

As we experimented with encoder settings, we were surprised at how high the machine-to-machine latency was for a P2P connection. We figured we must be missing something significant, but in the intimidating mountain of WebRTC configuration parameters, we hadn’t found anything that could get us down to the sub-100ms numbers we were looking for. Eventually, we stumbled upon the jitter and playout delay settings. Vastly simplified, these settings are used by the video receiver to try to smooth out rendering across received frames. This means that the receiver may not render a frame immediately if it believes that delaying that frame would make for a “smoother” experience.

By making the following one-line change inside the WebRTC source, we essentially disabled this feature, causing the receiver to render each frame as soon as possible, without the jitter buffer.

- max_playout_delay_(TimeDelta::Seconds(10))
+ max_playout_delay_(TimeDelta::Zero())

At p50, we saw a ~90ms reduction in latency — the biggest change we had made since moving from a server-based solution to a P2P one!

Quantization Parameter (QP)

Lowered quantization increased legibility at the expense of latency

QP stands for Quantization Parameter, a crucial setting in video encoding that directly influences compression level and, by extension, video quality and bandwidth usage. A lower QP means less compression, resulting in higher quality and higher bandwidth usage. Conversely, a higher QP increases compression, which can save bandwidth at the cost of quality.

The side effects of adjusting QP are significant. A higher QP can lead to a lower perceived resolution due to increased compression artifacts, while a lower QP, although offering better quality, demands more bandwidth. The default settings provided by encoders often aim for a middle ground, but in the case of Multi, we found that the encoder was far too willing to produce low-quality images, meaning the max QP number was much too high.

By default, the WebRTC VP9 encoder uses min and max QP values of 8 and 52 while screen sharing. We found that we got much clearer results by changing these to 4 and 36, respectively. (As noted in the Bitrate section, we also adjusted the related “undershoot percentage” parameter.)
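
In libvpx terms, those bounds correspond to the quantizer fields on the encoder config (a sketch; inside libwebrtc it is the VP9 encoder wrapper that actually populates these):

#include <vpx/vpx_encoder.h>

void ClampScreenshareQp(vpx_codec_enc_cfg_t* cfg) {
  cfg->rc_min_quantizer = 4;   // WebRTC's screenshare default is 8
  cfg->rc_max_quantizer = 36;  // WebRTC's screenshare default is 52
}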

In Multi, adjusting QP turned out to be the most significant factor in the legibility of our screen share stream.

Conclusion

After adjusting our encoder settings, we finally reached the upper right of our matrix:

Sum of smaller changes making screenshare legible and fast

Our shared control streams are now 2.3x faster, while remaining sharp and highly legible. And we have plans to increase our resolution even further in the future — stay tuned!

If you’re also using WebRTC, we’d love to hear your experiences. Have you reached the same conclusions we have? Adjusted different parameters? Drop us a line at (john|fernando) @ multi.app or (@jnpdx|@fbarbat) on Twitter.

Lastly, if you're interested in more, check out Part 1 of this blog series, which covers how we instrumented and analyzed latency so that we could rapidly iterate on the above changes, despite tremendous noise in the data. Or test it out by signing up for Multi!

© Multi Software Co. 2024