PoC 0latency

Romain Vimont requested to merge 0latency into master

0latency

VLC is designed to process packets and frames according to their timestamp (PTS). This implies that it needs to wait a certain duration (until a computed date) before demuxing, decoding and displaying. The purpose is to preserve the interval between frames as much as possible, so as to avoid stuttering when watching a movie, for example.

Real-time mirroring

Before making any change, we must be able to test glass-to-glass latency easily. For that purpose, we can mirror an Android device screen to VLC.

Download the latest server file from scrcpy, plug an Android device, and execute:

adb push scrcpy-server-v1.25 /data/local/tmp/vlc-scrcpy-server.jar
adb forward tcp:1234 localabstract:scrcpy
adb shell CLASSPATH=/data/local/tmp/vlc-scrcpy-server.jar \
    app_process / com.genymobile.scrcpy.Server 1.25 \
    tunnel_forward=true control=false cleanup=false \
    max_size=1920 raw_video_stream=true

(Adjust max_size=1920 to use another resolution; the resolution impacts the latency.)

As soon as a client connects to localhost:1234 via TCP, mirroring starts and the device sends a raw H.264 video stream.

It can be played with:

./vlc -Idummy --demux=h264 --network-caching=0 tcp://localhost:1234

By playing a test pattern on the device and taking a picture (with a good camera) of the device next to the VLC window, we can measure the glass-to-glass delay.

Note that this delay includes the encoding time on the mobile device, which may be larger than on the target hardware.

On master

On VLC4 master without any change, the result is catastrophic (VLC is not designed to handle this use case):

before

The video is 30 fps, and each increment represents 1 frame, so 30 frames represent 1 second. At the end of this small capture, the delay is almost 10 seconds.
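As a small worked example, the delay can be computed from the two frame counters visible in the same photo. This is only an illustrative sketch: the function name and the counter readings below are hypothetical, not values taken from the capture.

```python
def latency_seconds(device_frame: int, vlc_frame: int, fps: float = 30.0) -> float:
    """Delay implied by two frame counters photographed at the same instant."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    # Each counter increment is one frame, so the difference is a frame count.
    return (device_frame - vlc_frame) / fps


# Hypothetical readings: the device shows frame 450 while VLC still shows frame 156.
print(f"{latency_seconds(450, 156):.1f} s")
```

With these made-up readings, VLC is 294 frames behind, which at 30 fps is 9.8 seconds.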

PoC

To mirror and control a remote device in real-time, the critical objective is to minimize latency. Therefore, any unnecessary wait is a bug.

Concretely, all waits based on a timestamp must be removed. Therefore, in 0latency mode, clocks become useless and timestamps are irrelevant. Also, buffering must be removed as much as possible.

To that end, this PoC changes several parts of the VLC pipeline.

Global --0latency option

The first commit adds a new global option --0latency, which is read by several VLC components. By default, it is disabled (of course).

To enable it, pass --0latency:

./vlc -Idummy --0latency --demux=h264 --network-caching=0 tcp://localhost:1234

Picture buffer

In VLC, when a picture is decoded, it is pushed by the decoder to a fifo queue, which is consumed by the video output.

For 0latency, at any time, we want to display the latest possible frame, so we don't want any fifo queue.

This PoC introduces a "picture buffer" (vlc_pic_buf; yes, this is a poor name), which is a buffer of exactly 1 picture:

  • the producer can push a new picture (overwriting any previous pending picture);
  • the consumer can pop the latest picture, which is a blocking operation if no pending picture is available.

The producer is the decoder. The consumer is the video output.
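To illustrate the semantics (not the actual vlc_pic_buf code, which is C), here is a minimal Python sketch of a single-slot buffer. The class and method names are hypothetical, and the timeout parameter is an addition for the sketch only.

```python
import threading


class PicBuf:
    """Single-slot buffer: push overwrites, pop blocks until a picture is pending."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = None
        self._has_pending = False

    def push(self, picture):
        with self._cond:
            # Overwrite any picture not yet consumed: only the latest matters.
            self._pending = picture
            self._has_pending = True
            self._cond.notify()

    def pop(self, timeout=None):
        with self._cond:
            # Block until the producer has pushed something (or the timeout expires).
            if not self._cond.wait_for(lambda: self._has_pending, timeout):
                raise TimeoutError("no pending picture")
            picture = self._pending
            self._pending = None
            self._has_pending = False
            return picture


buf = PicBuf()
buf.push("frame 1")
buf.push("frame 2")  # replaces "frame 1", which is never displayed
print(buf.pop())     # prints "frame 2"
```

The point of the design is that a slow consumer never accumulates a backlog: whatever it pops is always the most recent picture.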

Video output

In VLC, the video output attempts to display a picture at the expected date, so it waits for a specific timestamp. This is exactly what we want to avoid for 0latency.

If 0latency is enabled, this PoC replaces the vout thread function, which does a lot of complicated things, with a very simple loop (Thread0Latency()):

  1. pop picture from the picture buffer;
  2. call vout prepare();
  3. call vout display().
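The three steps above can be sketched as follows. This is a Python illustration under stated assumptions, not the Thread0Latency() code: the function name, the callables and the None stop signal are all hypothetical.

```python
from typing import Callable, Optional


def vout_loop_0latency(pop_picture: Callable[[], Optional[object]],
                       prepare: Callable[[object], None],
                       display: Callable[[object], None]) -> int:
    """Display each popped picture immediately; return how many were shown."""
    shown = 0
    while True:
        picture = pop_picture()  # assumed to block until a picture is pending
        if picture is None:      # hypothetical stop signal, for this sketch only
            return shown
        prepare(picture)
        display(picture)         # no wait on a PTS or on a clock
        shown += 1


# Usage with a fake source of two pictures followed by the stop signal.
frames = iter(["f1", "f2", None])
log = []
count = vout_loop_0latency(lambda: next(frames),
                           lambda p: log.append(("prepare", p)),
                           lambda p: log.append(("display", p)))
print(count, log)
```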

The function vout_PutPicture() is also adapted to push the frame to our new picture buffer instead of the existing picture fifo.

Note that in this PoC, the picture is not redrawn on resize, so after a resize the content stays black until the next frame arrives. That could be improved later.

Input/demux

In VLC, the input MainLoop() calls the demuxer to demux when necessary, but explicitly waits for a deadline between successive calls. We don't want to wait.

Therefore, this PoC provides an alternative MainLoop0Latency(), which is called if 0latency is enabled. This function basically calls demux->pf_demux() in a loop without ever waiting.
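The difference between the two loops can be sketched like this (Python, purely illustrative; the function names and the fake demuxer are hypothetical, and the demux callable is assumed to block while reading its input, so the unpaced loop does not spin):

```python
import time


def main_loop_paced(demux, interval_s):
    """Conventional pacing: wait between successive demux calls."""
    while demux():
        time.sleep(interval_s)


def main_loop_0latency(demux):
    """0latency style: call demux back to back until it reports end of stream."""
    while demux():
        pass


def make_demux(packets, out):
    """Fake demuxer for the sketch: forwards one packet per call."""
    it = iter(packets)

    def demux():
        pkt = next(it, None)
        if pkt is None:
            return False  # end of stream
        out.append(pkt)
        return True

    return demux


received = []
main_loop_0latency(make_demux([b"pkt0", b"pkt1", b"pkt2"], received))
print(len(received))
```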

Some clock-based code in the es_out (on control ES_OUT_SET_PCR), used for handling jitter, is also bypassed entirely.

Decoder

When the decoder implementation has a frame, it submits it to the vout via decoder_QueueVideo(). The queue implementation is provided by the decoder owner in the core, which handles preroll and may wait.

This PoC replaces this implementation with a simple call to vout_PutPicture(), to directly push the picture to our new picture buffer in the vout. If the vout was waiting for a picture, it is unblocked and will immediately prepare() and display().

On the module side, the avcodec decoder was adapted to disable dropping frames based on the clock (if a frame is "late"), and to enable the same options as if --low-delay was passed.

H.264 AnnexB 1-frame latency

The input is a raw H.264 stream in AnnexB format (this is what Android MediaCodec produces). This raw H.264 is sent over TCP.

The format is:

(00 00 00 01 NALU) | (00 00 00 01 NALU) | …

The length of each NAL unit is not present in the stream. Therefore, on the receiving side, the parser detects the end of a NAL unit when it detects the following start code 00 00 00 01.

However, this start code is sent as the prefix of the next frame, so the received packet will not be submitted to the decoder before the next frame is received, which adds 1-frame latency.

Fortunately, the length of the packet is known in advance on the device side. Therefore, a simple solution is to prefix the packet with its length (see Reduce latency by 1 frame).
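To make the 1-frame difference concrete, here is a Python sketch of both framings. It is an illustration only: the 4-byte big-endian length prefix is an assumption chosen for simplicity (it is not claimed to be the scrcpy frame meta layout), the function names are hypothetical, and the Annex B splitter only handles the 4-byte start code shown above.

```python
import struct

START_CODE = b"\x00\x00\x00\x01"


def split_annexb(buf: bytes):
    """Return (complete NAL units, remaining bytes).

    A unit is complete only once the *next* start code has been seen.
    """
    units = []
    pos = buf.find(START_CODE)
    if pos < 0:
        return units, buf
    while True:
        nxt = buf.find(START_CODE, pos + len(START_CODE))
        if nxt < 0:
            return units, buf[pos:]  # last unit may still be growing
        units.append(buf[pos + len(START_CODE):nxt])
        pos = nxt


def split_length_prefixed(buf: bytes):
    """Return (complete packets, remaining bytes) for a 4-byte length prefix."""
    packets = []
    pos = 0
    while len(buf) - pos >= 4:
        (size,) = struct.unpack_from(">I", buf, pos)
        if len(buf) - pos - 4 < size:
            break  # payload not fully received yet
        packets.append(buf[pos + 4:pos + 4 + size])
        pos += 4 + size
    return packets, buf[pos:]


frame = START_CODE + b"\x65\xaa\xbb"  # one frame, fully received

units, _ = split_annexb(frame)
print(len(units))    # 0: nothing can be delivered until the next frame arrives

packets, _ = split_length_prefixed(struct.pack(">I", len(frame)) + frame)
print(len(packets))  # 1: the frame is delivered as soon as it is complete
```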

For simplicity, for now I reused the scrcpy format, by requesting the server to send frame meta:

adb forward tcp:1234 localabstract:scrcpy
adb shell CLASSPATH=/data/local/tmp/vlc-scrcpy-server.jar \
    app_process / com.genymobile.scrcpy.Server 1.25 \
    tunnel_forward=true control=false cleanup=false max_size=1920 \
    send_device_meta=false send_frame_meta=true send_dummy_byte=false
#                          ^^^^^^^^^^^^^^^^^^^^

I wrote a specific demuxer to handle it: h264_0latency

To use it, replace the --demux= argument:

./vlc -Idummy --0latency --demux=h264_0latency --network-caching=0 tcp://localhost:1234

To make the difference obvious, I suggest playing a 1-fps video.

With all these changes, the glass-to-glass latency is reduced to 1 to 2 frames at 30 fps (roughly 33 to 67 ms):

0latency_poc2

(the device is on the left, VLC is in the middle, scrcpy is on the right)

Protocol discussions

For this PoC, the video stream is received over TCP from an Android device connected via USB (or via wifi on a local network), using a custom protocol.

Packet loss is non-existent over USB and very low on a good local wifi network. However, over the Internet, packet loss would add unacceptable latency with a protocol that retransmits lost packets (like TCP).

The following are some random thoughts.

Ideally, I think that:

  • we want to never decode a non-I-frame packet when the previous (referenced) packets are not received/decoded (this would produce corrupted frames)
  • we want to skip any previous packets (possibly lost) whenever an I-frame arrives

Concretely, the device sends:

 [I] P P P P P P P P P P P P P P P [I] P P P P P P P P P P P …

If a packet is not received:

 [I] P P P P P _ P P P P P P P P P [I] P P P P P P P P P P P …
               ^
             lost

then one possible solution:

  • the receiver does not decode further P-frames until the missing packet is received;
  • if a more recent I-frame is received, it starts decoding it immediately and forgets/ignores all previous packets.
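This policy could be sketched as follows, in Python. Nothing here exists in the PoC: the per-packet sequence number, the key-frame flag and the Receiver class are all hypothetical, and "decoding" is reduced to recording the sequence number.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    next_seq: int = 0                            # next packet that may be decoded
    held: set = field(default_factory=set)       # P-frames received after a gap
    decoded: list = field(default_factory=list)  # stand-in for the decoder

    def on_packet(self, seq: int, is_key: bool) -> None:
        if seq < self.next_seq:
            return  # stale: already decoded or skipped
        if is_key:
            # A newer I-frame needs no earlier packet: forget everything before it.
            self.held = {s for s in self.held if s > seq}
            self.next_seq = seq
        elif seq > self.next_seq:
            self.held.add(seq)  # gap before this P-frame: do not decode yet
            return
        # Here seq == next_seq: decode it, then any held packets that follow it.
        self.decoded.append(seq)
        self.next_seq = seq + 1
        while self.next_seq in self.held:
            self.held.remove(self.next_seq)
            self.decoded.append(self.next_seq)
            self.next_seq += 1


# Packet 3 is lost; packet 6 is the next I-frame.
rx = Receiver()
for seq, is_key in [(0, True), (1, False), (2, False), (4, False),
                    (5, False), (6, True), (7, False)]:
    rx.on_packet(seq, is_key)
print(rx.decoded)  # prints [0, 1, 2, 6, 7]
```

In this run, packets 4 and 5 are held and then dropped when the I-frame 6 arrives. Had packet 3 been received late but before packet 6, packets 3, 4 and 5 would have been decoded in order.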

As a drawback, this forces the use of a small GOP (i.e. frequent I-frames).

To be continued…

Edited by Romain Vimont
