Draft: stt: add speech-to-text capability with whisper.cpp (!4705) · Merge requests · VideoLAN / VLC

This is a new version of the Speech-To-Text implementation.

Differences from the last version:

What's new:

A new SPU fourcc VLC_CODEC_STT.
A new SPU decoder used as a subdecoder to decode audio frames to SPU.
A core interface to load an STT model asynchronously, because loading the model and, if needed, initializing accelerators can take several seconds.
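
To illustrate the asynchronous loading idea, here is a minimal sketch using POSIX threads. It is not the interface added by this MR: `stt_model_t`, `stt_loader_t`, `stt_track_t`, and every function below are hypothetical names, and the blocking load is a stub standing in for the real model initialization.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a loaded model; not a VLC or whisper.cpp type. */
typedef struct { bool ready; } stt_model_t;

/* Hypothetical stand-in for the SPU track waiting on the model. */
typedef struct { bool selected; } stt_track_t;

/* Called from the loader thread; model is NULL if loading failed. */
typedef void (*stt_loaded_cb)(stt_model_t *model, void *opaque);

typedef struct {
    pthread_t     thread;
    const char   *path;
    stt_loaded_cb on_loaded;
    void         *opaque;
    stt_model_t   model;
} stt_loader_t;

/* Stub for the slow part (reading weights, initializing accelerators). */
static bool stt_model_load_blocking(stt_model_t *model, const char *path)
{
    if (path == NULL || path[0] == '\0')
        return false;
    model->ready = true;
    return true;
}

static void *stt_loader_run(void *data)
{
    stt_loader_t *loader = data;
    bool ok = stt_model_load_blocking(&loader->model, loader->path);
    loader->on_loaded(ok ? &loader->model : NULL, loader->opaque);
    return NULL;
}

/* Start loading in the background; returns 0 if the thread was created. */
static int stt_loader_start(stt_loader_t *loader, const char *path,
                            stt_loaded_cb cb, void *opaque)
{
    assert(cb != NULL);
    loader->path = path;
    loader->on_loaded = cb;
    loader->opaque = opaque;
    loader->model.ready = false;
    return pthread_create(&loader->thread, NULL, stt_loader_run, loader);
}

/* Wait for the loader thread to finish (for example on teardown). */
static void stt_loader_join(stt_loader_t *loader)
{
    pthread_join(loader->thread, NULL);
}

/* Example callback: the track only becomes selected once the model is ready. */
static void stt_track_on_loaded(stt_model_t *model, void *opaque)
{
    stt_track_t *track = opaque;
    track->selected = (model != NULL);
}
```

The caller returns immediately from `stt_loader_start()` and learns about the result through the callback, so a slow model load does not block the thread that requested it.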

How it works:

Live:

Added functions in es_out to enable/disable STT.
Automatically creates an SPU ES when an audio track is selected.
When the new SPU track is selected, a new thread is started to load the model, and the track only becomes selected once the model has loaded. At the same time, pts_delay is used to increase buffering so that the right amount of audio is ready to be sent directly to the STT.
Each audio frame is sent to the audio track's STT sub-decoder just before it is sent to the AOUT.
When the STT creates the new SPUs, it sends them to the SPU queue (decoder_QueueSPU).
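
The buffering step above can be pictured with a small accumulator. This is only a sketch under assumptions: it supposes that the STT consumes fixed-size windows of float PCM, and `stt_accum_t`, `STT_WINDOW_SAMPLES`, and the functions below are hypothetical names, not code from this MR. The recognizer is replaced by a counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Tiny on purpose; a real window would span seconds of audio. */
#define STT_WINDOW_SAMPLES 8

typedef struct {
    float    samples[STT_WINDOW_SAMPLES];
    size_t   fill;          /* samples currently buffered */
    unsigned windows_sent;  /* full windows handed to the recognizer */
} stt_accum_t;

/* Duration of a window in microseconds, i.e. the extra buffering it needs. */
static int64_t stt_window_duration_us(size_t samples, unsigned rate)
{
    assert(rate > 0);
    return (int64_t)samples * 1000000 / rate;
}

/* Stub for the recognizer: count the window and start a new one. */
static void stt_transcribe_window(stt_accum_t *acc)
{
    acc->windows_sent++;
    acc->fill = 0;
}

/* Feed one decoded audio frame, before it continues to the audio output. */
static void stt_accum_push(stt_accum_t *acc, const float *pcm, size_t count)
{
    while (count > 0) {
        size_t room = STT_WINDOW_SAMPLES - acc->fill;
        size_t n = count < room ? count : room;
        memcpy(acc->samples + acc->fill, pcm, n * sizeof(*pcm));
        acc->fill += n;
        pcm += n;
        count -= n;
        if (acc->fill == STT_WINDOW_SAMPLES)
            stt_transcribe_window(acc);
    }
}
```

Under this assumption, the additional delay requested from the input would be on the order of `stt_window_duration_us()` for the chosen window, so that a full window is already available when the recognizer needs it.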

Stream Output:

Created a new type of stream output, STT, to be used without transcoding.
As STT requires PCM, when a new audio track is added, the module first loads the model, then creates an audio decoder and the STT SPU decoder and chains them together.
Each new frame is first sent to the audio decoder and then to the SPU decoder.
As the SPU decoder returns SPUs asynchronously, the module converts each SPU to a frame in VLC_CODEC_TEXT format before sending it to the next SOUT module in the chain.
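
The last step of that chain, converting an asynchronously delivered SPU into a text frame and forwarding it, could look roughly like the sketch below. All types and functions here (`stt_spu_t`, `text_frame_t`, `next_stage_t`, `on_spu_ready`) are simplified, hypothetical stand-ins; the real VLC structures carry far more state.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical subtitle unit produced by the STT SPU decoder. */
typedef struct {
    int64_t     start_us;
    int64_t     stop_us;
    const char *text;
} stt_spu_t;

/* Hypothetical plain-text frame handed to the next stage. */
typedef struct {
    int64_t pts_us;
    int64_t length_us;
    char    payload[128];
} text_frame_t;

/* Hypothetical next module in the chain; it just records what it receives. */
typedef struct next_stage {
    void (*send)(struct next_stage *next, const text_frame_t *frame);
    unsigned     received;
    text_frame_t last;
} next_stage_t;

static void next_stage_send(struct next_stage *next, const text_frame_t *frame)
{
    next->last = *frame;
    next->received++;
}

/* Turn one subtitle unit into a text frame; long text is truncated. */
static void spu_to_text_frame(const stt_spu_t *spu, text_frame_t *out)
{
    assert(spu->text != NULL);
    out->pts_us = spu->start_us;
    out->length_us = spu->stop_us - spu->start_us;
    snprintf(out->payload, sizeof(out->payload), "%s", spu->text);
}

/* Callback the SPU decoder would invoke whenever a subtitle is ready. */
static void on_spu_ready(const stt_spu_t *spu, void *opaque)
{
    next_stage_t *next = opaque;
    text_frame_t frame;
    spu_to_text_frame(spu, &frame);
    next->send(next, &frame);
}
```

Doing the conversion inside the callback means the module never has to wait for the recognizer: frames keep flowing into the decoders, and text frames are forwarded whenever a subtitle happens to be produced.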

Info:

For now, the user has to obtain the model they want to use themselves.
On macOS, iOS, and tvOS, there is currently a compilation problem involving Whisper and some frameworks; I have not found a fix yet.
The WIP commit exists because Whisper uses features that require macOS 13.0 or later. I don't know if there is a better solution than what I've done.
