Draft: stt: add speech-to-text capability with whisper.cpp
This is a new version of the Speech-To-Text implementation.
Supersedes !4468 (closed)
Differences from the last version:
- Removed the core interface of STT.
- Used an SPU decoder instead of an STT decoder and audio filters.
What's new:
- A new SPU fourcc, VLC_CODEC_STT.
- A new SPU decoder, used as a sub-decoder to decode audio frames to SPU.
- A core interface to load an STT model asynchronously because the model can take several seconds to load and initialize accelerators if needed.
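As a rough illustration of the asynchronous loading idea, here is a minimal sketch using POSIX threads. It is not the code from this MR: the names stt_loader_t, stt_loader_start, stt_loader_wait and fake_load are hypothetical, and the real load function would call into whisper.cpp instead of the stand-in used here.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical loader state; none of these names come from VLC or whisper.cpp. */
typedef struct {
    pthread_t       thread;
    pthread_mutex_t lock;
    pthread_cond_t  done;
    bool            finished;              /* set once the load attempt ended */
    void           *model;                 /* result of load(), NULL on failure */
    void         *(*load)(const char *);   /* blocking load function */
    const char     *path;
} stt_loader_t;

static void *stt_loader_run(void *opaque)
{
    stt_loader_t *l = opaque;
    void *model = l->load(l->path);        /* slow part, off the caller's thread */

    pthread_mutex_lock(&l->lock);
    l->model = model;
    l->finished = true;
    pthread_cond_signal(&l->done);
    pthread_mutex_unlock(&l->lock);
    return NULL;
}

/* Start loading in the background; returns 0 on success. */
static int stt_loader_start(stt_loader_t *l,
                            void *(*load)(const char *), const char *path)
{
    assert(l != NULL && load != NULL);
    l->finished = false;
    l->model = NULL;
    l->load = load;
    l->path = path;
    pthread_mutex_init(&l->lock, NULL);
    pthread_cond_init(&l->done, NULL);
    return pthread_create(&l->thread, NULL, stt_loader_run, l);
}

/* Block until the load attempt ends, release resources, return the model.
 * Only call this after stt_loader_start() returned 0. */
static void *stt_loader_wait(stt_loader_t *l)
{
    pthread_mutex_lock(&l->lock);
    while (!l->finished)
        pthread_cond_wait(&l->done, &l->lock);
    pthread_mutex_unlock(&l->lock);

    pthread_join(l->thread, NULL);
    pthread_mutex_destroy(&l->lock);
    pthread_cond_destroy(&l->done);
    return l->model;
}

/* Stand-in for the real, slow model load (assumption for this sketch). */
static void *fake_load(const char *path)
{
    static int dummy_model;
    return path != NULL ? &dummy_model : NULL;
}
```

A caller would invoke stt_loader_start(&loader, fake_load, "model.bin"), keep doing other work, and call stt_loader_wait(&loader) only when the model is actually needed.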
How it works:
Live:
- Added functions in es_out to enable/disable STT.
- Automatically creates an SPU ES when an audio track is selected.
- When the new SPU track is selected, it starts a new thread to load the model. Once the model is loaded, the track is selected. At the same time, it uses pts_delay to increase buffering so that the right amount of audio is ready to be sent directly to the STT.
- Each audio frame is sent to the audio track's STT sub-decoder just before it is sent to the AOUT.
- When the STT creates the new SPUs, it sends them to the SPU queue (decoder_QueueSPU).
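The live path can be pictured with the following minimal sketch: every audio frame is handed to an STT sink first and to the audio output second, and the STT sink accumulates samples until a full window is available. All types and functions here (audio_frame_t, audio_tee_t, stt_sink_t, count_sink_t and their push functions) are hypothetical and only illustrate the ordering; the window size is an arbitrary parameter, and the real code queues an SPU where this sketch increments a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical PCM frame; not a VLC type. */
typedef struct {
    const float *samples;
    size_t       count;     /* number of samples in this frame */
    long long    pts_us;    /* presentation timestamp, microseconds */
} audio_frame_t;

typedef void (*frame_sink_t)(void *opaque, const audio_frame_t *frame);

/* Sends each frame to the STT sink (if any), then to the audio output. */
typedef struct {
    frame_sink_t stt;   void *stt_opaque;   /* stt may be NULL: STT disabled */
    frame_sink_t aout;  void *aout_opaque;
} audio_tee_t;

static void audio_tee_push(audio_tee_t *tee, const audio_frame_t *frame)
{
    assert(tee != NULL && frame != NULL && tee->aout != NULL);
    if (tee->stt != NULL)
        tee->stt(tee->stt_opaque, frame);   /* STT sees the frame first */
    tee->aout(tee->aout_opaque, frame);     /* then the audio output */
}

/* STT sink: buffers samples and "emits" once per full window. */
typedef struct {
    size_t   window;    /* samples needed per transcription, must be > 0 */
    size_t   buffered;  /* samples waiting for the next window */
    unsigned emitted;   /* windows handed over so far */
} stt_sink_t;

static void stt_sink_push(void *opaque, const audio_frame_t *frame)
{
    stt_sink_t *s = opaque;
    assert(s->window > 0);
    s->buffered += frame->count;
    while (s->buffered >= s->window) {
        s->buffered -= s->window;
        s->emitted++;   /* a real decoder would transcribe and queue an SPU */
    }
}

/* Audio output stand-in: only counts the frames it receives. */
typedef struct { unsigned frames; } count_sink_t;

static void count_sink_push(void *opaque, const audio_frame_t *frame)
{
    (void)frame;
    ((count_sink_t *)opaque)->frames++;
}
```

In this sketch the extra buffering obtained through pts_delay corresponds to choosing a delay at least as long as one window, so that a full window has been collected before the matching audio is played.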
Stream Output:
- Created a new type of stream output, STT, to be used without transcoding.
- As STT requires PCM, when a new audio track is added, it first loads the model, then creates an audio decoder and the STT SPU decoder, chaining them together.
- Each new frame is first sent to the audio decoder and then to the SPU decoder.
- As the SPU decoder returns SPUs asynchronously, it converts SPU to frames in VLC_CODEC_TEXT format before sending them to the next SOUT module in the chain.
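The last step of the stream output path, turning a subtitle unit into a text frame, might look like the sketch below. The structures stt_spu_t and text_frame_t and the functions spu_to_text_frame and text_frame_release are hypothetical simplifications, not VLC types; the real code works on VLC's own subpicture and frame structures.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified subtitle unit produced by the STT decoder. */
typedef struct {
    const char *text;
    long long   start_us;   /* display start, microseconds */
    long long   stop_us;    /* display end, microseconds */
} stt_spu_t;

/* Hypothetical, simplified text frame passed down the chain. */
typedef struct {
    char     *payload;      /* owned, NUL-terminated copy of the text */
    size_t    size;         /* payload length without the terminator */
    long long pts_us;
    long long length_us;
} text_frame_t;

/* Copy the subtitle text and timing into a new frame; NULL on allocation failure. */
static text_frame_t *spu_to_text_frame(const stt_spu_t *spu)
{
    assert(spu != NULL && spu->text != NULL);

    text_frame_t *f = malloc(sizeof(*f));
    if (f == NULL)
        return NULL;

    f->size = strlen(spu->text);
    f->payload = malloc(f->size + 1);
    if (f->payload == NULL) {
        free(f);
        return NULL;
    }
    memcpy(f->payload, spu->text, f->size + 1);   /* includes the terminator */
    f->pts_us = spu->start_us;
    f->length_us = spu->stop_us - spu->start_us;
    return f;
}

static void text_frame_release(text_frame_t *f)
{
    if (f != NULL) {
        free(f->payload);
        free(f);
    }
}
```

Because the subtitle units arrive asynchronously, a conversion like this would run in the callback that receives them, and the resulting frame would then be forwarded to the next module.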
Info:
- For now, the user has to obtain the model they want to use themselves.
- On macOS, iOS, and tvOS, Whisper and some frameworks currently fail to compile; I have not found a fix yet.
- The WIP commit exists because Whisper relies on features that require macOS 13.0 or later. I don't know if there is a better solution than what I've done.
Edited by Gabriel Lafond-Thenaille