WebVTT and Audio


I have written a number of posts in the past on WebVTT and video subtitles. It has always struck me that the technology is surely just as useful for HTML5 audio as it is for video, so I decided to look into its support.

Of course WebVTT stands for Web Video Text Tracks, so you could argue that it shouldn’t work for audio tracks at all, but since the whole point is to make video more accessible, why not extend this to audio as well?

The first thing I did was to test whether any of the major browsers (Firefox, Chrome, Opera, Edge) natively support displaying WebVTT text tracks that are associated with an <audio> element. None of them do; they all completely ignore any <track> content.

This is a shame, as both the audio and video elements are extensions of the media element, and therefore what one supports, the other should too. Of course, I know nothing about browser internals, so I will say no more on this.

One possible solution is to use something like WebVTT Transcript, which simply reads the contents of the required WebVTT file and displays them, in their entirety, on the page.
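To make the transcript idea concrete, here is a minimal sketch of reading a WebVTT file and reducing it to plain text. It is not the WebVTT Transcript code itself: `parseVttCues` and `vttToTranscript` are hypothetical helper names of my own, and the parsing only handles simple cues (character entities such as `&amp;` are left undecoded).

```javascript
// Minimal sketch, not a full WebVTT parser.
// parseVttCues and vttToTranscript are hypothetical helpers, not a library API.

function parseVttCues(vttText) {
  // Normalise line endings, then split the file into blank-line-separated blocks.
  const blocks = vttText.replace(/\r\n?/g, "\n").trim().split(/\n{2,}/);
  const cues = [];
  for (const block of blocks) {
    const lines = block.split("\n");
    // A cue block has a timing line such as "00:00:01.000 --> 00:00:04.000".
    const timingIndex = lines.findIndex((line) => line.includes("-->"));
    if (timingIndex === -1) continue; // skips the WEBVTT header, NOTE and STYLE blocks
    const [start, rest] = lines[timingIndex].split("-->");
    const end = rest.trim().split(/\s+/)[0]; // drop any cue settings after the end time
    // Join multi-line cue text and strip inline tags such as <v Name> or <b>.
    const text = lines.slice(timingIndex + 1).join(" ").replace(/<[^>]+>/g, "");
    cues.push({ start: start.trim(), end, text });
  }
  return cues;
}

function vttToTranscript(vttText) {
  // One line of transcript per cue.
  return parseVttCues(vttText).map((cue) => cue.text).join("\n");
}

// Assumed browser usage, with a hypothetical element id "transcript":
// fetch("audio-en.vtt")
//   .then((response) => response.text())
//   .then((vtt) => {
//     document.getElementById("transcript").textContent = vttToTranscript(vtt);
//   });
```

If the target element is styled with `white-space: pre-line`, each cue appears on its own line. The timing information is kept in the parsed cues, so it could be shown alongside the text if desired.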

Or, we could simply use a <video> element to play our audio.

Yep, the <video> element will also happily accept audio files in its <source> elements and we can use that to display our WebVTT contents for the audio file.

<video controls width="500" height="100">
   <source type="audio/mp3" src="arth_ire8.mp3">
   <track label="English" kind="subtitles" srclang="en" src="audio-en.vtt" default>
</video>

Given the current state of browsers’ WebVTT implementations, some work is required. The <video> element needs a fixed size so that the WebVTT content can be displayed, and the cue font size also has to be set, as otherwise the text is too small (unless you give the <video> element bigger dimensions). This works best in WebKit browsers, as Firefox’s current implementation ignores all attempts to change the font size of the cue text. None of the browsers responded to caption positioning, which is a pity, as putting the captions at the top of the container would be best.

video {
   border: 1px solid #aaa;
   object-fit: initial;
}
::cue {
   font-size: 12px;
}

It is also necessary to set the object-fit property to initial for WebKit browsers, as otherwise the internal black box that is added for all video content does not fit within the dimensions of the box that we have defined.

I have created a simple example of this method for WebVTT and Audio.

Until browsers decide to implement WebVTT for <audio> elements, using <video> appears to be the way forward.