Using the Web Speech API to control a HTML5 video

3 January 2014

The JavaScript Web Speech API has been around since October 2012 but has not really been implemented in any browser other than in Chrome, and even then only partially. I decided to take a quick look at this API and put together a demo on how it could be used to control a HTML5 video.

My first ever job was as a software engineer for a voice recognition company, where we built voice recognition telephone systems for customers such as utility companies and banks (sorry), so voice recognition has always been of interest to me. Even though the Web Speech API has been around since the end of 2012, I have only recently had the chance to look at it and at Chrome’s implementation of it, which is quite simple since they have left the most complicated bits out (for now).

I’m not usually a fan of talking about WebKit-only or Chrome-only implementations of things, but in this case I will make an exception.

Simple Speech Recognition

The first major questions to answer are who actually performs the speech recognition, and where are the recognisers located? The answers are Google, and I’m not sure. The actual recognition is performed server-side, and is something you don’t have to worry about as Chrome will deal with it. This does appear to mean that the voice is either recorded or streamed from your browser to the server in question, which performs the recognition and returns the results. Starting the recogniser takes only a few lines:

// Create the webkitSpeechRecognition object which provides the speech interface
var rec = new webkitSpeechRecognition();
// Ensure that the recogniser is listening continuously, even if the user pauses (default value is false)
rec.continuous = true;
// Start recognising
rec.start();

This simple code snippet will cause Chrome to start listening continuously for speech and to attempt to recognise what was said. When the recogniser has some results, it fires an event which can be handled via the onresult handler:

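rec.onresult = function(e) {
   // Walk through the results that have changed since the last event
   for (var i = e.resultIndex; i < e.results.length; ++i) {
      if (e.results[i].isFinal) {
         // Final result: log the most probable transcript
         console.log(e.results[i][0].transcript);
      }
   }
};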

The event object returned contains the following data:

results[i] – results is an array-like list of recognition result objects. Each element corresponds to a recognised stretch of speech (one or more words)
resultIndex – the lowest index in results that has changed since the previous event
results[i].isFinal – a Boolean that indicates if the result is final or interim (interim results can be asked for via the interimResults attribute)
results[i][j] – each result is itself a list of alternative recognitions, so results can be treated as a 2D array. The first element, results[i][0], is the most probable alternative
results[i][j].transcript – the text representation of the recognised word(s)
results[i][j].confidence – the recogniser’s estimate of the probability that the result is correct (float value from 0 to 1)
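
To make the shape of the event object a little more concrete, here is a minimal sketch that collects the final results from an event. The helper name collectFinalTranscripts is my own invention for illustration and is not part of the API; it only assumes the fields described in the list above.

```javascript
// Hypothetical helper (not part of the Web Speech API).
// Expects an object shaped like the event passed to onresult.
function collectFinalTranscripts(e) {
   var finals = [];
   // Start at resultIndex so unchanged results are not processed again
   for (var i = e.resultIndex; i < e.results.length; ++i) {
      var result = e.results[i];
      if (result.isFinal) {
         // result[0] is the most probable alternative
         finals.push({
            transcript: result[0].transcript,
            confidence: result[0].confidence
         });
      }
   }
   return finals;
}

// Possible usage in the browser:
// rec.onresult = function(e) { console.log(collectFinalTranscripts(e)); };
```

Interim results are skipped here, so the function returns an empty array until the recogniser has settled on at least one final result.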

The Web Speech API specifies that grammar objects can be defined and used with the speech recognition, but Chrome doesn’t appear to have implemented this yet. Even though the webkitSpeechRecognition object does contain the attributes and methods for this, there is no documentation on how they might be used nor what formats are required. Because of this, at the moment the recogniser used by Chrome just performs general recognition on anything you say and you have to deal with the results yourself.

HTML5 Video Voice Control

I have put together a simple demo of HTML5 Video Voice Control with the Web Speech API that works in the latest Chrome. The demo listens for keywords in the recognised string and then uses the HTML5 Media API to interact with the video.

Note: In the demo, you need to click on the “Start Recognition” button to start the recogniser. This will also cause Chrome to ask for user permission to access the microphone, which you should confirm.

The code for the demo is also available on GitHub.

Once the recogniser returns a set of results, the code looks for a keyword, in this case “video”, which then causes it to parse the recognised string some more looking for relevant commands. The commands that are available are:

“video play” – plays the video
“video stop” – stops the video
“video replay” – returns to the start of the video and plays it
“video volume off” – mutes the video
“video volume on” – unmutes the video
“video volume decrease” – decreases the video’s volume by one
“video volume increase” – increases the video’s volume by one

As a command is correctly recognised and acted upon, it will be briefly highlighted in the command list.
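
The keyword lookup and the Media API calls described above could be sketched roughly as follows. This is not the demo’s actual code (that is on GitHub); the function names parseVideoCommand and applyVideoCommand are hypothetical, the volume step of 0.1 is an assumption, and “stop” is mapped to pause() because the Media API has no stop() method.

```javascript
// Hypothetical helper: maps a recognised string to one of the commands listed
// above. Returns null when the keyword "video" or a known command is missing.
function parseVideoCommand(transcript) {
   var text = transcript.toLowerCase();
   var start = text.indexOf("video");
   if (start === -1) {
      return null;
   }
   // Only look at what was said after the keyword
   var rest = text.substring(start + "video".length);
   // "replay" is checked before "play" because it contains "play"
   var commands = ["replay", "play", "stop", "volume off", "volume on",
                   "volume decrease", "volume increase"];
   for (var i = 0; i < commands.length; ++i) {
      if (rest.indexOf(commands[i]) !== -1) {
         return commands[i];
      }
   }
   return null;
}

// Hypothetical helper: applies a parsed command to a video element
function applyVideoCommand(video, command) {
   var step = 0.1; // assumed step size; volume ranges from 0 to 1
   switch (command) {
      case "play": video.play(); break;
      case "stop": video.pause(); break;
      case "replay": video.currentTime = 0; video.play(); break;
      case "volume off": video.muted = true; break;
      case "volume on": video.muted = false; break;
      case "volume decrease": video.volume = Math.max(0, video.volume - step); break;
      case "volume increase": video.volume = Math.min(1, video.volume + step); break;
   }
}
```

The volume is clamped to the 0 to 1 range because setting it outside that range throws an error. An unknown or null command falls through the switch and does nothing.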

The recogniser won’t always correctly recognise what was said, so be patient. If you want to see what the recogniser understood, the recognised result is written to the JavaScript console. Often you will see similar-sounding words being recognised instead, and since we are doing a simple indexOf() to check if a word was said, the match doesn’t always succeed. This is where being able to define a grammar would be very useful, as the recogniser would try to match what was said to what the grammar defines, giving more accurate results.
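
Until grammars are available, one possible mitigation is to check every alternative of a result rather than only the most probable one. The sketch below is untested against the demo and rests on assumptions: findAlternativeContaining is a hypothetical helper, and it only helps if the recogniser actually returns more than one alternative (the specification defines a maxAlternatives attribute for this, with a default of 1).

```javascript
// Hypothetical helper: returns the first alternative transcript that contains
// the phrase, or null if none does. "result" is shaped like e.results[i].
function findAlternativeContaining(result, phrase) {
   for (var j = 0; j < result.length; ++j) {
      var transcript = result[j].transcript.toLowerCase();
      if (transcript.indexOf(phrase) !== -1) {
         return result[j].transcript;
      }
   }
   return null;
}

// Possible usage in the browser (assumes more alternatives are requested):
// rec.maxAlternatives = 5;
// var match = findAlternativeContaining(e.results[i], "video play");
```

This is still a plain indexOf() check, so it does not replace a grammar; it just gives the less probable alternatives a chance to match.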

Chrome’s implementation is useful as a proof of concept. I would like to see other browsers follow suit soon, and I look forward to the implementation of grammars.