Using the Web Speech API to control an HTML5 video

The JavaScript Web Speech API has been around since late 2012, but has not really been implemented in any browser other than Chrome, and even then only partially. I decided to take a quick look at this API and put together a demo showing how it could be used to control an HTML5 video.

My first ever job was working as a software engineer for a voice recognition company, where we built voice recognition telephone systems for customers such as utility companies and banks (sorry), so voice recognition has always been of interest to me. Even though the Web Speech API has been around since the end of 2012, I have only recently had the chance to look at it and at Chrome’s implementation, which is quite simple since they have left the most complicated bits out (for now).

I’m not usually a fan of writing about WebKit- or Chrome-only implementations of things, but in this case I will make an exception.

Simple Speech Recognition

The first major questions to answer are: who actually performs the speech recognition, and where are the recognisers located? The answer appears to be Google, although I’m not sure exactly where. The actual recognition is performed server-side, and it is something you don’t have to worry about, as Chrome deals with it. This does appear to mean that your voice is either recorded or streamed from your browser to the server in question, which performs the recognition and returns the results. Starting the recogniser is as simple as this:

// Create the webkitSpeechRecognition object, which provides the speech recognition interface
var rec = new webkitSpeechRecognition();
// Ensure that the recogniser keeps listening continuously, even if the user pauses (the default value is false)
rec.continuous = true;
// Start recognising
rec.start();

This simple code snippet will cause the Chrome browser to start listening continuously for speech and to attempt to recognise what was said. When the recogniser has some results, it returns data via an event which can be accessed via the onresult handler:

rec.onresult = function(e) {
   for (var i = e.resultIndex; i < e.results.length; ++i) {
      if (e.results[i].isFinal) {
         // Log the best transcript for each final result to the console
         console.log(e.results[i][0].transcript);
      }
   }
};

The event object returned contains a resultIndex, which indicates where the newest results start, and a results list. Each result carries an isFinal flag and holds one or more alternatives, each of which provides a transcript of what was recognised along with a confidence value.
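Within the handler above, a single final result can therefore be examined like this (a minimal sketch; the variable names are my own):

// Inside the onresult handler; "e" is the event passed to it
var result = e.results[e.resultIndex]; // the newest SpeechRecognitionResult
var best = result[0];                  // its highest-ranked alternative
console.log(best.transcript);          // the recognised text as a string
console.log(best.confidence);          // the recogniser's confidence, between 0 and 1
console.log(result.isFinal);           // true once this result will no longer change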

The Web Speech API specifies that grammar objects can be defined and used with speech recognition, but Chrome doesn’t appear to have implemented this yet. Even though the webkitSpeechRecognition object contains the relevant attributes and methods, there is no documentation on how they might be used nor what formats are required. Because of this, at the moment the recogniser used by Chrome just performs general recognition on anything you say, and you have to deal with the results yourself.
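Going by the interfaces the specification describes, attaching a grammar might eventually look something like the snippet below. This is purely a sketch: it assumes the prefixed webkitSpeechGrammarList constructor, the grammar string is a made-up example in the JSGF format (the spec doesn’t mandate a particular format), and Chrome currently appears to ignore all of it anyway.

// A sketch only: the grammar string is a made-up JSGF example,
// and Chrome appears to ignore the grammars attribute at present
var grammars = new webkitSpeechGrammarList();
grammars.addFromString('#JSGF V1.0; grammar commands; public <command> = play | pause ;', 1);
rec.grammars = grammars;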

HTML5 Video Voice Control

I have put together a simple demo of HTML5 video voice control with the Web Speech API, which works in the latest Chrome. The demo listens for key words in the recognised string and then uses the HTML5 Media API to interact with the video.

Note: In the demo, you need to click on the “Start Recognition” button to start the recogniser. This will also cause Chrome to ask for user permission to access the microphone, which you should confirm.
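The wiring for that button might look roughly like this (the button id here is hypothetical, and the onerror handler simply shows how the API reports a refused permission):

// Hypothetical button id; the demo's actual markup may differ
document.getElementById('start-recognition').addEventListener('click', function() {
   // Starting the recogniser is what triggers Chrome's microphone permission prompt
   rec.start();
});

// If permission is refused, an error event is raised instead of any results
rec.onerror = function(e) {
   console.log('Recognition error: ' + e.error); // e.g. "not-allowed"
};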

The code for the demo is also available on GitHub.

Once the recogniser returns a set of results, the code looks for a keyword, in this case “video”, and then parses the recognised string further, looking for relevant commands such as telling the video to play or pause. The full set of available commands is listed in the demo itself.
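The actual command handling lives in the demo code on GitHub, but as a rough sketch of the approach (the command words shown here are only examples, not the demo’s full list), it boils down to something like this:

var video = document.querySelector('video'); // the video element being controlled

rec.onresult = function(e) {
   for (var i = e.resultIndex; i < e.results.length; ++i) {
      if (!e.results[i].isFinal) {
         continue;
      }
      var said = e.results[i][0].transcript.toLowerCase();
      console.log(said); // the recognised result, written to the console

      // Only act if the keyword "video" was heard somewhere in the result
      if (said.indexOf('video') === -1) {
         continue;
      }

      // Simple indexOf() checks for a couple of example commands
      if (said.indexOf('play') !== -1) {
         video.play();
      } else if (said.indexOf('pause') !== -1) {
         video.pause();
      }
   }
};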

As a command is correctly recognised and acted upon, it will be briefly highlighted in the command list.

The recogniser won’t always correctly recognise what was said, so be patient. If you want to see what the recogniser understood, the recognised result is written to the JavaScript console. Often you will see similar-sounding words being recognised, and since we are doing a simple indexOf() check to see if a word was said, it doesn’t always work effectively. This is where being able to define a grammar would be very useful, as the recogniser would try to match what was said against what the grammar defines, giving more accurate results.

Chrome’s implementation is useful as a proof of concept, and I would like to see other browsers follow suit soon. I also look forward to the implementation of grammars.