Fast Free Speech Recognition using Google's Infrastructure

A few evenings ago, after a couple of glasses of red wine, I was wondering how Chrome's new(ish) x-webkit-speech voice input tag worked.

Scalable, reliable speech-to-text in open source environments is irritatingly hard to do, and it seemed (as with spam filtering) that Google had already done a lot of the hard work here and had a vested interest in keeping it accurate, performant and maintained.

A bit of digging through Chromium's source revealed a public API that will take audio data via HTTP and turn it into a JSON response. Cool.

A bit more digging revealed the expected audio formats and a few more internal details about the structure of the HTTP request.

const int SpeechRecognizer::kAudioSampleRate = 16000;
const int SpeechRecognizer::kAudioPacketIntervalMs = 100;
const ChannelLayout SpeechRecognizer::kChannelLayout = CHANNEL_LAYOUT_MONO;
const int SpeechRecognizer::kNumBitsPerAudioSample = 16;
const int SpeechRecognizer::kNoSpeechTimeoutSec = 8;
const int SpeechRecognizer::kEndpointerEstimationTimeMs = 300;

// ...

const char* const kContentTypeSpeex = "audio/x-speex-with-header-byte; rate=";
const int kSpeexEncodingQuality = 8;
const int kMaxSpeexFrameLength = 110;  // (44kbps rate sampled at 32kHz).

// ...

const char* const kContentTypeFLAC = "audio/x-flac; rate=";
const int kFLACCompressionLevel = 0;  // 0 for speed


All pretty straightforward: the API expects an HTTP POST of a 16-bit, 16 kHz mono audio stream encoded as either FLAC or Speex binary data.
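If you want to sanity-check a recording against those numbers before converting it, something like the following rough sketch should do. It only uses Python's standard `wave` module; the function name `wav_format_problems` and the constant names are my own, and the expected values are simply the ones quoted from the Chromium source above.

```python
import wave

# Values taken from the Chromium constants quoted above.
EXPECTED_RATE_HZ = 16000
EXPECTED_CHANNELS = 1            # mono
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample


def wav_format_problems(source):
    """Return a list of format mismatches (an empty list means the file matches).

    `source` is a path or a binary file object, as accepted by wave.open().
    """
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {EXPECTED_RATE_HZ}"
            )
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(
                f"{wav.getnchannels()} channels, expected {EXPECTED_CHANNELS}"
            )
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(
                f"{wav.getsampwidth() * 8}-bit samples, expected "
                f"{EXPECTED_SAMPLE_WIDTH_BYTES * 8}-bit"
            )
    return problems
```

Calling `wav_format_problems("test.wav")` on a file in the right format should give back an empty list; anything else tells you which setting to fix.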

The first step is generating an appropriate audio stream. Sox is a great open source tool that lets you play, record and convert audio between a variety of formats. Recording is as easy as rambling into the microphone after running:

$ rec -r 16000 -b 16 -c 1 test.wav

Looking at the Chromium source, there is quite a lot of effort spent on normalizing and trimming the audio prior to uploading. Thankfully, sox also provides this functionality.

To convert to FLAC, trim and normalize in one step:

$ sox test.wav test.flac gain -n -5 silence 1 5 2%

Breaking that down:

sox test.wav test.flac (convert to FLAC format)
gain -n -5 (normalize audio to -5 dB)
silence 1 5 2% (trim leading silence based on a 2% quietness threshold; trimming the end of the file as well needs the effect's second, below-periods set of arguments)
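If you'd rather drive that conversion from a script, here is a minimal sketch that wraps the same sox invocation. The helper names `build_sox_command` and `convert_to_flac` are hypothetical, the arguments are exactly the ones from the command above, and it assumes `sox` is installed and on your PATH.

```python
import subprocess


def build_sox_command(src, dst, gain_db=-5, threshold="2%"):
    """Build the argument list for the sox call shown above."""
    return [
        "sox", src, dst,
        "gain", "-n", str(gain_db),       # normalize to the given level
        "silence", "1", "5", threshold,   # same silence arguments as above
    ]


def convert_to_flac(src, dst):
    """Run sox; raises CalledProcessError if sox exits with an error."""
    subprocess.run(build_sox_command(src, dst), check=True)
```

Passing the arguments as a list (rather than one shell string) means file names with spaces don't need any quoting.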

You can also convert to Speex format the same way.

Pulling the API details out of the source, it's quite easy to perform a transcription request using curl to submit the binary data to the API endpoint:

#!/bin/bash

# FLAC encoded example
curl \
  --data-binary @example.flac \
  --header 'Content-type: audio/x-flac; rate=16000' \
  'https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&pfilter=2&lang=en-US&maxresults=6'

# Speex encoded example
curl \
  --data-binary @example.spx \
  --header 'Content-type: audio/x-speex-with-header-byte; rate=16000' \
  'https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&pfilter=2&lang=en-US&maxresults=6'
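The same request can be put together from a script. Below is a rough Python sketch using only the standard library; the endpoint, query parameters and content types are copied from the curl examples, while the function names `build_request` and `transcribe` are my own. It assumes the endpoint still behaves the way it did when I poked at it.

```python
import json
import urllib.parse
import urllib.request

# Endpoint and query parameters as used in the curl examples above.
ENDPOINT = "https://www.google.com/speech-api/v1/recognize"


def build_request(audio_bytes, content_type="audio/x-flac", rate=16000,
                  lang="en-US", max_results=6):
    """Build (but do not send) the HTTP POST for a chunk of encoded audio."""
    query = urllib.parse.urlencode({
        "xjerr": 1,
        "client": "chromium",
        "pfilter": 2,
        "lang": lang,
        "maxresults": max_results,
    })
    return urllib.request.Request(
        f"{ENDPOINT}?{query}",
        data=audio_bytes,
        headers={"Content-Type": f"{content_type}; rate={rate}"},
        method="POST",
    )


def transcribe(path):
    """Send a FLAC file and return the decoded JSON response."""
    with open(path, "rb") as audio:
        request = build_request(audio.read())
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For Speex, pass `content_type="audio/x-speex-with-header-byte"` to `build_request` instead.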


And voila, the output:

{
    "hypotheses": [
        {
            "confidence": 0.88569070000000005,
            "utterance": "this is pretty cool"
        },
        {
            "utterance": "thesis pretty cool"
        },
        {
            "utterance": "this is spirit cool"
        },
        {
            "utterance": "this ease pretty cool"
        },
        {
            "utterance": "this is pretty kool"
        }
    ],
    "id": "48e33aa5b2e21889c66f69ae492307b7-1",
    "status": 0
}
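Pulling the best guess out of that response is a few lines of Python. This is only a sketch based on the single response above, so two things are assumptions on my part: that a `status` of 0 means success, and that the hypotheses come back best-first (only the first one carries a `confidence` here). The function name `best_hypothesis` is made up.

```python
import json


def best_hypothesis(response_text):
    """Return (utterance, confidence) for the top hypothesis, or None.

    Assumes status 0 means success and that hypotheses are ordered
    best-first, as in the sample response above.
    """
    data = json.loads(response_text)
    if data.get("status") != 0:
        return None
    hypotheses = data.get("hypotheses", [])
    if not hypotheses:
        return None
    top = hypotheses[0]
    # Lower-ranked hypotheses in the sample have no confidence, hence .get().
    return top["utterance"], top.get("confidence")
```

Fed the response above, this hands back "this is pretty cool" along with its confidence of roughly 0.886.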


There are some great hacks and integrations you could do with this. Things I haven't tried:

Determining maximum length of audio transcription
Testing with low-quality audio (e.g. telephone)
Playing with normalization/settings to maximize quality
Any sort of streaming-batching to attempt (close to) realtime transcription

Any other (semi) secret Google APIs I should know about?

http://fennb.com/fast-free-speech-recognition-using-googles-in


Wednesday, October 29, 2014
