Fast Free Speech Recognition using Google's Infrastructure
A few evenings ago, after a couple of glasses of red wine, I was wondering how Chrome's new(ish) x-webkit-speech voice input tag worked.
Scalable, reliable speech-to-text in open source environments is irritatingly hard to do, and it seemed (as with spam filtering) that Google had already done a lot of the hard work here and had a vested interest in it being accurate, performant and maintained.
A bit of digging through Chromium's source revealed a public API that will take audio data via HTTP and turn it into a JSON response. Cool.
A bit more digging revealed the expected audio formats and a few more internal details about the structure of the HTTP request.
const int SpeechRecognizer::kAudioSampleRate = 16000;
const int SpeechRecognizer::kAudioPacketIntervalMs = 100;
const ChannelLayout SpeechRecognizer::kChannelLayout = CHANNEL_LAYOUT_MONO;
const int SpeechRecognizer::kNumBitsPerAudioSample = 16;
const int SpeechRecognizer::kNoSpeechTimeoutSec = 8;
const int SpeechRecognizer::kEndpointerEstimationTimeMs = 300;
// ...
const char* const kContentTypeSpeex = "audio/x-speex-with-header-byte; rate=";
const int kSpeexEncodingQuality = 8;
const int kMaxSpeexFrameLength = 110; // (44kbps rate sampled at 32kHz).
// ...
const char* const kContentTypeFLAC = "audio/x-flac; rate=";
const int kFLACCompressionLevel = 0; // 0 for speed
All pretty straightforward: the API expects an HTTP POST of a 16-bit, 16 kHz mono audio stream encoded as either FLAC or Speex binary data.
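To put numbers on those constants, here is a small Python sketch of the arithmetic. It assumes that kAudioPacketIntervalMs describes how much audio goes into each packet; the function name `bytes_per_packet` is made up for illustration and does not come from the Chromium source.

```python
# Values copied from the Chromium constants quoted above.
SAMPLE_RATE_HZ = 16000
BITS_PER_SAMPLE = 16
CHANNELS = 1
PACKET_INTERVAL_MS = 100


def bytes_per_packet(sample_rate_hz=SAMPLE_RATE_HZ,
                     bits_per_sample=BITS_PER_SAMPLE,
                     channels=CHANNELS,
                     interval_ms=PACKET_INTERVAL_MS):
    """Size in bytes of one interval of raw (unencoded) PCM audio."""
    samples = sample_rate_hz * interval_ms // 1000
    return samples * channels * (bits_per_sample // 8)


print(bytes_per_packet())  # 1600 samples * 2 bytes = 3200
```

That is the raw PCM size before FLAC or Speex encoding; the encoded payload would normally be smaller.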
The first step is generating an appropriate audio stream. Sox is a great open source tool that lets you play, record and convert audio between a variety of formats. Recording is as easy as rambling into the microphone after running:
$ rec -r 16000 -b 16 -c 1 test.wav
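If you want to double-check that a recording really has the expected parameters before going further, something like the following Python sketch should do. It uses only the standard library `wave` module; the function name `matches_speech_api_format` is hypothetical, and the expected values are simply the constants quoted from the Chromium source.

```python
import wave

# Expected format, taken from the Chromium constants quoted earlier.
SAMPLE_RATE_HZ = 16000
BITS_PER_SAMPLE = 16
CHANNELS = 1


def matches_speech_api_format(path):
    """Return True if the WAV file at `path` is 16-bit, 16 kHz, mono."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == SAMPLE_RATE_HZ
            and wav.getsampwidth() * 8 == BITS_PER_SAMPLE
            and wav.getnchannels() == CHANNELS
        )
```

Calling `matches_speech_api_format("test.wav")` on the file recorded above would be the obvious use.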
Looking at the Chromium source, quite a lot of effort is spent on normalizing and trimming the audio prior to uploading. Thankfully, sox also provides this functionality.
To convert to FLAC, normalize and trim in one step:
$ sox test.wav test.flac gain -n -5 silence 1 5 2%
Breaking that down:
- sox test.wav test.flac (convert to FLAC format)
- gain -n -5 (normalize audio to -5 dB)
- silence 1 5 2% (trim silence from the beginning of the file based on a quietness threshold; trimming the end as well needs a second set of below-period arguments)
You can also convert to Speex format the same way.
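If the conversion needs to run from a script rather than by hand, the sox invocation above can be wrapped. This is a minimal Python sketch, assuming sox is installed and on the PATH; the helper names `sox_flac_command` and `convert_to_flac` are made up for illustration, and the arguments are exactly the ones from the command above.

```python
import subprocess


def sox_flac_command(src, dst, gain_db=-5, threshold="2%"):
    """Build the argv list for the sox conversion shown above."""
    return [
        "sox", src, dst,
        "gain", "-n", str(gain_db),       # normalize to gain_db dB
        "silence", "1", "5", threshold,   # trim silence below the threshold
    ]


def convert_to_flac(src, dst):
    """Run sox; raises CalledProcessError if sox exits with an error."""
    subprocess.run(sox_flac_command(src, dst), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.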
With the API details pulled out of the source, it's quite easy to perform a transcription request using curl to submit the binary data to the API endpoint:
#!/bin/bash
# FLAC encoded example
curl \
  --data-binary @example.flac \
  --header 'Content-type: audio/x-flac; rate=16000' \
  'https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&pfilter=2&lang=en-US&maxresults=6'
# Speex encoded example
curl \
  --data-binary @example.spx \
  --header 'Content-type: audio/x-speex-with-header-byte; rate=16000' \
  'https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&pfilter=2&lang=en-US&maxresults=6'
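For anyone who would rather make the call from code, the same request can be sketched in Python with the standard library. The URL, query parameters and Content-Type values are copied from the curl examples above; the function name `build_request` and its keyword arguments are my own invention, not part of any official client.

```python
import urllib.parse
import urllib.request

ENDPOINT = "https://www.google.com/speech-api/v1/recognize"


def build_request(audio_bytes, content_type="audio/x-flac", rate=16000,
                  lang="en-US", max_results=6):
    """Build (but do not send) a POST equivalent to the curl calls above."""
    query = urllib.parse.urlencode({
        "xjerr": 1,
        "client": "chromium",
        "pfilter": 2,
        "lang": lang,
        "maxresults": max_results,
    })
    return urllib.request.Request(
        f"{ENDPOINT}?{query}",
        data=audio_bytes,
        headers={"Content-Type": f"{content_type}; rate={rate}"},
        method="POST",
    )
```

Sending it would then be a matter of reading the FLAC file as bytes and passing the result of `build_request` to `urllib.request.urlopen`.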
And voilà, the output:
{
  "hypotheses": [
    {
      "confidence": 0.88569070000000005,
      "utterance": "this is pretty cool"
    },
    {
      "utterance": "thesis pretty cool"
    },
    {
      "utterance": "this is spirit cool"
    },
    {
      "utterance": "this ease pretty cool"
    },
    {
      "utterance": "this is pretty kool"
    }
  ],
  "id": "48e33aa5b2e21889c66f69ae492307b7-1",
  "status": 0
}
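Picking the best guess out of that response takes only a few lines. The sketch below is Python and rests on two assumptions drawn from nothing more than the sample output above: that a status of 0 means success, and that the first hypothesis (the only one carrying a confidence value) is the preferred one. The function name `best_hypothesis` is hypothetical.

```python
import json


def best_hypothesis(response_text):
    """Return (utterance, confidence) for the first hypothesis, or None."""
    payload = json.loads(response_text)
    hypotheses = payload.get("hypotheses", [])
    # Assumption: status 0 signals success; anything else is treated as failure.
    if payload.get("status") != 0 or not hypotheses:
        return None
    top = hypotheses[0]
    # Only the first hypothesis had a confidence in the sample, so use .get().
    return top["utterance"], top.get("confidence")
```

Returning None for an empty or failed response keeps the caller's error handling in one place.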
There are some great hacks and integrations you could do with this. Things I haven't tried:
- Determining maximum length of audio transcription
- Testing with low-quality audio (e.g. telephone)
- Playing with normalization/settings to maximize quality
- Any sort of streaming-batching to attempt (close to) realtime transcription
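As a starting point for the streaming-batching idea, one naive approach would be to cut a recording into fixed-length pieces and submit each piece separately. The Python sketch below only does the cutting, using the standard library `wave` module. It is untested against the API itself, the function name `split_wav` and the output naming scheme are made up, and fixed-length cuts will happily slice words in half.

```python
import wave


def split_wav(path, chunk_seconds, out_prefix):
    """Write consecutive chunks of `path` as WAV files; return their paths."""
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{out_prefix}{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Same rate, width and channels as the source; the frame
                # count in the header is corrected when the file is closed.
                dst.setparams(params)
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths
```

Each resulting chunk would still need the sox conversion step before being posted.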
Any other (semi) secret Google APIs I should know about?
http://fennb.com/fast-free-speech-recognition-using-googles-in