I had a quick play with Mozilla’s DeepSpeech. It’s a speech recognition engine built on TensorFlow and based on Baidu’s influential paper on speech recognition, Deep Speech: Scaling up end-to-end speech recognition.
I first tested it on an easy audio clip. Recognition was excellent.
➜ deepspeech models/output_graph.pb OSR_us_000_0060_16k.wav models/alphabet.txt models/lm.binary models/trie
Loading model from file models/output_graph.pb
2018-03-18 12:29:45.199308: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 1.470s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 3.867s.
Running inference.
the horse trotted around the field at a brisk pace find the twin who stole the pearl necklace cut the cord that binds the box tightly the red tape bound the smuggled food look in the corner to find the tan shirt the cold drizzle will help the bond drive nine men were hired to dig the ruins the junkyard had a moldy smell the flint sputtered and lit a pine torch soak the cloth and round the sharp or odor
Inference took 85.845s for 58.258s audio file.
Next, a harder clip with multiple speakers. This time it performed poorly, and many words were concatenated into long gibberish strings. Turn up your volume if you listen to it, because this audio is very faint.
➜ deepspeech models/output_graph.pb spkr0.wav models/alphabet.txt models/lm.binary models/trie
Loading model from file models/output_graph.pb
2018-03-18 12:14:09.931176: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 1.440s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 3.872s.
Running inference.
thsalwilljustdoitofntheirfarersothengsimplewefheraway a lot of these dimensions then knew the right way i got something that has all these different masesleavegeenyowothavnoubeewolbeoigisallsortsofviscionsooutmhepolthegoman oundobentyaleininkinoyeihadefferadepraredyshouldbetepridagaitintrytgetdeplenfonhimweresetop for him he roomisitisreallyounderyouoaothinktheyonshidinseholsateayoegoi think we can get pepletocomerethitbutthessasyy osbicsfin volunteer should i pursue that ohdeathetlyyeasotheypay originally they decided not to do go in to the speech soutnutterweerthe'lstillbesoontovolteerf essesfai'comeilityeapeople that are not linguiswereaninaerayaisisabotwerditoelisoreelyeeladersaemlyutbetiio'geeouteadenyersasmeeveeemipunyathealfgethertheothertaetisistecao ng of some sort nowentatatdayihatatjusbefinyge else also atibuelsutorsthegolsandwhatyouknow you can get this and in a gooelyi'lthagotreusedmisstabutyouoappreciateyouknow you can tell him that youll give him a saidhmiytodthe transcripts when they come back koyyitanywaytohvetetransferncesotitiavetoaneiiharesontbletetoconcentabotdoingthebegivingthemtocee immediately because a esissousaiowthispanofstoer aybeofenomegredpornt transfofstageronnanactygivasheelereris owdthostigoboiliketastritflerinhelpotsaton transcoptsreyngphingsyebuteskingsareweninglead out in her a itspeckisityeligotissetingitoyourhonoriagepasigiflgotlios o i ah did bunch of our mivingand still doing a much of our mihingnomin the midst of doing the pefilsforonmin eaven hours to do to 
copeandit'letayeanotherwordidyoucompettivowel'sidpenstabidsoijustbitasalotodatanedscopedfoinplacinoutofthemovewithanatape ill be done like in about two hours and so a in that for wolbeabltoinworkbymoremeeingssohomrtingofthedodn'tisbothattaisdhatwoncancesonidispreeent eneratingof one alsoanorymonisgeneratingtaclon two hobbies what all soon onsideottewfoaisseplantytososimpllysavedtisyeseeenoheeesilivinganentospiceooknothateodesarethehebilesfombrocesneswhichaeregenneratymone otovethem and for the for a hundred forty hours set and so they they were to degbiesperfileand we had
Inference took 434.213s for 300.000s audio file.
One last difficult clip, this time with a single speaker:
➜ deepspeech models/output_graph.pb LDC93S1.wav models/alphabet.txt models/lm.binary models/trie
Loading model from file models/output_graph.pb
2018-03-18 12:06:25.093521: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 1.458s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 4.011s.
Running inference.
she had a ducsuotangresywathorerall year
Inference took 14.435s for 2.925s audio file.
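The "Inference took ..." lines in the three runs make it easy to compare speed across clips. Here is a small sketch that divides inference time by audio duration to get a real-time factor (values above 1 mean slower than real time). The numbers are copied from the logs; the `real_time_factor` helper and the `runs` dictionary are my own names for illustration, not part of DeepSpeech.

```python
# Timings copied from the "Inference took Xs for Ys audio file" lines in the logs.
# Each entry maps a file name to (inference seconds, audio seconds).
runs = {
    "OSR_us_000_0060_16k.wav": (85.845, 58.258),
    "spkr0.wav": (434.213, 300.000),
    "LDC93S1.wav": (14.435, 2.925),
}


def real_time_factor(inference_s, audio_s):
    """Return inference time divided by audio duration (hypothetical helper)."""
    return inference_s / audio_s


for name, (inference_s, audio_s) in runs.items():
    rtf = real_time_factor(inference_s, audio_s)
    print(f"{name}: {rtf:.2f}x real time")
```

On these numbers the two longer clips come out at roughly 1.47x and 1.45x real time, and the short clip at about 4.94x, so on this machine every run was slower than real time.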
Side notes
DeepSpeech currently supports only 16 kHz WAV input. Files with other sample rates can trigger the following assertion error:
assert fs == 16000, "Only 16000Hz input WAV files are supported for now!"
Use ffmpeg to convert the audio to 16 kHz. For example, this resamples an 8 kHz file to 16 kHz mono, 16-bit PCM:
ffmpeg -i OSR_us_000_0060_8k.wav -acodec pcm_s16le -ac 1 -ar 16000 OSR_us_000_0060_16k.wav
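To avoid hitting that assertion in the first place, you can check a file's sample rate before running DeepSpeech. Below is a minimal sketch using Python's standard `wave` module. The function names and the `REQUIRED_RATE` constant are hypothetical, chosen for illustration, and the sketch assumes an uncompressed PCM WAV file, which is what `wave` can read.

```python
import wave

# 16000 Hz is taken from the assertion message quoted above.
REQUIRED_RATE = 16000


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, required_rate=REQUIRED_RATE):
    """Return True if the file should be converted (e.g. with ffmpeg) first."""
    return wav_sample_rate(path) != required_rate


# Example usage (file name from this post; adjust the path to your own copy):
# if needs_resampling("OSR_us_000_0060_8k.wav"):
#     print("Convert to 16 kHz with ffmpeg before running deepspeech")
```

This only inspects the header, so it is cheap to run over a whole folder of clips before a long inference session.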
Acknowledgements
I used audio files from these sources:
- OSR_us_000_0060_8k.wav voiptroubleshooter.com
- spkr0.wav ee.columbia.edu
- LDC93S1.wav github.com/mozilla/DeepSpeech