Deep Learning for Amharic speech recognition – Part 2

Some more examples after training on a bigger dataset. Everything else (model size, hyper-parameters, and so forth) has been kept the same. WER stands for Word Error Rate. The samples below were taken from the training set, not from validation/test data.
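
As an aside, here is a minimal sketch of how such a number can be computed. I'm assuming the WER values listed below are raw word-level edit distances (substitutions, insertions and deletions) between truth and prediction; dividing by the number of words in the truth would turn them into a rate. The function is my own illustration, not code from my repository.

def word_errors(truth, predicted):
    r, h = truth.split(), predicted.split()
    # dp[i][j] = edits needed to turn the first i truth words into the first j predicted words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(r)][len(h)]

print(word_errors("አስር ሺ ኢትዮጵያዊያ ንም ቤት አልባ ሆኑ",
                  "አስር ሺ ኢትዮጵያውያ ንም ቤት አልባ ሆኑ"))   # 1, only one word differs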

Truth: የሚ ወጡት ኤምባሲው ውስጥ የሚሰሩ አባላት ና ጥገ ኞቻቸው እንደሆኑ አሜሪካኖቹ አስታውቀ ዋል
Predicted: የሚ ጡት ምባሲ ውስጥ የሚ ሰርአባላት ነው ትገንቻቸው እንዱሙ አሜሊካንቹ አስ ታውዋል
WER: 19
Truth: አስር ሺ ኢትዮጵያዊያ ንም ቤት አልባ ሆኑ
Predicted: አስር ሺ ኢትዮጵያውያ ንም ቤት አልባ ሆኑ
WER: 1
Truth: ያን ያህል ተዳርሰ ን እንወስ ና ለ ን እንጂ ውሳኔ ለ ማግኘት ወደ በላዩ የ ኢትዮጵያ ፉትቦል ፌዴሬሽን አን ሄድ ም
Predicted: የን የህል ተዳርሰንወስና ለን ንጅ ውሳኔለ ማግት ተወደ በላ ይ ኢትዮጵያ ፉት ቦል ፌዴድ ሽን ንደው
WER: 24
Truth: ሰውዬው ሚስቱ ን በጣም እንደሚ ወዳት በ ለ ሆሳ ስ ነገራት
Predicted: ሰውዬው ሚስቱ ን በጣም እንደሚ ወዳት በ ለሆሳስ ነገራት
WER: 2
Truth: በ መከላከያ ሚኒስቴር ሹም ሽር እና እስር እየተካሄደ ነው
Predicted: በመካላኪያሚቴርሹም ሽር ና ስር እየ ተካሄዳ ነው
WER: 11
Truth: ኢህ አ ድግ በ ርእሰ ከተማይቱ አዲስ አባ ሀያ ሰባት ወረዳ ዎች በ እርግጠኝ ነት እንደሚ ያሸንፍ እንደሚ ተማመን ገልፇ ል
Predicted: ያድግ በርሰ ከተማይቶ አዲ በባአያ ሰባቶ ረዳዎች በ ርግተኝነት እንደሚያሸነፍ እንደሚ ተማመንገልፇል
WER: 22
Truth: ለምን ስ እና ንገላታ ቸዋለን የሚ ል አስተያየት ሰንዝ ረዋል
Predicted: ለምንስ ና ንገላታቸዋለን የሚ ላ አ ሰያት ስንዝረዋል
WER: 9
Truth: ግን እሱን ከ ሶስት አመት በፊት ሸ ጬ ዋለሁ
Predicted: ግያ ሱም ከሶስት አመት በፊሸጩነሁ
WER: 11
Truth: አህዛብም ከጥፋት ውሀ በኋላ
Predicted: አህዛብም ከጥፋት ሀ በኋላ
WER: 1
Truth: በዚያች አመትም ስለ ከብቶቻቸው ሁሉ ፋንታ እህልን መገባቸው
Predicted: በዚያች አመትም ስለ ከብቶቻቸው ሁሉ ፈንታ እልን መገባቸው
WER: 2

Deep Learning for Amharic speech recognition

Research in speech recognition for English and Chinese has shown impressive progress recently. However, the only product I found for Amharic was Google's cloud service. It supports Amharic and works surprisingly well. Nevertheless, it would be wrong to conclude that the problem has been solved, since we don't know Google's methodology or how it was evaluated. On top of that, their work is proprietary. There needs to be open research to speed up progress and make the results available for wider use.

This post explains how I applied deep recurrent networks to train a speech recognizer for Amharic. I'll share details of my approach, dataset, and results. Knowledge of Deep Learning and RNNs is beneficial although not necessary. All my work is available on GitHub.

The challenge

First, it's helpful to understand what we are up against. Research in speech recognition started way back in the 1950s. Why has it taken so long? It turns out there is a lot of variation in how we pronounce words, which makes recognition tricky. Let's see an example. Here are two recordings of the same speaker saying the same word.

 

Let us now plot the audio signals. These are basically one-dimensional arrays of numbers. We can see that the two clips have different signals.

speaker1

And here is a second speaker, this time female, saying the same word.

speaker2

The second speaker's recording looks even more different from the first. To make the problem worse, there are different accents, some people speak fast while others speak more slowly, pitch varies from person to person, there can be background noise, and so forth. If the speaker is visible, we humans pay attention to mouth movements and gestures to aid the listening process. We also consider context: if we didn't hear a word very well, we fill in the gap by considering what the speaker said before and after.

Data

Data for Amharic speech recognition is extremely difficult to come by. There is plenty of audio in Amharic but very little of it is transcribed. Luckily, a friend pointed me to Getalp. They prepared about 10k audio clips totalling 20 hours of speech. Each clip is a few seconds long and comes with its transcription. There are also a few hundred separate clips for validation purposes.

Features

Key to any machine learning method is extracting features from data. Features represent the data and serve as input to the learner. Speech recognition methods derive features from audio, such as spectrograms or Mel Frequency Cepstral Coefficients (MFCC). The audio is split into small blocks, say 10 milliseconds each, and each block is broken down into its constituent frequencies. A good analogy is musical chords. Listen to the following audio.

Three notes are played separately and then all three are played simultaneously. The last piece of sound, called a chord, is similar to what we hear in speech.

I chose MFCC over spectrograms because it produces fewer features. A spectrogram can work even better than MFCC since it carries more information, but the downside is that it needs more data and a longer time to train. What does MFCC look like? Let's take a few seconds of an audio clip and plot its MFCC features. The result is shown below. Time is on the Y axis. Each horizontal line represents a short audio block of 10 ms duration. The 13 columns (0-12) are what's known in signal processing as MFCC coefficients; here we simply refer to them as MFCC features. Colors are intensities. Since I chose to use 13 coefficients, each 10 ms block of audio is represented by an array of 13 numbers.

mfcc
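
If you want to compute something similar yourself, here is a minimal sketch using librosa. This is an assumption on my part (it is not necessarily the library used in my repository), and the file name and window parameters are placeholders.

import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load a clip at 16 kHz (hypothetical file name)
signal, sr = librosa.load("clip.wav", sr=16000)

# 13 MFCC coefficients, ~25 ms windows with a 10 ms hop
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)
print(mfcc.shape)   # (13, number of 10 ms frames)

# Quick visualisation, similar to the plot above
librosa.display.specshow(mfcc, x_axis="time", sr=sr, hop_length=160)
plt.colorbar()
plt.show()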

You can read further on MFCC and the reasons for choosing 13 coefficients. I suggest you don't spend much time on it though, because Deep Learning researchers are more interested in having the network learn features by itself than in researching how to pre-process the data for it.

Deep Learning and Recurrent Neural Networks

The strength of neural-network-based solutions lies in the fact that they can be used as generic learning methods for almost any problem. This has been proven by their application to self-driving cars, image recognition, healthcare, and machine translation, to name a few. I used Deep Learning (DL) and Recurrent Neural Networks (RNNs) because they have shown excellent results, and tech companies including Google, Amazon and Baidu use DL for speech recognition.

neural-network

The core idea is to have a network of interconnected nodes (hence the name neural network) where each node computes a function and passes information to the nodes next to it. Every node computes its output using the function it was configured with and some parameters. The process of learning updates these parameters; the function itself does not change. Initially the parameters are randomly initialized. Then training data is passed through the network, making small improvements to the parameters at every step.
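
As a toy illustration (not part of my actual model), a single node with two inputs might compute its output like this. The weights and bias are the parameters that training would adjust; the ReLU function itself stays fixed.

import numpy as np

def node(inputs, weights, bias):
    # weighted sum of the inputs followed by a fixed non-linearity (ReLU here)
    return np.maximum(0.0, np.dot(inputs, weights) + bias)

x = np.array([0.5, -1.2])   # values coming from the previous nodes
w = np.array([0.8, 0.3])    # parameters: randomly initialized, then learned
b = 0.1
print(node(x, w, b))        # this node's output, passed on to the next nodes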

If it all looks like magic, you are not alone! In fact, even leading researchers in the field experiment with different functions and network layouts to see which works best. Still, there are some well-understood functions and network architectures. In the case of speech recognition, Recurrent Neural Networks (RNNs) are used because their output depends not just on the current set of inputs but on previous inputs too. This is crucial in speech recognition: predicting what has been said in a particular window of time becomes much easier if we know what was said before. If you need learning material, there are excellent Machine Learning and Deep Learning courses on Coursera.

Connectionist Temporal Classification

Earlier we saw how audio files are broken down into 10 ms chunks, with each chunk represented by an array of features. We then feed the chunks to the network one by one and the network predicts, or emits, letters. There's one problem though. When we speak, we don't utter letters in fixed time durations. Letters are stretched in time, as the image below shows.

ctc1

We solve this with Connectionist Temporal Classification (CTC). In CTC, all the መ's shown in the image above would be collapsed into a single character, and similarly for the "ሄ" and "ድ" characters. But what if the word spoken was "መመከር" and we don't want the two መ's collapsed into a single "መ", which would give the wrong prediction "መከር"? For this, CTC introduces a special blank character; let's call it B for short. The network emits B whenever it thinks no new character is being spoken at that time step, so a genuinely repeated character is separated by a B and survives the collapsing. B is distinct from the space character, which is emitted when there is a long silence, normally heard between words. All in all, the network predicts all the Amharic characters seen in the training transcriptions, plus B and space.
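
To make the collapsing rule concrete, here is a minimal sketch (my own illustration, not code from the repository). Repeated characters are merged first, then the blanks are dropped:

def ctc_collapse(sequence, blank="B"):
    out = []
    previous = None
    for ch in sequence:
        if ch != previous:      # merge consecutive repeats
            out.append(ch)
        previous = ch
    return "".join(c for c in out if c != blank)   # then drop the blanks

print(ctc_collapse("መመመBሄሄBድድ"))   # መሄድ
print(ctc_collapse("መመBመከከርር"))    # መመከር: the blank keeps the two መ's apart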

What we really get during prediction is a probability for each character. So we end up with a matrix showing the probability of each character at each time step. Let's see an example. Listen to the following clip.

I will go into detail later about how I built the machine learning model itself. For now, take a look at the predicted probabilities. Since the full prediction matrix is too big, I have removed predictions for most of the characters except the interesting ones, and removed all time steps except the ones most relevant to the first word.

probabilities

In the max column we simply take the character with the maximum probability at each time step (row). This is called a greedy approach.
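
A minimal sketch of that greedy decoding step, again as my own illustration: probs stands for a matrix like the one above, with one row per time step and one column per character, and the character set here is a made-up subset.

import numpy as np

def greedy_decode(probs, characters, blank="B"):
    # most likely character at every time step (row)
    best = [characters[i] for i in np.argmax(probs, axis=1)]
    # collapse repeats, then drop blanks (the CTC rule from above)
    collapsed = [c for i, c in enumerate(best) if i == 0 or c != best[i - 1]]
    return "".join(c for c in collapsed if c != blank)

characters = ["ያ", "ላ", "ነ", "B"]
probs = np.array([[0.7, 0.2, 0.0, 0.1],
                  [0.6, 0.1, 0.0, 0.3],
                  [0.1, 0.0, 0.1, 0.8],
                  [0.1, 0.1, 0.7, 0.1]])
print(greedy_decode(probs, characters))   # ያነ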

Although the greedy approach worked just fine in this particular example, it's not the ideal solution. For example, "ያያB", "ያBB" and "BያB" are three different paths which all result in the same prediction "ያ" after collapsing. So what we really need is a more holistic probability that considers all the paths leading to the same transcription. If, say, "ላ" turned out to have the highest probability at the first time step, a greedy approach would simply pick "ላ" without comparing the overall probability of all the paths that start with "ላ" against those that start with "ያ".

The ideal solution would consider every sequence of characters that could possibly be predicted over the duration of the clip. Then, for each sequence, it would find all the paths that lead to that particular sequence and sum up their probabilities. The result would be a probability for every candidate character sequence, and we would simply pick the sequence with the maximum probability. This operation has exponential complexity. There are 224 characters in my setup. Take a 5-second clip and assume we split each second into 100 time steps: we would end up with 224^500 paths to process! That is not feasible at all. So we do a beam search instead. Rather than considering all characters, we discard the ones whose probability is too small to matter, and we look ahead over shorter time spans instead of the entire clip.

To sum up, CTC allows us to take probabilities from the network and decode them into a sequence of characters.

Acoustic Model

We have pre-processed the audio signals to get MFCC features, and worked out how network predictions can be decoded into letters. Now let's see what happens in the middle: we need to associate the audio signals, represented by MFCC features, with Amharic characters. This part is known as acoustic modelling.

Model 1

The first model I tried had one RNN layer with Gated Recurrent Units (GRU) followed by normalization. Finally, there's a softmax layer to output letter probabilities.

model_1
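
For illustration, a model along these lines could be written in Keras roughly as follows. This is a hedged sketch rather than the exact code in my repository: the layer size is a placeholder, I'm assuming the normalization is batch normalization, and the CTC loss wiring is omitted.

from keras.models import Model
from keras.layers import Input, GRU, BatchNormalization, TimeDistributed, Dense, Activation

n_features = 13   # MFCC coefficients per 10 ms frame
n_classes = 224   # number of output symbols (224 characters in my setup, see the CTC section)

inputs = Input(shape=(None, n_features))      # variable-length sequence of MFCC frames
x = GRU(200, return_sequences=True)(inputs)   # single recurrent layer
x = BatchNormalization()(x)                   # normalization
x = TimeDistributed(Dense(n_classes))(x)      # per-time-step scores
outputs = Activation("softmax")(x)            # letter probabilities

model = Model(inputs, outputs)
model.summary()
# Training attaches a CTC loss (e.g. keras.backend.ctc_batch_cost) on top of these outputs.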

The predicted text was nowhere near the truth. Here is a typical example.

Truth: ያንደኛ ደረጃ ትምህርታቸው ን ጐንደር ተ ም ረዋል
Predicted: ጌጭሹጭካጌካጐጭቋጭካቋካሶካዤካጪኰጪዊኰጪዊሩሸሩሸቋጪቋጌቼኮጼዛሙጪኰጣኰእኰላጼላጼኰጪህችህጪኰጪኰኩጼጪጧጽቀአእሸሄኬሄአሩህሄህጽጪሹጪሹጪዊአምአ’ዊሸሹራዊዚኙኩጯኩኬጃሹካጪካዊኰአዋኰሹይካጪኰጪሄሙኋሙኛጼጪኰጪኰጪኰጪቆቮሙ’አላሊላሊአሙአሙአሙሩዊዤካዤካቿዊፉዊዋዊዚኩጼዚዋሹአሹዋሙዋሙዋሹአቋዤኬሄርሸሩሸ’ሸቋጃቋጭቋጭጪጭቋካቋሄጭቋጭቋጭ

Model 2

For the second model I took inspiration from Baidu's Deep Speech research. One of their improved architectures is Deep Speech 2. Although I haven't exactly replicated their network, my aim was to have something broadly similar to it. I put a convolutional layer first, followed by several GRU layers.

model_2
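
Again purely for illustration, a rough Keras sketch of this kind of architecture. The filter count, layer sizes and number of GRU layers are placeholders, and the CTC loss wiring is omitted as before.

from keras.models import Model
from keras.layers import Input, Conv1D, GRU, BatchNormalization, TimeDistributed, Dense, Activation

n_features = 13
n_classes = 224   # number of output symbols (224 characters in my setup, see the CTC section)

inputs = Input(shape=(None, n_features))
x = Conv1D(filters=200, kernel_size=11, strides=2,
           padding="same", activation="relu")(inputs)   # convolution over time
x = BatchNormalization()(x)
for _ in range(3):                                      # several recurrent layers
    x = GRU(200, return_sequences=True)(x)
    x = BatchNormalization()(x)
x = TimeDistributed(Dense(n_classes))(x)
outputs = Activation("softmax")(x)

model = Model(inputs, outputs)
model.summary()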

Due to the size of the network, I used GPU instances on Amazon AWS for faster experimentation, then ended up running the final training on my laptop. My hardware was an Intel i7 8-core CPU. Training consumed about 5 cores and took 3 days to do 40 epochs (rounds) through the data. Investing in a GPU would be more appropriate, although I feel there are other higher-priority tasks, as I will discuss in the Improvements section. Here is how the training history turned out.

model2-training-history

Since the validation error increased from epoch 20 onwards, I took the model saved at epoch 22 and evaluated its predictions. Performance on training data was encouraging. Here are two examples.

Truth: ያንደኛ ደረጃ ትምህርታቸው ን ጐንደር ተ ም ረዋል
Predicted: ያንደኛ ደጃት ምርታቸው ቆንዳር ተምረዋል

Truth: ወሬው ን ወሬ ያደረጉ ምስጢረ ኞች ናቸው
Predicted: ወሬ ውድ ወሬ ያደጉ ምስጥረኞች ናቸው

Here is another example, this time on validation data (data not seen by the model before).

Truth: በ ብዛት ግን በ የ መንገዱ ዳር በ የ በረንዳ ው ላይ ም ለ ማደር በ መከልከላቸው ማይ ሀብ ሀብ የተባለ የ ከተማው ቆሻሻ ማጠራቀሚያ ቦታ የሚያድሩ መሆኑን ገልጠዋል
Predicted: ስክእ ብዛት ብንበ የ መንዱዳር በ የ በረን እና ላይን ን ብለ ማደር በ መከል ካ ላቸው ማ ራ ሀ በተባለከት ማቆሻሽ መተረቀ መያበጣ የሚ ያጡ መሆኑ ን ጋልተዋል

It's far from perfect, but it has achieved one of the first steps in machine learning, which is to have a model capable of fitting the training data. Word Error Rate (WER) is one way of evaluating speech recognition models. You can find my analysis of that on GitHub as well.

The full picture

To summarize, here is the full picture of my approach.

speech-recognition-full-picture

Improvements

Going forward, there are several interesting avenues to build on. Let me list a few in order of priority.

1. Data
3k hours of speech used to be considered sufficient for training. Big tech companies now use up to 100k hours and beefy hardware. Only 20 hours of data is available for Amharic. More data is needed, with different speakers and texts. Data can also be synthesized: for example, an audio clip can be duplicated and background noise added to the duplicate, which helps the model become robust to noisy audio (a minimal sketch of this idea follows below).
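
Here is a hedged sketch of that augmentation idea. The file name is hypothetical, librosa is just one way of loading the audio, and the noise level is an arbitrary placeholder.

import numpy as np
import librosa

signal, sr = librosa.load("clip.wav", sr=16000)   # hypothetical clip

# add Gaussian noise scaled relative to the signal's own energy
noise_level = 0.05                                # arbitrary placeholder
noise = np.random.randn(len(signal)) * noise_level * np.std(signal)
noisy_signal = signal + noise

# noisy_signal can now be paired with the original transcription
# and used as an extra training example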

2. Language Model
State-of-the-art systems use DL together with a Language Model (LM). Wouldn't it be nice to take the predicted text and pass it through a spell-correction tool? LMs go even further. Let's say the network is confused between "ተስፋ አለው" and "ተስፋ አለሁ". It may well decide to predict "ተስፋ አለሁ" if that has the higher probability. An LM would be trained separately on an Amharic text corpus, with no speech involved, and it would know that "ተስፋ አለው" is more probable than "ተስፋ አለሁ". As opposed to a spell corrector, which works after a bunch of words have been written, an LM in speech recognition has a tighter integration with the network: the network predicts probabilities based on the speech, and the LM helps pick the likely words (a toy example of this kind of rescoring follows below). If you think about it, humans do something similar when listening to speech. Ideally we wouldn't use an LM at all and the network would predict the most likely sequence of words by itself. This is called end-to-end learning, and the final goal of deep learning researchers is to get to a point where no pre- or post-processing is required; the idea is to have the network do the pre-processing in its initial layers and the post-processing in its final layers. Doing so requires much more training data, because we would be training the model not only on how to translate speech utterances into characters, but also on how people normally fit words together in the language. So it's easier to train on textual data separately and then use the trained LM to aid the acoustic model.
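
Here is a toy sketch of that rescoring idea. It is entirely illustrative: the acoustic scores and word probabilities are made up, and a real system would integrate the LM into the beam search rather than rescore two finished strings.

import math

# made-up acoustic scores (log-probabilities from the network) for two candidates
candidates = {
    "ተስፋ አለሁ": -3.1,   # the network slightly prefers this one
    "ተስፋ አለው": -3.3,
}

# made-up language model: probability of the second word following "ተስፋ"
bigram_prob = {
    ("ተስፋ", "አለው"): 0.020,
    ("ተስፋ", "አለሁ"): 0.001,
}

lm_weight = 1.0
best = max(
    candidates,
    key=lambda text: candidates[text]
    + lm_weight * math.log(bigram_prob[tuple(text.split())]),
)
print(best)   # ተስፋ አለው: the LM outweighs the small acoustic preference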

3. Model
We have proved to ourselves that the model is capable of fitting the training data given to it. The next step is to look into regularization to get good validation accuracy. This problem would benefit from having more data as well. Different network architectures can be investigated too.

4. Features
As mentioned under Language Model, the goal is to have end-to-end training, so that we wouldn't need any pre-processing at all. Deep Speech 2 goes a step in this direction by using spectrogram features. There is also some research on feeding in the raw audio signals directly, although it's still in its infancy.

Wrap up

That's it! What seemed impossible until recently is now just around the corner. Within a matter of years we could see Amharic speech recognition being widely used. Key to achieving this goal is open research. If you are interested in the details, please comment or head over to the repository where all my code is available. Thanks for reading.

Update

  • I’ve presented my work at AAU. You can find the slides here.
  • A follow-up post can be found here.

Mozilla DeepSpeech demo

I had a quick play with Mozilla's DeepSpeech. It's a speech recognition engine written in TensorFlow and based on Baidu's influential paper on speech recognition: Deep Speech: Scaling up end-to-end speech recognition.

First I tested it on an easy audio clip. Recognition was excellent.

➜  deepspeech models/output_graph.pb OSR_us_000_0060_16k.wav models/alphabet.txt models/lm.binary models/trie
Loading model from file models/output_graph.pb
2018-03-18 12:29:45.199308: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 1.470s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 3.867s.
Running inference.
the horse trotted around the field at a brisk pace find the twin who stole the pearl necklace cut the cord that binds the box tightly the red tape bound the smuggled food look in the corner to find the tan shirt the cold drizzle will help the bond drive nine men were hired to dig the ruins the junkyard had a moldy smell the flint sputtered and lit a pine torch soak the cloth and round the sharp or odor
Inference took 85.845s for 58.258s audio file.

Next, a harder clip with multiple speakers. This time it performed poorly, and words got concatenated into long gibberish strings. Increase your volume because this audio is very faint.

➜ deepspeech models/output_graph.pb spkr0.wav models/alphabet.txt models/lm.binary models/trie
Loading model from file models/output_graph.pb
2018-03-18 12:14:09.931176: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 1.440s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 3.872s.
Running inference.
thsalwilljustdoitofntheirfarersothengsimplewefheraway a lot of these dimensions then knew the right way i got something that has all these different masesleavegeenyowothavnoubeewolbeoigisallsortsofviscionsooutmhepolthegoman oundobentyaleininkinoyeihadefferadepraredyshouldbetepridagaitintrytgetdeplenfonhimweresetop for him he roomisitisreallyounderyouoaothinktheyonshidinseholsateayoegoi think we can get pepletocomerethitbutthessasyy osbicsfin volunteer should i pursue that ohdeathetlyyeasotheypay originally they decided not to do go in to the speech soutnutterweerthe'lstillbesoontovolteerf essesfai'comeilityeapeople that are not linguiswereaninaerayaisisabotwerditoelisoreelyeeladersaemlyutbetiio'geeouteadenyersasmeeveeemipunyathealfgethertheothertaetisistecao ng of some sort nowentatatdayihatatjusbefinyge else also atibuelsutorsthegolsandwhatyouknow you can get this and in a gooelyi'lthagotreusedmisstabutyouoappreciateyouknow you can tell him that youll give him a saidhmiytodthe transcripts when they come back koyyitanywaytohvetetransferncesotitiavetoaneiiharesontbletetoconcentabotdoingthebegivingthemtocee immediately because a esissousaiowthispanofstoer aybeofenomegredpornt transfofstageronnanactygivasheelereris owdthostigoboiliketastritflerinhelpotsaton transcoptsreyngphingsyebuteskingsareweninglead out in her a itspeckisityeligotissetingitoyourhonoriagepasigiflgotlios o i ah did bunch of our mivingand still doing a much of our mihingnomin the midst of doing the pefilsforonmin eaven hours to do to copeandit'letayeanotherwordidyoucompettivowel'sidpenstabidsoijustbitasalotodatanedscopedfoinplacinoutofthemovewithanatape ill be done like in about two hours and so a in that for wolbeabltoinworkbymoremeeingssohomrtingofthedodn'tisbothattaisdhatwoncancesonidispreeent eneratingof one alsoanorymonisgeneratingtaclon two hobbies what all soon onsideottewfoaisseplantytososimpllysavedtisyeseeenoheeesilivinganentospiceooknothateodesarethehebilesfombrocesneswhichaeregenneratymone otovethem and for the for a hundred forty hours set and so they they were to degbiesperfileand we had
Inference took 434.213s for 300.000s audio file.

One last difficult one with a single speaker:

➜ deepspeech models/output_graph.pb LDC93S1.wav models/alphabet.txt models/lm.binary models/trie
Loading model from file models/output_graph.pb
2018-03-18 12:06:25.093521: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 1.458s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 4.011s.
Running inference.
she had a ducsuotangresywathorerall year
Inference took 14.435s for 2.925s audio file.

Side notes

DeepSpeech currently supports only 16 kHz input. Other sample rates throw the following error:

assert fs == 16000, "Only 16000Hz input WAV files are supported for now!"

Use ffmpeg to convert to 16 kHz. Example,

ffmpeg -i OSR_us_000_0060_8k.wav -acodec pcm_s16le -ac 1 -ar 16000 OSR_us_000_0060_16k.wav

Acknowledgements

I used audio files from these sources:

How to add Swagger UI to a Play application

Do you want to document your Play application with Swagger? You may have decided Swagger is the right choice for you, but give it a try and you'll be faced with poor documentation and issues across different Play versions. This post goes through setting up Swagger for a Play application from scratch. If you'd rather cut to the chase and start from a working setup, you can clone swagger-tutorial and adapt the sample-swagger project to your needs.

Background

The core of Swagger is a specification for describing a REST API. It specifies how an API should be described using the JSON format. Built around this is Swagger UI, which can parse such a specification and present it nicely. It also allows test calls from within the UI, which is very useful during development.

We will do the following

  • Create a bare Play app
  • Add a simple REST endpoint
  • Add Swagger annotation to the endpoint
  • Add swagger json endpoint
  • Setup Swagger UI
  • Use Swagger UI to make a test call

This post is tested on Play version 2.5.x.

Create a bare Play app

Let’s start off with a working Play application. If you already have a working Play application you can use that and skip to the next section. Bear in mind this post is tested on Play 2.5.x.

If you want to start off with a sample Play project, clone my repo,

git clone git@github.com:tilayealemu/swagger-tutorial

There are two projects.

sample-play: a bare Play application

sample-swagger: a Play application with one end point and a swagger doc for it. It has all the changes discussed in this post applied to it.

Build and run the project. You should be able to see the root page like so.

sample-play

Add a simple REST endpoint

Let's add a simple REST endpoint. We create a "hello" service that takes a name and responds with "Hello <name>".

If you cloned the sample-play project add this to the HomeController (if you have your own controller then use that).

def sayHello(name: String) = Action {
  Ok(s"hello $name")
}

Then define the endpoint in your routes file.

GET        /hello/:name         controllers.HomeController.sayHello(name:String)

Now call the endpoint.

hello-endpoint

We are now ready to move to Swagger.

Add swagger dependencies to project

Before adding Swagger annotations, add this dependency.

libraryDependencies += "io.swagger" %% "swagger-play2" % "1.5.3"

Then go to your controller, such as HomeController, and import some Swagger dependencies.

import io.swagger.annotations.{Api, ApiParam, ApiResponse, ApiResponses}
import play.api.mvc.{Action, Controller}

Annotate the controller class with @Api, and annotate the method too:

@Api
class HomeController extends Controller {

  @ApiResponses(Array(
    new ApiResponse(code = 400, message = "Invalid name supplied"),
    new ApiResponse(code = 404, message = "Name not found")))
  def sayHello(@ApiParam(value = "Name to say hello") name: String) = Action {
    Ok(s"hello $name")
  }
}

Build your code to ensure there are no compilation problems. We can now generate a swagger.json that contains these annotations.

Add swagger json endpoint

Go to application.conf and add this to enable Swagger

play.modules.enabled += "play.modules.swagger.SwaggerModule"

Then in your routes file add the swagger.json endpoint. This will list the resources in your service.

GET /swagger.json controllers.ApiHelpController.getResources

Then when you access http://localhost:9000/swagger.json you will get the specification for your endpoint.

swagger.json

This is the core of what Swagger is. Next up is to use Swagger UI for a pretty view of this specification.

Setup Swagger UI

Copy the Swagger UI files from the main GitHub repo to your project. You only need the dist directory.

git clone git@github.com:swagger-api/swagger-ui
mkdir swagger-tutorial/sample-play/public/swagger
cp -R swagger-ui/dist/* swagger-tutorial/sample-play/public/swagger

swagger-ui-files

Then update your routes file. We will serve Swagger UI under /docs/.

GET        /docs/               controllers.Assets.at(path="/public/swagger",file="index.html")
GET        /docs/*file          controllers.Assets.at(path="/public/swagger",file)

You can go ahead and visit localhost:9000/docs/, but what you will see is documentation for a sample pet store REST API. Let's update Swagger UI to use our swagger.json instead. Go to public/swagger/index.html and change the pet store URL:

url: "http://localhost:9000/swagger.json"

You should now see your service listed.

swagger-ui

 

Use Swagger UI to make a test call

Click on the /hello/{name} endpoint. Click "Try it out", type a name and click Execute. You should get a response body and a 200 response code.

swagger-test-call

That’s it! You can head over to swagger annotations for more details on annotating services. Remember, if it doesn’t work for you, simply take sample-swagger and change it as you wish.

All comments welcome.

Emperor’s Proposition

A friend likes sending me puzzles at midnight. This one is called the Emperor's Proposition. Here it is (slightly modified, but the idea is the same),

You are in prison. The guard will set you free if you win a game. You are given 100 black marbles, 100 white marbles and 2 bowls. You may place the marbles in either bowl however you like, but you are not allowed to leave any marble out. Once you are finished, the guard will choose a bowl at random and take out a single marble from it at random. If the marble is black, you win; if it's white, you lose. What is the best way of distributing the marbles to maximise your chance of winning?

There are many pages with the answer and an explanation of the conditional probabilities. Spoiler alert: the answer is coming up in the next sentence. The best choice is to put a single black marble in one bowl and all the other marbles in the other. Below is how the probability of winning varies as the composition of the marbles is varied. We only need to consider the composition of one of the bowls because there are just 2 independent variables in this problem.
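
To see why, here is the winning probability worked out in a few lines of Python. With a single black marble alone in bowl 1, the first term is a guaranteed win and the second term is still close to a half:

# P(win) = 0.5 * P(black | bowl 1) + 0.5 * P(black | bowl 2)
black_1, white_1 = 1, 0        # bowl 1: a single black marble
black_2, white_2 = 99, 100     # bowl 2: everything else

p_win = 0.5 * black_1 / (black_1 + white_1) + 0.5 * black_2 / (black_2 + white_2)
print(p_win)                   # about 0.749, versus 0.5 for an even 50/50 split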

 

kings-preposition-probabilities

 

It's quite a slippery surface 🙂 And here is the contour graph,

kings-preposition-contour

Python code,


%matplotlib notebook
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
import numpy as np

fig = plt.figure()
ax = fig.add_subplot(projection='3d')

def div(a, b):
    # guard against division by zero when a bowl is empty
    return a / np.maximum(b, 1)

# make data: X = #black in bowl 1, Y = #white in bowl 1
X = np.arange(0.0, 101.0, 1.0)
Y = np.arange(0.0, 101.0, 1.0)
X, Y = np.meshgrid(X, Y)
# P(win) = 0.5 * P(black | bowl 1) + 0.5 * P(black | bowl 2)
Z = 0.5*(div(X, (X+Y)) + div((100-X), (200-(X+Y))))

# surface plot
surf = ax.plot_surface(X, Y, Z, cmap=cm.coolwarm,
                       linewidth=0, antialiased=False)
ax.set_zlim(0.0, 0.75)
ax.zaxis.set_major_locator(LinearLocator(10))
ax.zaxis.set_major_formatter(FormatStrFormatter('%.02f'))
ax.set_xlabel('#black in bowl 1')
ax.set_ylabel('#white in bowl 1')
ax.set_zlabel('Probability of winning')
fig.colorbar(surf)
plt.show()

# contour plot
plt.figure()
CS = plt.contour(X, Y, Z)
plt.clabel(CS, inline=1, fontsize=10)
CB = plt.colorbar(CS, shrink=0.1, extend='both')
plt.show()


Visualising prime numbers

Prime numbers are a lot of fun. One area I think is worth exploring is visualising them in different ways to find patterns.

For example, mathematicians were recently excited by the finding that the last digits of prime numbers show some interesting statistical properties. For centuries mathematicians have mulled over prime numbers. It is quite plausible that the next big step will be made by looking at their statistical properties and visualising them in different ways. Since they are stubbornly "random", the challenge is to find transformations and visualisations that let us see new patterns we didn't know existed. I put random in quotes because although they seem random, the underlying mechanism that generates them is quite precise.

So here are some fun images of them.

To make things easier I downloaded a file with a list of prime numbers from this site. There are a total of 78498 primes in the file (or perhaps one fewer, because the file contains the number 1, which is not prime).

First, the first 100 prime numbers in all their glory,

 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 
 89  97 101 103 107 109 113 127 131 137 139 149 151 157 163 167 
 173 179 181 191 193 197 199 211 223 227 229 233 239 241 251 257
 263 269 271 277 281 283 293 307 311 313 317 331 337 347 349 353
 359 367 373 379 383 389 397 401 409 419 421 431 433 439 443 449
 457 461 463 467 479 487 491 499 503 509 521 523 541

and the last 5 in my file,

 999953 999959 999961 999979 999983

Let’s plot the first 100 primes.

Welcome the misbehaving numbers. Let’s plot the difference between successive primes for all primes (remember, around 78 thousand of them). This is known as the prime gap.

The white lines are odd numbers. Since all primes except 2 are odd, the gap between them is always even. The distance between primes seems to increase in general, but the bottom of the graph remains dark as we progress to the right. It shows that even for large primes, we find primes whose gap is as little as 2. This is actually related to one of the unanswered questions about primes: are there infinitely many prime pairs with a gap of 2?

If you find the previous graph too dense, here is another one for the first 100 primes.

How about we plot the prime gaps as a 2D image? To arrange our 78k primes into a more or less square image, we draw the prime gaps row by row, with 280 primes in each row.
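
Here is a minimal sketch of how such an image can be produced (my own reconstruction, not the exact notebook code). It sieves the primes below one million and lays the gaps out 280 per row:

import numpy as np
import matplotlib.pyplot as plt

# Sieve of Eratosthenes for the primes below one million (78498 of them)
limit = 1000000
is_prime = np.ones(limit, dtype=bool)
is_prime[:2] = False
for n in range(2, int(limit ** 0.5) + 1):
    if is_prime[n]:
        is_prime[n * n::n] = False
primes = np.flatnonzero(is_prime)

gaps = np.diff(primes)                  # the prime gaps
width = 280
rows = len(gaps) // width
image = gaps[:rows * width].reshape(rows, width)

plt.imshow(image, cmap="gray")
plt.title("Prime gaps, 280 per row")
plt.show()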

There are so many questions we can ask about this plot. For example, how frequently are the prime gaps themselves divisible by, say, 23?

It's like watching the stars 🙂 Of course, the questions we're asking may not be the right ones. But seemingly silly questions may produce interesting answers. For example, who would have guessed that a number is divisible by 3 if the sum of its digits is divisible by 3? Numbers are full of surprises.

Let's do one more. This one is inspired by the Sierpinski triangle.

A common way of generating it is by recursively drawing a solid triangle and "removing" smaller triangles from it, then removing even smaller triangles from those, and so on. However, such an object can also be generated by successively applying the logical XOR.

It works like this. Write as many zeros as you like and replace the middle zero with a 1. Say we start with 000010000. This will be our "seed".

row1 ....1....

Next, compute the second row: for each position, take the row immediately above and XOR the two digits adjacent to that position. So the second row will be 000101000.

row2 ...1.1...

We continue this process to compute the rows below. The result is the interesting structure you saw above.

....1....
...1.1...
..1.1.1..

We can also start off this exercise with a different seed. How about starting with a row that represents even numbers as 1 and odd numbers as 0? We take the numbers 1, 2, 3, 4, 5 …, 198, 199 and transform them to 0, 1, 0, 1, 0, …, 1, 0. Generate 100 rows and this beautiful pattern emerges.
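
Here is a short sketch of that XOR construction (my own illustration). The same function works for this even-number seed and for the prime seed that follows.

import numpy as np
import matplotlib.pyplot as plt

def xor_rows(seed, n_rows):
    rows = [np.array(seed, dtype=int)]
    for _ in range(n_rows - 1):
        prev = rows[-1]
        new = np.zeros_like(prev)
        # each cell is the XOR of its two neighbours in the row above
        new[1:-1] = prev[:-2] ^ prev[2:]
        rows.append(new)
    return np.array(rows)

# even numbers from 1 to 199 encoded as 1, odd numbers as 0
seed = [(n + 1) % 2 for n in range(1, 200)]
plt.imshow(xor_rows(seed, 100), cmap="binary")
plt.show()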

How about prime numbers? After experimenting a little I decided to leave out 2 and start from 3. I took all the prime numbers between 3 and 199 and encoded them as 0s and 1s just like above.

And now it's raining triangles!

I could go on forever playing with numbers, so I'd better stop here. Hope you had as much fun as I did. All graphs were generated by a Python script. You can access the HTML version of my Jupyter notebook and its source code.

Find largest image in an html page

I recently ran into the problem of finding the largest image in an HTML page. This can be useful in page summarisation tasks, for example.

I found this Scala project, which looked promising but failed to compile for me,
https://github.com/GravityLabs/goose

 

Here is a quick PHP script that does the job.

https://gist.github.com/tilayealemu/2519f54222ad9a7d7ff2b294b2b32a09

Example call,

http://localhost/largest.php?url_without_http=bbc.co.uk

largest-image-php

Notes:

  • Pass the url parameter without the "http://" part
  • The "is_suitable" function does additional checks, such as skipping images that look like banners
  • If you get bad results, consider extending the function to suit your needs. For my personal use I have additional checks on the name and URL of the image.

Decision trees and SQL (Part 2 – SQL as decision tree)

In the previous post I talked about Decision Trees. In this post I will talk about how to represent decision trees in SQL.

But first, why would you want to do that? There are already plenty of tools: Weka (a data mining tool), Scikit-Learn (the mainstream machine learning library at the moment), H2O and Spark MLlib (both rooted in big data). However, with all of them there are several steps you have to take,

  1. Feed your data to the library/framework
  2. Train
  3. Persist the model
  4. Figure out the model’s API
  5. Bundle your model with your application together with any dependencies it may have

Every step is a hurdle against quick development and deployment. If you have your data in a SQL-like database, wouldn't it be nice to just have a SQL query that expresses a decision tree? If the trained model is a SQL string with no dependencies, then steps 3 to 5 are no longer needed.

An example

A popular dataset for decision trees is the Iris dataset. It was first gathered in 1936 and has since been analysed in numerous research papers. Each row in the dataset represents a sample flower. There are three different species of the Iris flower and all of them are equally represented in the dataset (so there are 50 of each).

iris

Here are the five columns,

  • SepalLength
  • SepalWidth
  • PetalLength
  • PetalWidth
  • Name

What we want to do is learn from the data to produce a decision tree that takes the first four columns as input and predicts the species name. I wrote a small script that uses scikit-learn to learn a decision tree from this data and translate it to SQL.

There are plenty of tutorials on the web about using scikit-learn, so I'll just point you to my small script that outputs the SQL. It uses scikit-learn to train a decision tree on the 150 rows, then takes the resulting tree and translates it to SQL. Here is what the result looks like,

dtsql

It's beautiful 🙂 Let's look at it in a bit more detail. The first case shows that the Iris Setosa species must have a very distinctive petal length. The second case shows that when the petal length is ambiguous, petal width is used to narrow things down.
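
If you want a feel for how the translation works without reading the full script, here is a hedged sketch (not my actual script; the column names, table name and tree depth are placeholders). It walks scikit-learn's fitted tree and emits nested CASE expressions:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3).fit(iris.data, iris.target)

columns = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]
names = iris.target_names
tree = clf.tree_

def to_sql(node=0):
    if tree.children_left[node] == -1:                 # leaf: return the majority class
        return "'" + names[tree.value[node][0].argmax()] + "'"
    col = columns[tree.feature[node]]
    threshold = tree.threshold[node]
    left = to_sql(tree.children_left[node])            # condition is true
    right = to_sql(tree.children_right[node])          # condition is false
    return ("CASE WHEN %s <= %.2f THEN %s ELSE %s END"
            % (col, threshold, left, right))

print("SELECT " + to_sql() + " AS Name FROM iris")     # 'iris' table name is a placeholder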

By the way, the example is not far fetched. An interesting and recent example is how a farmer used machine learning to automatically sort Cucumbers. You can read about it on Google’s blog.

Thoughts on big data

  • If the data is huge then a distributed solution becomes necessary.
  • The final output after a lot of churning is still going to be a decision tree, which can be expressed as SQL.
  • SQL can be run not just on old relational databases, but on new ones too, such as Hive and MemSQL.

Final notes

Although NoSQL generated a lot of noise, relational databases and SQL still kick ass. We will always deal with tabular data, and SQL will remain a powerful language for expressing database operations. On the other hand, the further our data analysis tools live from our databases, the more mundane work there is to do. Machine learning doesn't have to be all new technology. A simple SQL query could be just what you need.

Links

Iris dataset
Demo script
DT in scikit