Decision trees and SQL (Part 1 – Decision trees)

Decision trees and simplicity

Heavy clouds today? Then it will probably rain, because we have, many times in the past, seen rain in cloudy conditions. But what if it is cloudy and we are in the middle of summer? Then we can say it is less likely to rain. Every machine learning method has its own way of learning from past data and representing that knowledge.

Decision trees do it in a straightforward way. It’s their simplicity that makes them attractive. Simplicity here is not about finding a mediocre solution in a short time with a vision of applying more complex and better solutions later. It’s about using a method that is good enough and not overcomplicating things. A simple model can be all we need.

An example

Suppose we have data with four columns or attributes:

  • Accident reported (true/false)
  • Holiday (date)
  • Rush hour (time)
  • Congestion (true/false)

We try to use the first three attributes to predict the congestion column. Here is an example decision tree that does this. To make use of it, we start by answering the top question: accident reported or not? If an accident is reported we predict there will be congestion. If not, then we are still a bit unsure. So we ask more questions. Our tree is using two other attributes: whether the day is a holiday and whether the time of day is rush hour.

 

[Figure: decision tree for predicting congestion from accident, holiday and rush hour]
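To make the structure concrete, here is a minimal sketch of one plausible reading of that tree, written as nested conditionals in Scala. The exact branches below the accident check are an assumption for illustration; a real tree would be learned from data rather than hand-written.

// One plausible hand-written version of the example tree: predicts congestion
// from accident, holiday and rush hour. The branches below the accident check
// are assumptions, not taken from the figure.
def predictCongestion(accident: Boolean, holiday: Boolean, rushHour: Boolean): Boolean =
  if (accident) true        // accident reported: predict congestion
  else if (holiday) false   // no accident and a holiday: predict no congestion
  else rushHour             // otherwise, congestion only during rush hour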

A few more interesting observations:

  • Accidents result in a prediction of congestion, irrespective of the holiday and rush hour data. This might be wrong; for example, an accident on Christmas Day may not result in congestion. But there is nothing stopping us from extending the left branch of the accident node to do some more checks.
  • Our tree considers only three attributes to come to a conclusion. However, other attributes may need to be taken into account for better accuracy. Also, one or more of the attributes considered here (accident, holiday, rush hour) may have to be removed if it turns out they don’t make the predictions any more accurate.

Some interesting questions

How does a decision tree know in what order to put the nodes? For example, why not check for rush hour before checking for an accident? Our tree puts accidents first because it believes accidents are a major factor. A decision tree comes to such conclusions by looking at data. Here is example data that would lead to that conclusion:

  • There was congestion 30% of the time, and when there was congestion:
    • 90% of the time – accidents were reported
    • 50% of the time – it was rush hour
    • 1% of the time – it was a holiday

The decision tree algorithm concludes from this data that accidents are the best factor to split the data on. A split divides one big group into two smaller groups. The algorithm keeps splitting each group into smaller pieces until each group is reasonably homogeneous. For example, if a certain group is composed of 50% congestion and 50% non-congestion data points, then the group must be split further because it is hard to decide who has ‘won’. If a group is composed of 98% congestion and 2% non-congestion points, then it can be considered a good split and the whole group is assigned the congestion label.
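How is the “best” split measured? Most decision tree learners (ID3, C4.5, CART) use an impurity measure such as entropy and pick the attribute that reduces it the most. Here is a minimal Scala sketch of entropy-based information gain for boolean attributes; the data representation is invented purely for illustration.

def log2(x: Double): Double = math.log(x) / math.log(2)

// Entropy of a group of congestion labels: 0 when the group is pure, 1 when it is 50/50.
def entropy(labels: Seq[Boolean]): Double =
  if (labels.isEmpty) 0.0
  else {
    val p = labels.count(identity).toDouble / labels.size
    if (p == 0.0 || p == 1.0) 0.0 else -(p * log2(p) + (1 - p) * log2(1 - p))
  }

// Information gain of splitting on a boolean attribute: how much the weighted
// entropy drops after the split. The attribute with the highest gain goes on top.
def infoGain(rows: Seq[(Boolean, Boolean)]): Double = { // (attribute value, congestion)
  val (whenTrue, whenFalse) = rows.partition(_._1)
  val weighted = (whenTrue.size * entropy(whenTrue.map(_._2)) +
                  whenFalse.size * entropy(whenFalse.map(_._2))) / rows.size
  entropy(rows.map(_._2)) - weighted
}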

And how do we know which attributes to use? There are three in our example. How about in other trees? The answer is multifaceted.

  • First of all, we are limited by available data. We stick to what we have or we do some work to obtain data that we think can help improve accuracy.
  • Next, we use domain knowledge. If we have one thousand attributes available to us, then using all of them will probably result in a complex tree that doesn’t perform well. For example, knowledge of London weather is probably useless for predicting congestion in New York. There is always a risk in drawing such conclusions, because data can sometimes show surprising results. On the other hand, we also run the risk of overcomplicating our decision tree, also known as overfitting. So we choose the option with less risk; in this case, we leave out an attribute that we are confident is irrelevant.
  • Third, a decision tree itself learns which attributes are important. Generally, the less important an attribute is, the lower in the tree it appears. So pruning the tree translates into leaving out some attributes. To put it another way, we are telling the tree: “Checks that appear further down the tree go into too much detail. Leave them out and reach a conclusion earlier.” (See the sketch after this list.)
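As a rough illustration of that last point, here is a minimal sketch (my own, under assumed types) of a tree structure and a depth-based prune: everything deeper than a chosen depth collapses to the majority label of the data beneath it, which effectively drops the attributes that only appear far down the tree.

// A tiny tree type: either a leaf with a prediction, or a test on a named attribute.
sealed trait Tree
case class Leaf(congestion: Boolean) extends Tree
case class Node(attribute: String, ifTrue: Tree, ifFalse: Tree,
                majority: Boolean) extends Tree // majority label of the rows under this node

// Prune by depth: checks deeper than maxDepth are replaced by the majority label.
def prune(tree: Tree, maxDepth: Int): Tree = tree match {
  case leaf: Leaf                               => leaf
  case Node(_, _, _, majority) if maxDepth <= 0 => Leaf(majority)
  case Node(attr, t, f, m)                      => Node(attr, prune(t, maxDepth - 1), prune(f, maxDepth - 1), m)
}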

Wrap up

That’s pretty much my overview of decision trees. If you want to delve into the technical details a bit more, Chapter 3 of Tom M. Mitchell’s Machine Learning book has an excellent explanation of decision trees. For a quick overview see the Learning from Data: Decision Trees slides from the University of Edinburgh.

How can we apply decision trees in a relational database or a database that supports SQL? This will be the topic of my next post.

Calling Rserve from Java

R has some very useful statistics libraries. It’s an excellent language for manipulating and graphing data. What would take multiple lines of Java (or even Scala) code can be written elegantly in R without the code looking obscure. One downside is that all that good stuff is hidden from the software industry, which mostly uses mainstream languages such as Java, C# and C++.

Dockerr

That’s why I recently put together dockerr on GitHub, a dockerized R server with sample Java/Scala clients.

Let me quickly show you how to talk to Rserve. First, let’s run the server:

docker run -p6311:6311 tilayealemu/dockerr

Then let’s write our client. The following code is in Scala. You can see full examples on the github page linked earlier.

import org.rosuda.REngine.Rserve.RConnection

// Connect to the Rserve container started above
val connection = new RConnection("127.0.0.1", 6311)
// Define an R function on the server, then call it
connection.eval("multiply=function(a,b){return(a*b)}")
val result = connection.eval("multiply(4,5)")
println("Result: " + result.asDouble())

That’s it! We sent a request to Rserve and printed the result.
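To compile the client you need the Rserve Java client library on the classpath. It is published to Maven Central under the org.rosuda.REngine group; the version below is only an example and may not be the latest.

// build.sbt (or the equivalent Maven/Gradle dependency)
libraryDependencies += "org.rosuda.REngine" % "Rserve" % "1.8.1"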

Why dockerr?

Any engineer would naturally want to avoid complications and look for a solution that doesn’t involve calling yet another service. A new service is an additional worry: you need a host, monitoring, upgrades. However, the only mainstream language with excellent coverage of statistical and machine learning methods is Python, through scikit-learn and the supporting data analysis libraries Pandas, NumPy and Matplotlib. Although Weka is written in Java, it doesn’t have a strong community like R and scikit-learn do. Being written in Java has its downsides too: import statements, class declarations, the OO paradigm and its general verbosity make it less suitable for data analysis tasks. Here, for example, are code fragments that work out the mean of three numbers in R, Scala and Java.

R:

mean(c(1, 2, 3))

Scala:

val x = List(1.0, 2.0, 3.0)
x.sum / x.length

Java:

import java.util.Arrays;
import java.util.List;

List<Double> numbers = Arrays.asList(1.0, 2.0, 3.0);
double mean = numbers.stream().mapToDouble(Double::doubleValue).sum() / numbers.size();

One doesn’t simply make their code compile against a data analysis library and start using it in production. There is a bit of work to do before that happy moment: data cleaning and several iterations of data plumbing, data visualisation, training, tweaking parameters, testing and graphing results. So it’s wise to pick a language that makes these tasks relatively straightforward. And, as I discovered, it’s wise not to spend too much time searching for good data analysis libraries written in Java.

There are also MLlib and H2O, powerful machine learning frameworks that run on the JVM. They put big data at the core of their architecture. The range of algorithms they support and the size of their communities are no match for what R and Python have to offer, so support is more limited: if the method you are after isn’t implemented, you may have to write it yourself or wait until someone does. Their suitability for big data comes at a cost too, particularly for MLlib, since setting up the framework and following the Spark programming model can be overkill for small problems. H2O’s SDKs for R and Python make it more attractive here. Running an H2O server is easy as well: you just download and run the jar.
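For reference, that really is the whole startup step (the jar filename depends on the release you download):

java -jar h2o.jar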

Since I talked about Rserve, I should also mention JRI, the Java-R interface. It provides a Java API to a locally installed R. It sounds nice at first, and even though I haven’t investigated it in anger, I wouldn’t choose it as a solution because it leads to a monolithic application running inside a single JVM. A microservice architecture works better here.

I’m now convinced that calling Rserve from Java isn’t a bad idea after all. Dockerr is just a start; more can be done to improve an Rserve container. It can be set up to load functions only once, and you can put a number of containers behind a load balancer for better performance. One big flaw is that the containers won’t be able to work together. If you have huge data and an algorithm that needs multiple machines to crunch it, then H2O and MLlib are worth considering.
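One way to read the “load functions only once” idea, sketched below, is to source a script of R functions a single time when the client starts and reuse the same connection for every request. The script path is hypothetical; in dockerr you would bake such a script into the image.

import org.rosuda.REngine.Rserve.RConnection

// One connection per client process; the R functions are defined once at startup.
val connection = new RConnection("127.0.0.1", 6311)
connection.eval("source('/opt/r/functions.R')") // hypothetical script shipped with the image

// Every later request just calls the already-defined functions.
def score(a: Double, b: Double): Double =
  connection.eval(s"multiply($a, $b)").asDouble()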

There will be other solutions implemented in other languages and frameworks too; no single technology solves every problem. Data manipulation, graphing, availability of algorithm implementations and performance are some of the factors behind a data scientist’s choice of technology. Anyone who is serious about data analysis will need to stay flexible about technologies. Remember the law of the instrument 🙂