Calling Rserve from Java

R has some very useful statistics libraries. It’s an excellent language for manipulating and graphing data. What would take multiple lines of Java (or even Scala) code can be written elegantly in R without the code looking obscure. One downside is that all that good stuff is hidden from much of the software industry, which mostly uses mainstream languages such as Java, C# and C++.

Dockerr

That is why I recently put together dockerr on GitHub, a dockerized R server with sample Java/Scala clients.

Let me show you quickly how to talk to Rserve. Here we connect to our Rserve instance (running locally on Docker) and send a small multiplication task to it. The code is in Java. You can see full examples on the GitHub page linked earlier.

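import org.rosuda.REngine.REXP;
import org.rosuda.REngine.Rserve.RConnection;
// Connect to the local Rserve instance, define an R function on it, then call it
RConnection connection = new RConnection("127.0.0.1", 6311);
connection.eval("multiply=function(a,b){return(a*b)}");
REXP result = connection.eval("multiply(4,5)");
System.out.println("Result: " + result.asDouble());
connection.close();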

Why dockerr?

Any engineer would naturally want to avoid complications and look for a solution that doesn’t involve calling yet another service. A new service is an additional worry: you need a host, monitoring and upgrades. However, the only mainstream language that has excellent coverage of statistical and machine learning methods is Python, through scikit-learn and the supporting data analysis libraries pandas, NumPy and Matplotlib. Although Weka is written in Java, it doesn’t have a strong community like R and scikit-learn do. Being written in Java has its downsides too. Import statements, class declarations, the OO paradigm and Java’s general verbosity make it less favourable for data analysis tasks. Here, for example, are code fragments that work out the mean of three numbers in R, Scala and Java.

# R
mean(c(1, 2, 3))
// Scala
val x = List(1.0, 2.0, 3.0)
x.sum / x.length
// Java
import java.util.Arrays;
import java.util.List;
List<Double> numbers = Arrays.asList(1.0, 2.0, 3.0);
double x = numbers.stream().mapToDouble(Double::doubleValue).sum() / numbers.size();
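
To be fair to Java, the fragment can be tightened a little with the stream API's built-in average(). The sketch below wraps it in a small class so it runs on its own. The class name MeanExample and the choice to return NaN for an empty list are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: the mean of a list of numbers using DoubleStream.average().
class MeanExample {

    // Returns the arithmetic mean, or NaN for an empty list
    // (NaN is an arbitrary choice made for this sketch).
    static double mean(List<Double> values) {
        return values.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        System.out.println(mean(Arrays.asList(1.0, 2.0, 3.0))); // prints 2.0
    }
}
```

Even in this form the imports and the class declaration remain, which is the point being made about verbosity.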

One doesn’t suddenly make their code compile against a data analysis library and start using it in production. There is a bit of work to do before that happy moment: data cleaning and several iterations of data plumbing, data visualisation, training, tweaking parameters, testing and graphing results. So it’s wise to pick a language that makes these tasks relatively straightforward. And, as I discovered, it’s wise not to spend too much time searching for good data analysis libraries written in Java.

There are also MLlib and H2O, powerful machine learning frameworks that run on the JVM. They put big data at the core of their architecture. The array of algorithms they support and the size of their communities are no match for what R and Python have to offer, so support is limited. If the method you are after isn’t implemented, you may have to implement it yourself or wait until someone else does. Their suitability for big data comes at a cost too, particularly for MLlib, since setting up the framework and following the Spark programming model can be overkill for small problems. H2O’s SDKs for R and Python make it more attractive here. Running an H2O server is easy as well: you just download and run the jar.

Since I talked about Rserve, I should also mention the Java/R Interface, JRI. It provides a Java API to a locally installed R. It sounds nice at first, and even though I haven’t used it in anger, I wouldn’t choose it as a solution because it leads to a monolithic application that runs inside a single JVM. The microservice architecture works better here.

I’m now convinced that calling Rserve from Java isn’t a bad idea after all. Dockerr is just a start. More can be done to an Rserve container to improve it. It can be set up to load functions only once. You can also put a number of containers behind a load balancer for better performance. One big flaw is that containers won’t be able to work together. If you have huge data and an algorithm that needs multiple machines to crunch it, then H2O and MLlib are worth considering.
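
As a rough illustration of spreading work across several Rserve containers, here is a minimal client-side round-robin sketch. It is a stand-in for a real load balancer, not a replacement for one. The class name RserveEndpoints and the host addresses are hypothetical; each host returned would be passed to new RConnection(host, 6311) in an actual client.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: rotates through a fixed list of Rserve hosts.
class RserveEndpoints {
    private final List<String> hosts;
    private final AtomicInteger next = new AtomicInteger();

    RserveEndpoints(List<String> hosts) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("at least one host is required");
        }
        this.hosts = new ArrayList<>(hosts); // defensive copy
    }

    // Returns the next host in round-robin order; safe to call from multiple threads.
    String nextHost() {
        // floorMod keeps the index non-negative even if the counter overflows
        int index = Math.floorMod(next.getAndIncrement(), hosts.size());
        return hosts.get(index);
    }

    public static void main(String[] args) {
        // Placeholder addresses, assumed for illustration only
        RserveEndpoints endpoints = new RserveEndpoints(Arrays.asList("10.0.0.1", "10.0.0.2"));
        for (int i = 0; i < 4; i++) {
            System.out.println(endpoints.nextHost());
        }
    }
}
```

Because each request may land on a different container, any R functions a request relies on need to be defined on every container, which ties in with loading functions once at startup.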

There will be other solutions implemented in other languages/frameworks too. There is no single technology that solves all problems. Data manipulation, graphing, availability of algorithm implementations and performance are some of the factors behind the data scientist’s choice of technology. Anyone who is serious about data analysis will need to be flexible about technologies. Remember the law of the instrument 🙂

6 thoughts on “Calling Rserve from Java”

  1. Dear Tilaye, Thanks for sharing the post. I’m also a big fan of the R language and programming in general. I came to know you and your contributions via your other personal blog (http://ertale.com). I’m a big fan of Yechewata Engida by Me’aza Biru from ShegerAddis 🙂 Why am I telling you all this? You might wonder how this could be related to the post. That’s what motivated me to leave you this comment. My experience with ertale was both interesting and challenging. The challenging part was that its search engine and user experience are not well developed. If I would like to listen to all the interviews of my favorite guest, for example, the website doesn’t filter well, which means I have to scroll all over to get the result. Anyway, I decided to develop an R script/package which uses web scraping and text mining technologies. The R script enables me to listen to the whole series of interviews for a given guest without any challenges. Hola 🙂 I’m now enjoying YEngida to the fullest. I called the package ‘YEngida’ and it is available on GitHub. I am happy to share the code, so please let me know if you would like to check it out. Keep up the good work. Greetings, Haile

  2. Hi Haile. Glad you like Sheger and sorry sheger.ertale hasn’t been easy to use. For example searching for ‘Gebru’ on Yechewata Engda gives me this,
    http://sheger.ertale.com/?s=yechewata+gebru
    What’s the issue with this search result and what does your script do differently? Sheger pages used to be static pages that proved hard to maintain and navigate through. Hence the switch to WordPress last year. If search isn’t good enough I can look for other plugins.

    Agree, R is fun to use 🙂

    • Hi Tilaye, Nothing is wrong with the search engine. However, the search results are scattered over multiple pages. It would be nice if you could embed them as a playlist in WordPress. I can imagine that would need some effort. What my code does is the following: (1) Scrape all guests (e.g., title, category, hyperlink, .mp3 link). It took 4 mins to scrape all your posts until today. Please don’t punish me for doing this, I’m telling you for your knowledge 😉 I told you I am a big fan of the YEngida program 🙂 (2) Once the list is created, I preprocess the title, from which I get the release date, week/part, guest name, profession, etc. (3) I create a csv file and save it to disk. So the list is more like a playlist for me. Whenever I need to listen to a guest, for example, I go to the csv list, search for the guest name and listen by opening the link from the list. Hola. The other main reason I took on this challenge was to teach myself web scraping using the rvest package. Do you know this package? I think it is worth checking out. In case you want to see the code, I am happy to share. Please let me know.

  3. Hi Haile. I don’t know if one of them was you, but I have blocked some people in the past who seemed to be aggressively scraping content. It’s not for punishment as such. Too many requests can act like a DDoS attack and bring the server down, making the site unavailable for all users. So it seems reasonable to block such users.

    Putting that aside, I’m glad you found a solution to your problem. The number of posts on each page is limited because the fewer audio files there are on a page, the faster the page loads. I’ll keep your comments in mind. If more people raise it, I can have it so more posts show up.

    Haven’t used R’s scraper but I’m sure it’s good. Have fun.

  4. I’m not the one for sure since I have never been blocked by you. I’m happy that you consider my comment. Keep up the good work.

