Is Julia the Future for Big Data Analytics?

In many Big Data analysis blogs, at Big Data meetups and in the halls of the most recent O’Reilly Strata Conference, one of the most-discussed topics is which language is better for data analysis: Python or R. Some of the talk has even reached “religious” overtones not unlike previous discussions on Windows vs. Linux or Microsoft’s Internet Explorer vs. Mozilla Firefox.

So what’s the issue here? Why are Big Data analysts so concerned with what language to use? In my honest opinion, the root of the issue probably has more to do with the tools they learned on than anything else. But let’s briefly look at each.

Python

Python is a general-purpose scripting language that can do many things, from complex data processing and data munging to implementing mathematical and algorithmic functions for machine learning. Many developers are comfortable with Python since it’s easier to learn than R.

As a scripting language, Python lets the data analyst easily play around with data sources and ad-hoc data parsing without a formal programming model. With the help of other libraries you can do text mining, vectorize the text data and identify similarities between posts and texts.
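As a sketch of that kind of ad-hoc text work (the posts and helper names below are invented for illustration), the standard library alone is enough to vectorize two short texts as bags of words and compare them by cosine similarity:

```python
import math
from collections import Counter

def vectorize(text):
    """Turn a text into a bag-of-words vector: word -> count."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words vectors (1.0 = identical)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

post1 = vectorize("big data analytics with python")
post2 = vectorize("python for big data analysis")
print(round(cosine_similarity(post1, post2), 2))  # -> 0.6
```

For serious text mining you would reach for a dedicated library, but the point stands: this kind of exploratory parsing takes only a few lines of Python.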

Python also has an OOP model, so having an object-oriented language in your tool kit lets you build structured, modular applications should that be your choice. This can be seen as an advantage over R.
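As a toy sketch of what that buys you (the class and data here are invented for illustration), parsing logic can be packaged into a small, reusable class rather than left as loose script code:

```python
class Column:
    """Wrap one parsed column of raw strings behind a small, testable interface."""

    def __init__(self, name, raw_values):
        self.name = name
        # Parsing happens once, in one place, instead of being scattered ad hoc.
        self.values = [float(v) for v in raw_values]

    def mean(self):
        return sum(self.values) / len(self.values)

ages = Column("age", ["34", "29", "41"])
print(ages.name, round(ages.mean(), 1))  # -> age 34.7
```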

R

R is an extremely rich environment, especially when you get into statistics. Inference, statistical modeling and plotting your data as a bar chart, pie chart or histogram are trivial in R, since it’s built for statistical modeling using vectors and/or matrices.

As R was created by statisticians for statisticians, someone who has a general knowledge of statistics usually finds it exceptionally easy to master. Programmers of other languages also seem to have an easy time learning and using it.

If you’re a data analyst who wants to see data distributions before drawing conclusions, R allows you to visualize outliers and data density. For probabilistic problems, distributions and linear regression, R’s ease of data manipulation using vectors and matrices makes life exceptionally simple.

With R’s statistics-rich library of algorithms, there’s little need to understand the specifics of data types, as there often is with Python. It has a tremendous following and support, especially from the academic and commercial statistics communities, and now the Big Data analytics community.

Python vs. R?

Should you use one over another in Big Data analytics? I think that both are valuable and you should examine specifically what problem you’re trying to solve. Both Python and R need to be in the data scientist’s and data analyst’s tool box, and a skilled Big Data professional should be ready to use either, depending on the problem they’re working on.

A recent survey of data scientists and data miners by KDnuggets found that “R has a solid lead, and was used by about 77 percent of the voters. Python was used by about 32 percent of voters.” When it comes to pay, the data scientists and data analysts who had the highest salaries knew R, according to Dice.

Is R better than Python? For some things. From a systems performance standpoint, it seems that the performance of R and Python is very much the same.

An Alternative: Julia

What is Julia? It’s “a high-level, high-performance dynamic programming language for technical computing.” It naturally has many of the mathematical and statistical libraries found in any high-performance environment. It’s also very extensible: there’s a built-in package manager for adding new external libraries and packages.

Julia is built for speed. Applications using it rather than Python or R have been found to be ridiculously fast. Here are some comparisons from the Julia Language website:

Benchmark    Julia    Python        R
fib           0.91     30.37   411.36
quicksort     1.14     31.89   524.29
mandel        0.85     14.19   106.97
pi_sum        1.00     16.33    15.42

(Lower is better.)

How do programs written in Julia run so fast? Because of its LLVM-based just-in-time (JIT) compiler, which is designed for high performance. Julia is also designed for cloud computing and parallelism, as it provides a number of key building blocks for distributed computation. That makes it flexible enough to support a number of styles of parallelism, and allows users to add more.
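To make the benchmark numbers above concrete, here is a pure-Python version of the tight numeric loop behind the pi_sum row — my reconstruction for illustration, not the official benchmark code. It repeatedly computes a partial sum of 1/k², which converges to π²/6. A JIT compiler like Julia’s can turn exactly this sort of loop into fast machine code, while interpreted Python pays loop overhead on every iteration:

```python
import math

def pi_sum(reps=500, n=10_000):
    """Repeat a partial sum of 1/k^2; the inner loop is the benchmark kernel."""
    total = 0.0
    for _ in range(reps):
        total = 0.0  # each repetition recomputes the sum from scratch
        for k in range(1, n + 1):
            total += 1.0 / (k * k)
    return total

print(abs(pi_sum() - math.pi ** 2 / 6) < 1e-3)  # -> True: close to pi^2/6
```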

Julia is also a very easy language to learn, use and debug. If you have previously done any kind of programming in C#, C, C++, Java, Python, R, etc., learning Julia should be a cakewalk.

MIT has published a number of video tutorials for learning Julia.

Conclusion

Will Julia replace Python or R? Not yet, since some libraries useful in performing Big Data analysis are just not available. However, with greater adoption, it could be the case within three years. After all, technological advances move very rapidly, especially when it comes to Big Data.

Would I recommend Julia for Big Data? Like Python and R, I think it should be a part of every data scientist’s and data analyst’s tool kit.

Comments

  1. BY Ed says:

GitHub has more links: http://svaksha.github.io/julya/

  2. BY mrego says:

    What about the Go language?
    Julia is only a year old, right? Perhaps give it some time before rushing to hype.

  3. BY Matt says:

    My thinking…
R: Use for “niche” statistical tools and techniques such as the Shapiro-Wilk test for normality, Q-Q plots, runs tests, etc.
    Python: Use for data manipulation, web apps, scraping, moving/edit/managing files
    C/C++: Use for “Fast applications”

    I’m not sure why Julia would replace any of those. If I want fast, I go to C, if I want easy or niche items I use R, and I use Python for everything else.

    • BY Rico says:

      MATT,

      Julia is built from the ground up for cloud computing and parallelism/distributed computation.
That would be its advantage over the C and R languages.

    • BY Simon Thompson says:

Matt – you choose Python or R in preference to C/C++ for less performance-critical applications; my guess is that you do that because you are more productive using those tools than you are when you have to revert to C/C++. The performance payoff of C/C++ is sufficient (for you) to make up for the technical difficulty of managing memory and the loss of expressiveness that you incur.

If Julia matures, it will be the case that you will be able to use a language that is as expressive as Python (apart from one thing – comprehensions are not quite as trendy, and commenting is painful as there is no comment-block syntax) but with dependency management via Git (not eggs, which apparently I am the only programmer in the universe to have spent days wrestling with when versions don’t hang together) and no crazy-as-a-baboon-on-smack whitespace-based syntax.

Also, it is ~30x faster on my machine for the (few) things I have done. (Warning: performance is dependent on how you build things; if you write dumb code it will go slow.)

Actually, I found one thing that Julia is very sloooow for – loading JPEGs into IJulia on Windows (the IPython notebook in Julia), because the implementation for rendering JPEGs on Windows uses ImageMagick, and that is slow on PCs.

And you can use Python in Julia, and R (via Rif), although I have not done much of either.

Hadoop support is patchy, and Julia agents are unmanaged by YARN, I think.
