When it comes to machine learning and data science, there are so many language options to choose from. Data scientist Jean-Francois Puget does some analysis to decide which one is best.
What programming language should you learn to get a machine learning or data science job? That’s the silver bullet question. I could provide my own answer to it and explain why, but I’d rather look at some data first. After all, this is what machine learners and data scientists should do: look at data, not opinions.
So, let’s look at some data. I will use the trend search available on Indeed. It looks for occurrences over time of selected terms in job offers, and gives an indication of what skills employers are seeking.
(Note, however, that it is not a poll of which skills are actually in use. Rather, it is a leading indicator of how skill popularity evolves.)
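To make concrete what a trend search of this kind measures, here is a minimal sketch in Python. It is not Indeed's implementation, which is not described here. The function name `term_share_by_period`, the whole-word matching rule, and the four sample postings are all assumptions made for illustration; the sample postings are placeholders, not real job offers.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: keeps letters plus '+' and '#' so that "c++" stays one token.
TOKEN = re.compile(r"[a-z+#]+")


def term_share_by_period(postings, terms):
    """Return {period: {term: fraction of postings in that period mentioning the term}}.

    `postings` is an iterable of (period, text) pairs. A term counts as mentioned
    when it appears as a whole token, so "R" does not match "or" and "Java" does
    not match "JavaScript". Only single-word terms are supported in this sketch.
    """
    totals = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for period, text in postings:
        totals[period] += 1
        tokens = set(TOKEN.findall(text.lower()))
        for term in terms:
            if term.lower() in tokens:
                hits[period][term] += 1
    return {
        period: {term: hits[period][term] / totals[period] for term in terms}
        for period in sorted(totals)
    }


# Placeholder postings, invented for illustration only.
postings = [
    ("2015", "Machine learning engineer, Python and Java"),
    ("2015", "Data scientist, R and SQL"),
    ("2016", "Machine learning engineer, Python and Scala"),
    ("2016", "Data scientist, Python or R"),
]
shares = term_share_by_period(postings, ["Python", "R", "Java"])
```

Each value in `shares` is the fraction of that period's postings that mention the term, which is the kind of quantity a trend chart plots over time.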
I also included Python and R, which we know are popular for machine learning and data science, as well as Scala, given its link to Spark, and Julia, which some think is the next big thing.
Running this query, we get the data we are looking for.
When we focus on machine learning, we get similar data.
What can we derive from this data?
First of all, we see that one size does not fit all. A number of languages are fairly popular in this context.
Second, there is a sharp increase in popularity for all of these languages, reflecting the increased interest in machine learning and data science over the last few years.
Third, Python is the clear leader, followed by Java, then R, then C++.
Python’s lead over Java is increasing, while the lead of Java over R is decreasing. I must admit, I was surprised to see Java in second place; I was expecting R instead.
Fourth, Scala’s growth is impressive. It was almost non-existent three years ago, and is now in the same ballpark as more established languages. This is easier to spot when we switch to the relative view of the data.
Fifth, Julia’s popularity is not anywhere near the others, but there has definitely been an increase in recent months.
Will Julia turn into one of the popular languages for machine learning and data science? Time will tell.
If we ignore Scala and Julia in order to be able to zoom in on the other languages’ growth, then we can confirm that Python and R are growing faster than general purpose languages.
It may be that R’s popularity will pass that of Java soon, given the sheer difference in growth rate.
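The difference between the absolute and relative views can be sketched in a few lines of Python. This assumes that "relative" means percentage change against each series' own starting value, which is one plausible reading and not a confirmed description of how Indeed computes it. The function name `relative_growth` and the two sample series are hypothetical placeholders, not figures from the charts.

```python
def relative_growth(series):
    """Percent change of each point relative to the first non-zero value in the series."""
    base = next((value for value in series if value != 0), None)
    if base is None:
        # An all-zero series has no meaningful baseline; report no growth.
        return [0.0 for _ in series]
    return [(value - base) / base * 100.0 for value in series]


# Placeholder series, invented for illustration only.
established = [2.0, 3.0, 4.0]   # starts high, grows steadily
newcomer = [0.5, 1.0, 2.0]      # starts low, grows quickly

established_growth = relative_growth(established)  # [0.0, 50.0, 100.0]
newcomer_growth = relative_growth(newcomer)        # [0.0, 100.0, 300.0]
```

In absolute terms the established series stays ahead throughout, but in relative terms the newcomer grows three times as fast. That is why a fast-growing language is easier to spot in the relative view.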
When we focus on deep learning with this query, the data is quite different:
There, Python is still the leader, but C++ is now second, then Java, with C in fourth place. R is only in fifth place. There is clearly an emphasis on high-performance computing languages here.
Java is growing fast though, and could reach second place soon. R isn’t going to be near the top any time soon. What surprises me is the absence of Lua, although it is used in one of the major deep learning frameworks, Torch. Julia isn’t present either.
The answer to the original question should now be clear. Python, Java, and R are the most popular skills when it comes to machine learning and data science jobs. If you want to focus on deep learning rather than machine learning in general, then C++, and to a lesser extent C, are also worth considering.
Remember, however, that this is only one way of looking at the problem. You may get a different answer if you are looking for a job in academia, or if you just want to have fun learning about data science and machine learning during your spare time.
What about my personal answer? Besides having support from many top machine learning frameworks, Python is a good fit for me because I have a computer science background. I would also feel comfortable with C++ for developing new algorithms, given that I’ve programmed in that language for most of my professional life.
But this is me, and people with different backgrounds may feel better with another language. A statistician with limited programming skills will certainly prefer R. A strong Java developer can stay with their favourite language, as there are significant open-source projects with Java APIs.
Therefore, my advice would be to read other blogs discussing the same question before investing significant time into learning a language.
Jean-Francois Puget is currently the technical leader for IBM machine learning and optimisation offerings. A data scientist by trade, Puget has more than 25 years’ software experience.
A version of this article originally appeared on the IBM developerWorks blog.