Microsoft Excel is possibly one of the most popular programs that professionals of all stripes use to crunch numbers and analyze data. Others also use tools like SAS, SPSS, and other statistical software packages that they came across during their studies. In the data science field, however, these programs have limitations that will keep the data scientists of the future from capitalizing on all the insights the sheer mass of data being created has to offer.
Why Excel won’t be your best friend in data science
When it comes to Excel, Quartz explains, “Excel cannot handle datasets above a certain size, and does not easily allow for reproducing previously conducted analyses on new datasets.”
And when it comes to proprietary statistical software like SAS, Quartz tells us “…that they were developed for very specific uses, and do not have a large community of contributors constantly adding new tools.”
To move beyond the widespread but limited tools described above, three technologies are said to be the bread and butter of the discipline: R, Python, and Hadoop. The first two are programming languages, while Hadoop is a framework for storing and processing large datasets across clusters of machines. While the IE Data Science Bootcamp does not teach Hadoop, the curriculum covers both R and Python. As part of the schedule, you’ll grasp core concepts in both programming languages.
R is said to be an absolute must in data science, with some even calling it the “golden boy.” What makes R so alluring? Around since 1992, R “was developed explicitly for data analysis by statisticians looking for an open-source solution that could replace expensive legacy systems like SAS and MATLAB.”
Another essential thing to know about R is that it’s often described as a procedural programming language. For those not familiar with the different classes of programming languages, the adjective gives us a clue: “it works by breaking down a programming task into a series of steps, procedures, and subroutines.”
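To make the procedural idea concrete, here is a minimal sketch. It is written in Python rather than R so that all the examples in this article share one language, and the function names (`read_values`, `compute_mean`, `describe`) are made up for illustration. The task is split into small steps, each one a plain function, and the steps are then run in order.

```python
# Procedural style: the task is broken into steps, each a plain function.
# All names here are hypothetical, chosen only for this illustration.

def read_values(raw_text):
    """Step 1: turn a comma-separated string into a list of numbers."""
    return [float(item) for item in raw_text.split(",")]


def compute_mean(values):
    """Step 2: average the numbers."""
    return sum(values) / len(values)


def describe(raw_text):
    """Step 3: run the steps in order and build a one-line report."""
    values = read_values(raw_text)
    mean = compute_mean(values)
    return f"n={len(values)}, mean={mean:.2f}"


print(describe("2, 4, 6, 8"))  # prints: n=4, mean=5.00
```

Each function does one job and hands its result to the next, which is the “series of steps, procedures, and subroutines” the quote describes.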
If you don’t have any previous coding experience, Python may be more your speed. That’s because, with Python, you typically don’t have to write as much code as with many other programming languages. Python is also behind something you probably use every day: Google is one of the best-known companies that relies heavily on the language.
Python’s raison d’être since 1989 has revolved around making code efficient and readable. In contrast to R, Python falls under the object-oriented programming category, like Java, C++, or Scala, meaning that “…it groups data and code into objects that can interact with and even modify one another.”
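A small sketch may help show what “objects that can interact with and even modify one another” looks like in practice. The `Dataset` class, its `absorb` method, and the numbers below are hypothetical, invented purely for illustration.

```python
class Dataset:
    """Hypothetical class: bundles data with the code that works on it."""

    def __init__(self, name, values):
        self.name = name
        self.values = list(values)

    def mean(self):
        """Average of the values currently held by this object."""
        return sum(self.values) / len(self.values)

    def absorb(self, other):
        """Take over another Dataset's values, leaving the other one empty."""
        self.values.extend(other.values)
        other.values.clear()


january = Dataset("january", [10, 20])
february = Dataset("february", [30, 40])

# One object interacts with, and modifies, another.
january.absorb(february)

print(january.mean())        # prints: 25.0
print(len(february.values))  # prints: 0
```

The data (`values`) and the code that operates on it (`mean`, `absorb`) live together in one object, in contrast to the step-by-step style described for R.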
Both R and Python are free, which is undoubtedly advantageous for potential data scientists of all stripes and income brackets. Their dedicated communities of users have built packages and tools specific to each programming language, giving you a tremendous amount of resources for your particular needs.
R’s analysis-oriented community has developed open-source packages for specific complex models that a data scientist would otherwise have to build from scratch. R also emphasizes quality reporting with support for clean visualizations and the Shiny framework for creating interactive web applications.
R also works on Windows, macOS, and UNIX, so your computer’s operating system won’t limit you. As for the type of work it suits best, Quartz says, “R is good for ad hoc analysis and exploring datasets.” But R’s most distinctive advantage is that proficiency in it is practically a guarantee of finding work.
Python is often the programming language of choice when the time comes to create a fast-access application, because of its high level of performance. It’s also easier to learn because “it is a more general programming language: For those interested in doing more than statistics, this comes in handy for building a website or making sense of command-line tools. The way Python works reflects the way computer programmers think.” Quartz also says, “Python is better for data manipulation and repeated tasks.”
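As a rough illustration of “data manipulation and repeated tasks,” the sketch below wraps one small analysis in a function and reruns it unchanged on a second dataset. It uses only Python’s standard library. The `summarize` function and the sales figures are made up for this example and do not come from Quartz or any real dataset.

```python
import csv
import io
import statistics


def summarize(csv_text, column):
    """Return the count, mean, and max of one numeric column of CSV text."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in rows]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "max": max(values),
    }


# Made-up sample data, for illustration only.
sales_q1 = "region,revenue\nnorth,100\nsouth,300\n"
sales_q2 = "region,revenue\nnorth,150\nsouth,250\neast,200\n"

# The same analysis is repeated on each dataset without rewriting anything.
for label, data in [("Q1", sales_q1), ("Q2", sales_q2)]:
    print(label, summarize(data, "revenue"))
```

Because the analysis lives in a function, applying it to next quarter’s file is a one-line change, which is the kind of reproducibility the earlier Excel quote says spreadsheets make difficult.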
At the end of the day, while there has long been a debate over which one is better, the language of choice for data analysis doesn’t matter because many tasks that were once associated exclusively with one can be done using both programming languages.
What should be your rule of thumb?
The choice of programming language will depend on what your colleagues are using, so if everyone on your team gravitates towards Python, then follow suit.