Which Programming Language For Data Science Should An Excel User Learn?
Let’s be honest, you love Excel. You know how to use VLOOKUP, how to create a pivot table, how to split text into columns, how to write macros… you know it all. It’s beautiful and logical and… you’re using it so much that you’ve started to notice its limitations.
Excel crashes when you copy a formula to every cell in a column. You cannot open a dataset with more than 1,048,576 rows or 16,384 columns. You need to write, copy and link stuff again and again…
Programming languages like R and Python address those Data Science challenges and provide a great solution for data processing, data cleaning, transformations, aggregations, data visualisation and Machine Learning modelling.
But here comes the question – which programming language for Data Science should you learn – R or Python?
Benefits Of R And Python Over Excel
Programming languages for Data Science like R and Python have some clear benefits over Excel:
- R and Python are faster at processing large volumes of data.
- R and Python can handle very large datasets.
- R and Python can create reusable workflows and reduce repetitive work on routine tasks.
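As a minimal sketch of what a reusable workflow can look like, the Python snippet below sums a value column per group while reading one row at a time, so the number of rows is not capped by a spreadsheet limit. The column names (`region`, `sales`), the sample data and the helper `total_by_group` are hypothetical and invented for illustration.

```python
import csv
import io
from collections import defaultdict


def total_by_group(rows, group_col, value_col):
    """Sum value_col for each distinct value of group_col, one row at a time."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_col]] += float(row[value_col])
    return dict(totals)


# Hypothetical sample data standing in for a file too large for Excel.
sample_csv = io.StringIO(
    "region,sales\n"
    "North,100.5\n"
    "South,200\n"
    "North,50\n"
)

# csv.DictReader yields one row at a time, so only the running totals
# are kept in memory, not the whole file.
totals = total_by_group(csv.DictReader(sample_csv), "region", "sales")
print(totals)
```

To run the same function on a real file, pass it `csv.DictReader(open("sales.csv", newline=""))` instead, where `sales.csv` is a placeholder file name. Once written, the function can be reused on next month's file without any copying or relinking.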
Which Programming Language Is More Popular?
Both R and Python are among the top trending, most wanted and top paying programming languages according to the Stack Overflow 2018 Developer Survey results. Python, however, attracts more attention because it is not only used for Data Science but also as a general-purpose programming language with a wide range of applications. All the big technology companies use R, Python or both for their Data Science projects.
When Is R Better?
R is a free, open-source programming language created by statisticians. It’s used for data exploration, data visualisation and nearly any type of data analysis, thanks to the large number of available packages. R is great for people who love statistics and data visualisation.
When Is Python Better?
Python is a general-purpose programming language with a wide range of applications. It emphasises usability and lets you write clear code in fewer lines. It is often used when data analysis tasks and algorithms need to be implemented for production use. Most Deep Learning libraries, such as those used for Convolutional Neural Networks, are available in Python. Python is great for Deep Learning models and for deploying algorithms to production, and it’s preferred by developers who like writing clear code.
Which Language To Choose?
You’ve guessed right: you most likely need to learn both. So, the question is which one you should learn first. That depends on your use case and your background. Statisticians might find R the logical choice. For a developer, however, choosing Python might be a no-brainer because of its clear syntax. Excel users with no statistical or development background need to be prepared for a steep learning curve. In my experience, the biggest challenge is getting used to not looking at your data all the time.
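To make the point about not looking at your data more concrete, here is a small Python sketch of how you might inspect a table in code instead of scrolling through it. It assumes the pandas library is installed. The data is hypothetical and invented for illustration.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "sales": [100.0, 200.0, 50.0, 75.0],
})

n_rows, n_cols = df.shape          # size of the table
first_rows = df.head(2)            # the first two rows only
summary = df["sales"].describe()   # count, mean, min, max, quartiles

print(n_rows, n_cols)
print(first_rows)
print(summary)
```

Instead of seeing every cell, you ask for the size of the table, a few rows and a numeric summary, and only look closer where something seems off.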
Where Do You Start – IDE And Key Packages
First, you need to download and install the language of your choice and an IDE (integrated development environment) where you are going to write and run your code.
What to install | R | Python |
---|---|---|
Current version of the language | R 3.5.0 <https://cran.r-project.org/> | Python 3.6.5 <https://www.python.org/getit/> |
IDE | RStudio | Jupyter Notebook, Spyder, PyCharm, etc. |
Second, you need to get your head around which package does what. Here is a summary of some of the most frequently used packages in R and Python:
Key libraries | R | Python |
---|---|---|
Data manipulation & computing | dplyr, data.table | Pandas, NumPy, SciPy |
Text mining | stringr | string |
Time series | zoo, xts | Prophet |
Visualisation and reporting results | ggvis, lattice, ggplot2, shiny, RMarkdown | Matplotlib, Seaborn, Plotly |
Machine learning | caret, randomForest, nnet | scikit-learn, Keras, TensorFlow, NLTK |
Data scraping | rvest | Scrapy |
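To give a feel for how the data manipulation packages in the table relate to familiar Excel features, here is a rough Python sketch using pandas. It shows one possible equivalent of a VLOOKUP, a pivot table and text to columns. The tables and column names are hypothetical and invented for illustration.

```python
import pandas as pd

# Hypothetical tables, invented for illustration.
orders = pd.DataFrame({
    "customer": ["Ann Lee", "Bob Ray", "Ann Lee"],
    "product_id": [1, 2, 2],
    "quantity": [3, 1, 2],
})
products = pd.DataFrame({
    "product_id": [1, 2],
    "price": [10.0, 25.0],
})

# VLOOKUP equivalent: pull each product's price into the orders table.
merged = orders.merge(products, on="product_id", how="left")
merged["revenue"] = merged["quantity"] * merged["price"]

# Pivot table equivalent: total revenue per customer.
revenue_by_customer = merged.pivot_table(
    index="customer", values="revenue", aggfunc="sum"
)

# Text-to-columns equivalent: split the name at the first space.
merged[["first_name", "last_name"]] = merged["customer"].str.split(
    " ", n=1, expand=True
)

print(revenue_by_customer)
print(merged[["first_name", "last_name"]])
```

Each step is one line that can be rerun on new data, which is where the saving over manual copying and linking comes from. R offers comparable operations through dplyr and data.table.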
Third, you need to choose the training course or tutorial that is best suited to your learning pace. Popular platforms with great learning resources are:
- DataCamp
- Stack Overflow
- Coursera
- Google’s Machine Learning Crash Course
- GitHub
- Quora
- R-bloggers
- Kaggle
And finally, here are my 3 tips for getting started with data science programming:
- Forget about Excel.
- Learn by doing. Break down your task into small chunks and try to solve them one by one.
- Use the community knowledge.