
5. Help me, panda(s)!


09.05.2022 – The last week saw the introduction of both Jupyter and pandas. Let’s focus on Jupyter first! Jupyter is a simple, web-based IDE (integrated development environment) where you can write your Python code and annotate it in so-called markdown cells. The advantage is that you can code and document your entire project at the same time and export it in the end as a readable PDF document, so non-technical project members can follow your thought process. It is basically the electronic version of a notebook and can also include charts, graphs and other visualisations of your data. Moreover, Jupyter can also act as a terminal. This comes in handy if you quickly need to install another Python package or use ‘pwd’ to print your current working directory.
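As a minimal sketch of what that looks like in practice: inside a Jupyter code cell, a leading “!” sends the line to the shell, and “%” lines are Jupyter’s built-in magic commands (the package name here is just an example).

```python
# Lines starting with "!" are passed to the shell from within a code cell,
# "%" lines are Jupyter magic commands.
!pip install pandas   # install another Python package without leaving the notebook
!pwd                  # print the current working directory via the shell
%pwd                  # the equivalent Jupyter magic, which also returns the path
```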

A fancier version of Jupyter is called JupyterLab. In it, you can run windows side by side (e.g. two or more notebooks or even a terminal window), so changes in one take instant effect in the other. This is very handy if you want to keep your notebook clean and therefore outsource certain parts of your code into a separate Python file. In your notebook (i.e. Jupyter/JupyterLab) you just import that Python file and can use all the functions you outsourced for your project. It ensures that non-technical readers of your notebook are not cluttered with extra code cells and can focus on the main message you would like to convey. JupyterLab also comes with a nice dark theme, which is very pleasing if you are working at night because it is easier on your eyes. The Le Wagon teachers recommended JupyterLab and, having seen both, I agree: it has all the functionality of the “normal” Jupyter notebooks but is more comfortable to use.
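As a small, hypothetical example (the module name helpers.py and the function clean_text are made up for illustration), outsourcing code and re-importing it could look like this:

```python
# helpers.py -- a separate Python file kept next to the notebook
def clean_text(text):
    """Strip surrounding whitespace and lowercase a string."""
    return text.strip().lower()
```

In a notebook cell you then simply import and call it, keeping the notebook itself free of helper code:

```python
from helpers import clean_text

clean_text("  Hello Jupyter  ")   # -> 'hello jupyter'
```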

After the introduction to Jupyter, we started using NumPy (Numerical Python), a very powerful library for storing data in the form of multi-dimensional arrays and matrices. The advantage of NumPy lies in the possibility of applying vectorised calculations to an array. For instance, say you have a two-dimensional array of 10,000 rows and 2 columns and you would like to compute the sum of these two columns and store it in a third column: there is no need for a for-loop that iterates over 10,000 rows. NumPy vectorises this operation and applies the sum to all 10,000 rows at once! This might not sound like a big deal at first, but if you have millions of rows and/or columns, it saves a lot of time and is far more computationally efficient. Also, in a team of data scientists who work on a server simultaneously, vectorised operations clog up CPU time far less than for-loops.
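A minimal sketch of that example, with random numbers standing in for real data, might look like this:

```python
import numpy as np

# A stand-in 10,000 x 2 array of random values
data = np.random.rand(10_000, 2)

# Vectorised: sum the two columns element-wise, no Python for-loop required
row_sums = data.sum(axis=1)

# Store the result as a third column -> shape (10000, 3)
data = np.column_stack([data, row_sums])
print(data.shape)   # (10000, 3)
```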

Because a NumPy array can only hold one data type and slicing is only possible via numerical indices (rather than row/column names), pandas was built on top of NumPy. Pandas stands for “panel data” and implies that arrays (in pandas called DataFrames) can hold different types of data such as categories, dates, integers, floats etc. Data can be accessed directly via the column name. The great thing about pandas is that it keeps all the functionality of NumPy, i.e. vectorised operations are still possible! On top of that, pandas can read many different data sources such as Google BigQuery, CSV/XLS, JSON, HTML and more. It is THE main library for data wrangling, even before moving on to fancy model development. Pandas and NumPy can be used seamlessly in Jupyter.
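To illustrate (the column names and the CSV filename below are made up), a DataFrame mixing data types, accessed by column name with a vectorised calculation, could look like this:

```python
import pandas as pd

# A small DataFrame mixing strings, dates, integers and floats
df = pd.DataFrame({
    "city": ["Berlin", "Munich", "Hamburg"],
    "date": pd.to_datetime(["2022-05-01", "2022-05-02", "2022-05-03"]),
    "visitors": [120, 95, 210],
    "revenue": [1500.0, 870.5, 2300.25],
})

# Columns are accessed by name, and the arithmetic stays vectorised (NumPy underneath)
df["revenue_per_visitor"] = df["revenue"] / df["visitors"]

# Reading files works through dedicated readers, e.g. for CSV:
# df = pd.read_csv("sales.csv")
```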

Usually, the Le Wagon timetable focuses on one topic per week (i.e. the two sessions on the weekdays and one on Saturday). This time, we spent the entire week on Jupyter/NumPy/Pandas and will have another half week for the data visualisation part with matplotlib and seaborn. It shows how important these tools are for the aspiring data scientist!

“Smooth seas do not make skillful sailors.” – African proverb