04.06.2022 – We have entered “decision science week”, which means that students are required to solve a real-world business problem. For this, we were advised to use the “Brazilian E-Commerce Public Dataset by Olist” from Kaggle (Link). Olist is a Brazilian marketplace where buyers and sellers meet to exchange goods across the country. The dataset is quite big (~127 MB) and consists of 9 csv files.
The first task was to generate the database schema in order to be able to join the columns that are needed for specific analyses. For instance, the ‘seller’ and ‘customer’ csv files are only keyed by ‘seller_id’ and ‘customer_id’, respectively. This is a problem because the two files cannot be joined directly: they share no common column. Therefore, one has to go through the other files that contain these columns, which makes it possible to join on a column such as ‘order_id’. Another problem is that you don’t want the result to become too big by joining all the single files into one. The reason is that once you write the functions that retrieve the data for the analysis, the loading time grows quickly. This is not a big issue now, but in a live system, loading times are something to be well aware of!
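Here is a minimal sketch of what such a join could look like with pandas. The tiny tables are made up for illustration, the city columns are hypothetical, and which file holds which key (‘customer_id’ in an orders table, ‘seller_id’ in an order items table) is an assumption for this example rather than something taken from the schema itself.

```python
import pandas as pd

# Toy stand-ins for four of the csv files (made-up rows, hypothetical city columns).
sellers = pd.DataFrame({"seller_id": ["s1", "s2"],
                        "seller_city": ["sao paulo", "curitiba"]})
customers = pd.DataFrame({"customer_id": ["c1", "c2", "c3"],
                          "customer_city": ["recife", "sao paulo", "manaus"]})
# Assumption: these two tables carry the keys that link sellers and customers.
orders = pd.DataFrame({"order_id": ["o1", "o2", "o3"],
                       "customer_id": ["c1", "c2", "c3"]})
order_items = pd.DataFrame({"order_id": ["o1", "o2", "o3"],
                            "seller_id": ["s1", "s1", "s2"]})

# Step 1: build a bridge between the two keys via the shared 'order_id' column.
bridge = orders.merge(order_items, on="order_id", how="inner")

# Step 2: attach seller and customer details, keeping only the columns we need
# so the joined table does not grow larger than necessary.
seller_customer = (
    bridge
    .merge(sellers, on="seller_id", how="inner")
    .merge(customers, on="customer_id", how="inner")
    [["order_id", "seller_id", "customer_id", "seller_city", "customer_city"]]
)

print(seller_customer)
```

Selecting only the needed columns at the end is one simple way to keep the joined table small instead of merging every file into one big table.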
Once the schema was ready, we were asked to explore the data and to get familiar with the contents of each of the columns. Next, we investigated what causes a buyer to give a seller a bad review score. This leads to the CEO’s question of how profit margins could be improved, given that bad reviews represent a cost to Olist. In the next step, we were asked to run multivariate regressions. In other words, the question at hand was which of the many features (in statistics: independent variables) are most strongly correlated with the review score (machine learning lingo: target, statistics lingo: dependent variable). It turned out that the highest (negative) correlation exists between ‘wait time’ <-> ‘review score’ as well as between ‘delay_vs_expected’ <-> ‘review score’. This makes sense because the longer you have to wait for your package to arrive, the more annoying it is and the lower the review score for this particular seller will be. For this exercise, we were using “statsmodels”, which is a Python module for running statistical tests. Next, we were introduced to the concept of Logistic Regression. The difference between ‘Logistic’ and ‘Linear’ regression is the character of the target variable, which is a class or category in the former and a continuous variable in the latter case. So for instance, if you have to predict whether the weather tomorrow is hot or cold, you just have 2 categories and you would apply a Logistic Regression model. If you are asked to predict the exact temperature tomorrow, the target is a continuous variable: even if the forecast says 28 degrees C, the actual value is 28.253 or maybe 27.978. A Linear Regression model should be applied in this case.
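To make the regression step on the review score more concrete, here is a rough sketch using the statsmodels formula interface. The data is synthetic and generated on the spot (it is not the Olist data), the column names ‘wait_time’ and ‘review_score’ are assumed spellings, and the coefficients used to generate the scores are arbitrary. It only illustrates the mechanics, not our actual results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the Olist data): scores drop as waits and delays grow.
rng = np.random.default_rng(0)
n = 500
wait_time = rng.uniform(2, 40, size=n)           # days until the package arrives
delay_vs_expected = rng.exponential(3, size=n)   # days later than promised
noise = rng.normal(0, 0.5, size=n)
raw_score = 5 - 0.08 * wait_time - 0.15 * delay_vs_expected + noise  # arbitrary coefficients

df = pd.DataFrame({
    "wait_time": wait_time,
    "delay_vs_expected": delay_vs_expected,
    # Round and clip so the target looks like a 1 to 5 star rating.
    "review_score": np.clip(np.round(raw_score), 1, 5),
})

# Pairwise correlation of each feature with the target.
correlations = df.corr()["review_score"].drop("review_score")

# Multivariate linear regression: both features explain the score at the same time.
model = smf.ols("review_score ~ wait_time + delay_vs_expected", data=df).fit()

print(correlations)
print(model.params)
```

Calling `model.summary()` prints the full regression table, including p-values and confidence intervals for each coefficient.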
Finally, we were given some background information on the operating costs of the business. Olist takes a 10% cut of all sales which are made on the platform and charges sellers 80 BRL (Brazilian Real) per month for using their platform. There are also some IT costs, and all 1 to 3 star reviews are considered a cost to Olist while 4 and 5 star ratings are not (however, they are not a profit either). We were given time to finalise our projects and reuse/repackage as much of our previously written code as possible. This meant that we were encouraged to move a lot of code (functions etc.) into separate Python files (.py files) and just import them into our notebook as needed in order to keep the notebook clean and less cluttered.
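The arithmetic behind these numbers can be sketched in a few lines of plain Python. The 10% cut and the 80 BRL monthly fee come from the description above. The function names, the example seller, and above all the cost per bad review are placeholders invented for this illustration. IT costs are left out.

```python
PLATFORM_CUT = 0.10      # share of each sale kept by Olist (from the text above)
MONTHLY_FEE_BRL = 80     # subscription fee per seller and month (from the text above)


def revenue_from_seller(sales_brl, months_on_platform):
    """Revenue Olist earns from one seller: sales cut plus monthly fees."""
    return PLATFORM_CUT * sales_brl + MONTHLY_FEE_BRL * months_on_platform


def review_costs(review_scores, cost_per_score):
    """Sum the cost of each review; scores missing from cost_per_score cost nothing."""
    return sum(cost_per_score.get(score, 0) for score in review_scores)


# Placeholder costs in BRL for 1, 2 and 3 star reviews (made up for this example).
placeholder_costs = {1: 100, 2: 50, 3: 40}

# Hypothetical seller: 12,000 BRL in sales over 6 months and six reviews.
revenue = revenue_from_seller(sales_brl=12_000, months_on_platform=6)
costs = review_costs([5, 4, 1, 3, 5, 2], placeholder_costs)
profit = revenue - costs

print(revenue, costs, profit)
```

Functions like these are good candidates for a separate .py file that the notebook simply imports.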
At 4 p.m. on Saturday, we were asked to give our presentations. For this, we converted our notebook into an html file with nbconvert, a very cool Python tool that turns your notebook into slides (similar to PowerPoint) and lets you hide the unnecessary sections such as the code that you wrote to generate charts and/or graphs. It was also interesting to see what the other teams focussed on, because this data set is so large that it could take months to really delve into all the columns and extract as much information from them as possible. This was a tough challenge and the hardest part so far in the bootcamp in my view. There is limited guidance from Le Wagon and that’s on purpose! In a real-world situation, nobody will tell you where to look or which area of the data should undergo further inspection. However, the TAs (teaching assistants) were always ready to help. One other benefit was that some of the code in the separate .py files had already been written by Le Wagon staff. This also represents a real-world situation, as you have to get familiar with someone else’s code and make the most of it (don’t reinvent the wheel…). It was great to work on a project in a guided way but have a lot of freedom for implementing ideas at the same time. Every data set is different and the task of a data scientist lies in finding the right approach to get the most useful information out in as little time as possible.
“Those who plan do better than those who do not plan, even though they rarely stick to their plan.” – Winston Churchill