15.07.2022 – This week, we were asked to enter our first competition on Kaggle and put our existing knowledge to the test. Kaggle is a platform that hosts thousands of datasets across all kinds of topics. It was acquired by Google and offers introductory courses on how to tackle machine learning problems in general. Some of the datasets are posted by companies looking for solutions to their problems, and there is even prize money at the deadline for the winning submission! Most of the datasets, however, are used to practise machine learning and to ask for support within the community. The datasets cover the full range of supervised and unsupervised learning, so it is up to the user which kind of machine learning algorithm to apply.
Le Wagon chose the “House Prices – Advanced Regression Techniques” dataset (here). The goal of this task is to predict house prices, the target or y-variable, as accurately as possible. Kaggle provides you with a train and a test set for the features (X_train and X_test) as well as a train set for the target (y_train), but not a test set for the target (y_test). Therefore, once the model is trained on the train set (X_train/y_train) and predictions are generated for the test set (X_test), it is not possible to compare those predictions with the true y_test. The y_test set is kept on Kaggle, and once you upload your predictions, it compares them internally with y_test and assigns a score and a ranking (the lower the score, the better the rank). A stumbling block is that there is only a limited number of uploads within a certain timeframe (around 8 uploads per 24 hours, although I am not sure about the exact figure). This prevents competitors from constantly uploading their predictions, so it is best to evaluate your model on your local machine before uploading the predictions (remember: we are uploading the forecasted house prices based on the model that we built).

This can be done, for instance, by splitting the train set to obtain a third set, called the validation set. You then have a (smaller) X_train/y_train and an additional X_val/y_val set. The nice thing is that you can train your model on the former and then generate predictions on X_val, which can be compared with y_val (essentially: predictions – y_val = error). Of course, the smaller the error, the better your model. However, I found that sometimes one of my models performs better locally than the one before, yet once the predictions are uploaded, the leaderboard score turns out worse than that of the supposedly inferior predictions uploaded earlier. Still, it’s great fun to try out a bunch of algorithms and tweak the model so that the predictions become better and better.
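To make the local-validation idea concrete, here is a minimal sketch in Python with scikit-learn. The file name train.csv and the SalePrice column match the Kaggle download, but the Ridge model and the crude preprocessing (numeric columns only, missing values set to 0) are placeholder assumptions to keep the example short – not the pipeline I actually used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Load Kaggle's training data (features and target in one file)
train = pd.read_csv("train.csv")
X = train.drop(columns=["SalePrice"])   # features
y = train["SalePrice"]                  # target: the sale price we want to predict

# Hold out 20% of the training data as a local validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Keep it simple: numeric columns only, missing values filled with 0
X_train_num = X_train.select_dtypes("number").fillna(0)
X_val_num = X_val.select_dtypes("number").fillna(0)

# Fit on the reduced training set, then score on the held-out validation set
model = Ridge()
model.fit(X_train_num, y_train)
preds = model.predict(X_val_num)

# Local error estimate: the smaller, the better (hopefully also on Kaggle)
rmse = np.sqrt(mean_squared_error(y_val, preds))
print(f"Validation RMSE: {rmse:,.0f}")
```

This way you only spend one of your limited uploads once the local error actually improves.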
Not everyone uploaded their predictions, but the task really showed us what is actually possible and how to approach such a project end to end. We could also build on previous code, which gave us enough time to focus on the intricacies of each model and make the best of it. For this task, I used XGBoost, a powerful tree-based algorithm, with very good results (a rough sketch of what that could look like follows below). It is quite a bit of work to make it perform as well as possible, and in reality, due to time constraints, this might not always be feasible. A solution would be to first shortlist a couple of models that are suitable for the dataset at hand and then focus on hyperparameter tuning, rather than testing every available algorithm on the dataset. So far, this was the most enjoyable part of the bootcamp!
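Continuing from the split above, this is roughly what the XGBoost part could look like. The parameter grid, the 500 trees and the preprocessing are illustrative assumptions rather than the settings I actually ended up with; the point is the workflow: cross-validate on the training portion, check against the local validation set, and only then write the submission file.

```python
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

xgb = XGBRegressor(n_estimators=500, random_state=42)

# Illustrative grid only – a real search would be guided by the data
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.8, 1.0],
}

# 5-fold cross-validation on the (smaller) training portion only
search = GridSearchCV(
    xgb,
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1,
)
search.fit(X_train_num, y_train)

print("Best parameters:", search.best_params_)
print("Best CV RMSE:", -search.best_score_)

# Check the tuned model against the local validation set before uploading
val_rmse = np.sqrt(mean_squared_error(y_val, search.predict(X_val_num)))
print(f"Validation RMSE (tuned XGBoost): {val_rmse:,.0f}")

# Predict on Kaggle's test features and write the submission file
# (the competition expects "Id" and "SalePrice" columns)
X_test = pd.read_csv("test.csv")
X_test_num = (
    X_test.select_dtypes("number")
    .fillna(0)
    .reindex(columns=X_train_num.columns, fill_value=0)  # align with training columns
)
submission = pd.DataFrame({
    "Id": X_test["Id"],
    "SalePrice": search.predict(X_test_num),
})
submission.to_csv("submission.csv", index=False)
```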
“It’s the contest that delights us, and not the victory.” – Blaise Pascal