29.06.2022 – In the past weeks, we have been continuously building up our knowledge of machine learning. There is so much to learn, and depending on how deep you would like to go down the rabbit hole, it can be infinitely complex. In my view, what makes it so complex is the overlap of different fields of study; without a similar level of knowledge in each of them, overall progress is limited. In other words, it’s a chicken-and-egg problem. Things we have to know include:
- Programming
- Statistics
- Maths (Linear Algebra and Calculus)
Being “just” good at one or two doesn’t cut it, so you have to brush up on the other skills too. The moment you get good at one thing, let’s say programming, you have to upgrade your skills in the other two areas accordingly. So it’s not studying three different things for three different subjects, but three different things for the same subject! On top of that, it doesn’t hurt to have some domain knowledge. As a meteorologist, I would roughly know what causes a hurricane, and I would build my model with the respective independent variables (or in machine learning language: features) in order to forecast the probability of a hurricane occurring. It would be much harder for a novice to decide which features are important: is it five days of sunshine or five days of rain beforehand, or do hurricanes only occur if the temperature stays above 20 degrees Celsius?
Where there is no domain knowledge, we learned how to select features according to their correlation with the dependent variable (or label) and drop all non-relevant features that don’t contribute much to the forecast but consume a lot of unnecessary computing resources. We then moved on to familiarise ourselves with the nuts and bolts of machine learning algorithms. Each algorithm has a loss function that needs to be reduced. However, there is no point in reducing it to zero, which would mean the model fits the training data perfectly. The reason is the so-called bias-variance tradeoff. If you train your model to fit your training data perfectly (basically totally “overfit” it), there is a high chance that it will make very bad forecasts on new data. In this scenario, you will have low bias, virtually zero, but high variance. If you do the opposite, you will “underfit” your model, meaning that the algorithm learns too little from your training set to reliably make forecasts in real life. Thus, the loss function has to be reduced such that it strikes a good balance between the bias on the training set and the variance on the test set/real data. To achieve this, we were introduced to various solvers that can be used while training the model, and to Gradient Descent, a method for finding the minimum of a given function (in this case, the loss function).
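To make the feature-selection step a bit more concrete, here is a minimal sketch in pandas. The weather data and column names are made up purely for illustration; the idea is simply to keep the features whose absolute correlation with the label exceeds a chosen threshold and drop the rest.

```python
import pandas as pd

# Hypothetical weather dataset; columns and values are invented for illustration.
df = pd.DataFrame({
    "temperature": [28, 31, 25, 29, 33, 27],
    "pressure":    [1002, 995, 1010, 998, 990, 1008],
    "humidity":    [80, 85, 60, 78, 90, 65],
    "hurricane":   [0, 1, 0, 1, 1, 0],   # the label (dependent variable)
})

# Absolute correlation of every feature with the label.
correlations = df.corr()["hurricane"].drop("hurricane").abs()

# Keep only features above a chosen threshold; everything else gets dropped.
threshold = 0.5
selected_features = correlations[correlations > threshold].index.tolist()
print(selected_features)
```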
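And here is a rough sketch of Gradient Descent itself, fitting a straight line by repeatedly stepping the parameters against the gradient of the mean squared error (the loss function). The toy data, learning rate and number of epochs are arbitrary choices for illustration, not a recipe.

```python
import numpy as np

# Toy data: y is roughly 3*x + 2 plus some noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)
y = 3 * x + 2 + rng.normal(0, 1, 50)

# Parameters of the line we are fitting, starting from zero.
w, b = 0.0, 0.0
learning_rate = 0.01

for epoch in range(1000):
    y_pred = w * x + b
    error = y_pred - y
    loss = np.mean(error ** 2)          # mean squared error = the loss function
    # Gradients of the loss with respect to w and b.
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Step "downhill": move the parameters against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w ≈ {w:.2f}, b ≈ {b:.2f}, final loss ≈ {loss:.3f}")
```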
We then moved on to more advanced algorithms like Support Vector Machines (SVM) and applied them to the famous Titanic data set. Finally, to automate as much as possible of the work it takes to find an optimal model, we built pipelines. Pipelines create a seamless workflow and take care of, for instance, scaling, one-hot encoding or ordinal encoding of specific columns in your dataset. They also come in handy if you are working in a team and want to export the pipeline for others to use later on. Once you have built one, it turns your raw data into a dataset that is ready for input into a model of your choice, all with one click in your Jupyter notebook!
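As a sketch of what such a pipeline could look like in scikit-learn: a ColumnTransformer handles imputation, scaling and one-hot encoding per column type, and the whole thing is chained with an SVM classifier so preprocessing and model live in one object. The Titanic-style column names and the file path are assumptions for illustration; adjust them to your actual dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Assumed Titanic-style columns (adjust to your dataset).
numeric_features = ["Age", "Fare"]
categorical_features = ["Sex", "Embarked", "Pclass"]

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),   # fill missing ages/fares
    ("scaler", StandardScaler()),                    # scale to zero mean, unit variance
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

# Full pipeline: preprocessing plus model in a single object.
model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("svm", SVC(kernel="rbf", C=1.0)),
])

# Hypothetical usage:
# df = pd.read_csv("titanic.csv")
# model.fit(df[numeric_features + categorical_features], df["Survived"])
#
# To share the fitted pipeline with teammates:
# import joblib
# joblib.dump(model, "titanic_pipeline.joblib")
```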
“Learning never exhausts the mind.” – Leonardo da Vinci