**07.06.2022 –** Finally, the wait is over! We were all looking forward to the start of the machine learning (ML) sessions… This is where the real fun begins, right? In the coming weeks, we are in for 10 lectures covering the ML fundamentals, data preparation, performance metrics, what’s going on under the hood, how to tune models the right way, the ML workflow, ensemble methods, unsupervised learning, time series and, last but not least, natural language processing (NLP).

Let’s start with the ML fundamentals. It seems that Logistic and Linear Regression are the bread-and-butter business of a data scientist. For this, we were introduced to Scikit-learn, a powerful machine learning library for, you guessed it correctly, Python!! We started with the necessary steps to get useful insights from ML. In a nutshell, ML is about finding a model that extracts patterns from a so-called training set such that its predictions are as accurate as possible on a test set. In other words, the model should be trained on data it has seen and then generalise well to data it has not seen (unseen data). The art is to find a balance between the two. One can build a model that works (nearly) perfectly on the training set but performs poorly on the test set. Likewise, if the model is not trained well enough on the training set, it won’t be able to extract the patterns that are necessary to make predictions on unseen data. This can happen if the chosen model is not adequate for the data set at hand. For instance, suppose we predict the relationship between X (the vector of independent variables, or features) and y (the dependent variable, or target) with a linear model (e.g. y = x, so x = 1 -> y = 1, x = 2 -> y = 2, … , x = n -> y = n), but the relationship is actually non-linear (e.g. y = x², so x = 1 -> y = 1, x = 2 -> y = 4, … , x = n -> y = n²). The model will then ‘underfit’ the data and not generalise well.
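The underfitting example above can be sketched with Scikit-learn. This is just my own illustration (the synthetic quadratic data and variable names are made up, not from the course): a straight line fitted to data that actually follows y = x² scores worse on the test set than a model that can capture the curve.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: the true relationship is quadratic (y = x² plus noise)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(0, 2, size=200)

# Hold out unseen data to check generalisation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A linear model underfits the curved relationship...
linear = LinearRegression().fit(X_train, y_train)

# ...while a degree-2 polynomial model matches the true pattern
quadratic = make_pipeline(
    PolynomialFeatures(degree=2), LinearRegression()).fit(X_train, y_train)

print(f"linear    R² on test: {linear.score(X_test, y_test):.3f}")
print(f"quadratic R² on test: {quadratic.score(X_test, y_test):.3f}")
```

The R² score on the held-out test set makes the underfitting visible: the quadratic pipeline clearly beats the straight line on data neither model has seen.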

The task is to forecast the sales price of a house, where the house surface is the feature and the house price is the target. We are experimenting with Linear Regression, cross-validation and the visualisation of the data. In order to save computational time, which matters even more for huge data sets, we are also looking into learning curves. This approach helps us select the minimum training size that is still big enough to sufficiently train our model. It works a bit like the Pareto principle: you try to achieve 80% of the success (= finding a good model that generalises well) by inputting only 20% of the resources (= using 20% of the data for training). This saves a lot of time, because feeding the model with 100% of the data would increase the computing time by roughly a factor of 5! We are just at the beginning of this complex subject area. I will post about the process here over the next weeks as usual!
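A learning curve like the one described above can be produced with Scikit-learn’s `learning_curve` helper. A minimal sketch, assuming synthetic house data of my own making (the surface range, price factor and noise level are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Invented house data: surface in m² (feature), price (target) with noise
rng = np.random.default_rng(0)
surface = rng.uniform(30, 250, size=(500, 1))
price = 3000 * surface.ravel() + rng.normal(0, 40000, size=500)

# Cross-validated scores at growing fractions of the training data
train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), surface, price,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="r2")

for n, t, v in zip(train_sizes,
                   train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{n:4d} samples  train R²={t:.3f}  validation R²={v:.3f}")
```

Where the validation score flattens out, adding more training samples no longer buys much accuracy, so that plateau point is a sensible minimum training size.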

*“It is a capital mistake to theorize before one has data.”* – Sherlock Holmes