Machine Learning Algorithms: Supervised Learning
2023-08-04 9:11
Edited by Pier Giuseppe Giribone
The previous article in this column provided a definition of Machine Learning and highlighted the main differences in operation between a traditional algorithm and one designed according to modern machine learning paradigms.
This article and the following ones will focus on the most common classification criteria for machine learning algorithms. We'll begin by discussing how these can be divided based on the type of learning being designed.
Machine learning systems can be classified according to the amount and type of supervision they receive during training.
The literature typically recognizes four main categories: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. Let's begin with the fundamental characteristics that distinguish the first type, which is also the most widely used.
In supervised learning, the training set used to train the algorithm includes the desired solutions, called labels. This learning method closely resembles the way humans learn from experience: a fitting metaphor is that humans acquire knowledge by solving problems.
Consider the problem-solving process described by the following logical flow:
- Select an exercise that poses a problem. Apply current knowledge to find a solution, then compare the resulting answer with the correct one provided.
- If the answer is incorrect, revise the current knowledge.
- Repeat the two steps above for every exercise.
If we draw an analogy between this human learning process and the training of a machine learning method, the exercises and their solutions correspond to the training data, and the knowledge being developed corresponds to the model. The key requirement for a supervised learning method is that solutions to the proposed problems are available. Since the knowledge will be built on these solutions, they must be correct: it is essential that a supervisor provides a correct and unbiased training dataset.
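The exercise-and-correction loop described above can be sketched in a few lines of Python. This is only an illustration, assuming a perceptron-style update rule, which the article does not prescribe; the toy data, the function names and the learning rate are all hypothetical.

```python
# Toy training set: each "exercise" is an (input, correct answer) pair.
# Hypothetical data: the correct answer is 1 only when both inputs are 1.
training_set = [
    ((0.0, 0.0), 0),
    ((0.0, 1.0), 0),
    ((1.0, 0.0), 0),
    ((1.0, 1.0), 1),
]

def predict(weights, bias, x):
    """Apply the current knowledge (weights and bias) to one exercise."""
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation > 0 else 0

def train(examples, learning_rate=0.1, max_passes=100):
    """Perceptron-style learning: revise the knowledge only after a wrong answer."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(max_passes):
        mistakes = 0
        for x, correct in examples:              # select an exercise
            answer = predict(weights, bias, x)   # apply current knowledge
            error = correct - answer             # compare with the solution
            if error != 0:                       # wrong answer: revise knowledge
                weights = [w + learning_rate * error * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * error
                mistakes += 1
        if mistakes == 0:                        # every exercise answered correctly
            break
    return weights, bias

weights, bias = train(training_set)
predictions = [predict(weights, bias, x) for x, _ in training_set]
print(predictions)
```

In this sketch the "knowledge" is just two weights and a bias, and the supervisor's role is played by the labels stored next to each input.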
A typical supervised learning task is classification. The spam filter is a good example: it is trained with examples of regular and spam emails (the classes) and must learn from these how to classify new ones.
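As a rough illustration of the spam filter idea, the following sketch trains a very small naive Bayes classifier on word counts. Naive Bayes is one possible technique chosen here for brevity, not a method named in the article; the six example emails, the class names and the helper functions are hypothetical.

```python
import math
from collections import Counter

# Hypothetical labelled emails: (text, class).
training_emails = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("limited offer win money", "spam"),
    ("meeting agenda for monday", "regular"),
    ("please review the attached report", "regular"),
    ("lunch on monday with the team", "regular"),
]

def train(emails):
    """Count how often each word appears in each class."""
    word_counts = {}
    class_counts = Counter()
    for text, label in emails:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    return word_counts, class_counts, vocabulary

def classify(text, model):
    """Return the class with the highest log-probability (Laplace smoothing)."""
    word_counts, class_counts, vocabulary = model
    total_emails = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_emails)
        words_in_class = sum(word_counts[label].values())
        for word in text.split():
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1)
                                  / (words_in_class + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(training_emails)
print(classify("win free money", model))
print(classify("monday meeting report", model))
```

A real filter would need far more examples and a more careful treatment of the text, but the structure is the same: labelled examples in, a rule for classifying new emails out.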
Another typical task for a supervised algorithm is to predict a target numerical value, such as the price of a car, given a set of features (for example: mileage, brand, wear and tear…) called predictors. This task is called regression. To train the system, we need an extensive set of known examples that includes both the predictors and their labels (i.e., the prices).
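To make the regression task concrete, here is a minimal sketch that fits a straight line to car prices using a single predictor (mileage) by ordinary least squares. The figures are invented for illustration, and a single-predictor linear model is a simplifying assumption.

```python
# Hypothetical used-car data: mileage in thousands of km (predictor)
# and sale price in euros (label).
mileage = [20.0, 40.0, 60.0, 80.0, 100.0]
price = [18200.0, 15900.0, 14100.0, 11800.0, 10000.0]

def fit_line(xs, ys):
    """Ordinary least squares for the model: price = intercept + slope * mileage."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(mileage, price)

# Estimate the price of a car that is not in the training data.
predicted_price = intercept + slope * 50.0
print(slope, intercept, predicted_price)
```

The negative slope learned from the examples expresses the relationship the algorithm has extracted: in this toy dataset, price falls as mileage rises.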
In a credit context, an example of binary (dichotomous) classification is a machine learning algorithm trained on a dataset of significant capital ratios (features), each associated with the information of interest, such as whether the company went bankrupt or not (labels). Once trained, it provides useful indications of the potential default of new applicants: given the capital ratios of a new company, the algorithm estimates the class to which that company belongs (default yes/no).
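The default yes/no example might be sketched as follows, assuming logistic regression trained by stochastic gradient descent as the classifier. The article does not specify a model; the two ratios, the eight companies and all parameter values are hypothetical.

```python
import math

# Hypothetical training data: (equity-to-assets ratio, debt-to-equity ratio)
# with label 1 if the company defaulted and 0 otherwise.
companies = [
    ((0.60, 0.5), 0),
    ((0.55, 0.8), 0),
    ((0.45, 1.0), 0),
    ((0.50, 0.7), 0),
    ((0.15, 3.5), 1),
    ((0.10, 4.0), 1),
    ((0.20, 3.0), 1),
    ((0.05, 5.0), 1),
]

def default_probability(weights, bias, ratios):
    """Logistic model: estimated probability that the company defaults."""
    z = bias + sum(w * r for w, r in zip(weights, ratios))
    return 1.0 / (1.0 + math.exp(-z))

def train(data, learning_rate=0.1, epochs=2000):
    """Stochastic gradient descent on the logistic (cross-entropy) loss."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for ratios, label in data:
            error = default_probability(weights, bias, ratios) - label
            weights = [w - learning_rate * error * r
                       for w, r in zip(weights, ratios)]
            bias -= learning_rate * error
    return weights, bias

weights, bias = train(companies)

# Query the trained model with the ratios of two new applicants.
p_solid = default_probability(weights, bias, (0.50, 0.6))
p_weak = default_probability(weights, bias, (0.08, 4.5))
print("default yes" if p_solid > 0.5 else "default no")
print("default yes" if p_weak > 0.5 else "default no")
```

The 0.5 threshold used to turn the probability into a yes/no answer is itself a design choice; in practice it would be set according to the cost of each type of error.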
The previous concept can be extended to a multi-class classification context. In this case, the machine learning algorithm will not provide a simple binary "default yes/default no" indication, but will instead output a membership level based on an ordered scale of values. This classification methodology is similar to the procedure used by rating agencies, which assign a creditworthiness level to the company under consideration.
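A multi-class version can be sketched with a k-nearest-neighbours vote, one of several techniques that handle more than two classes. The rating classes "A", "B" and "C", the ratios and the function name are hypothetical, and this simple vote treats the classes as plain categories without exploiting their ordering.

```python
import math
from collections import Counter

# Hypothetical rated companies: (equity-to-assets, debt-to-equity) -> rating class.
rated_companies = [
    ((0.60, 0.5), "A"), ((0.55, 0.7), "A"), ((0.58, 0.6), "A"),
    ((0.35, 1.5), "B"), ((0.30, 1.8), "B"), ((0.33, 1.6), "B"),
    ((0.10, 4.0), "C"), ((0.08, 4.5), "C"), ((0.12, 3.8), "C"),
]

def assign_rating(ratios, data, k=3):
    """Majority vote among the k training companies closest to the new one."""
    nearest = sorted(data, key=lambda item: math.dist(item[0], ratios))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(assign_rating((0.57, 0.55), rated_companies))
print(assign_rating((0.32, 1.70), rated_companies))
print(assign_rating((0.09, 4.20), rated_companies))
```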
The purpose of a statistical regression is to estimate a possible functional relationship between the dependent variable (the label) and the independent variables (the features).
There are therefore a large number of examples of potential applications in the banking sector: the analysis of the term structures of interest rates in actuarial science, the analysis of historical series of asset returns in econometrics, the study of the potential forward-looking trend of a price as a function of the release of new macroeconomic data in trading, the reconstruction and regularization of volatility surfaces for a more reliable estimate of the value of an option in quantitative finance, and so on.
Looking back at the previous examples, in supervised learning each training dataset consists of pairs of inputs and correct outputs, where the correct output is what the model is expected to produce when given the corresponding input:
{input, correct output}
Training in supervised learning consists of a series of revisions to the model that reduce the difference between the correct output and the output the model produces for the same input.
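The idea of successive revisions that shrink the gap between the model's output and the correct output can be illustrated with gradient descent on a mean squared error. This is a sketch under stated assumptions: a linear model, an arbitrarily chosen learning rate, and four hypothetical {input, correct output} pairs generated by the rule y = 2x + 1.

```python
# Hypothetical {input, correct output} pairs following y = 2x + 1.
pairs = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

def mean_squared_error(w, b, data):
    """Average squared difference between model output and correct output."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def revise(w, b, data, learning_rate=0.05):
    """One revision: move w and b against the gradient of the error."""
    n = len(data)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
    return w - learning_rate * grad_w, b - learning_rate * grad_b

w, b = 0.0, 0.0
history = [mean_squared_error(w, b, pairs)]
for _ in range(500):
    w, b = revise(w, b, pairs)
    history.append(mean_squared_error(w, b, pairs))

print(history[0], history[-1])  # error before and after training
print(w, b)                     # learned parameters
```

Each call to `revise` is one of the revisions mentioned above; the `history` list records how the difference from the correct outputs shrinks as they accumulate.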
A well-trained model produces a result as close as possible to the correct output for a given set of inputs. Both classification and regression are types of problems addressed using this type of learning.
Classification determines which group the input data belongs to, so the "correct output" is a category. In contrast, regression predicts values, so the "correct output" is a numerical value.
Before concluding, it is important to clarify the difference between an attribute and a feature.
As a general rule, in Machine Learning an attribute is a data type (e.g., mileage), while a feature can have different meanings depending on the context, but generally refers to an attribute together with its value (e.g., mileage = 12,000). However, it is common practice to use the two words interchangeably.
The next article will focus on analyzing the fundamental characteristics that distinguish machine learning algorithms whose learning is unsupervised.