AI Atlas #10:
Degrees of Supervision
Rudina Seseri
This week, I am covering the degree of supervision in machine learning model training. Degree of supervision refers to the level of human intervention required during the training phase of a machine learning algorithm and is a key decision a data scientist and/or machine learning engineer must make in the early days of product building.
Selecting the appropriate degree of supervision is as important as selecting the AI technique itself, because it affects the accuracy, efficiency, trustworthiness, and robustness of the resulting model. As an AI investor, this is a skill set I look for in founders!
🗺️ What are Degrees of Supervision?
Degree of supervision refers to the level of human intervention required during the training phase of a machine learning model. The training phase is the process of teaching a model to make predictions and classify data based on the characteristics or attributes of the data. This data can be labeled, unlabeled, or a hybrid of both. Labeled data has been tagged or marked with a specific label or category that indicates what the data represents. Unlabeled data has not been tagged in this way, so the model receives only the raw inputs.
In machine learning, there are three main types of supervision:
Supervised Learning: a type of machine learning where a model is trained on a labeled dataset with known outcomes. The goal of supervised learning is to learn a mapping from input data to labeled outputs, where the output labels are typically categories or numerical values.
Example: A model can be trained to identify handwritten numbers if images of handwritten numbers are labeled with the number they represent. If images of thousands of different ways people draw the number five are labeled with the tag “5”, the model can be trained to identify handwritten “5s”.
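To make the handwritten-number example concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a production approach: the 5x3 bitmaps are hypothetical stand-ins for real images, and the nearest-centroid classifier (the helpers `to_pixels`, `train_centroids`, and `predict`) is simply one easy-to-read supervised method chosen for this sketch.

```python
# Supervised learning sketch: every training image comes with a label.
# The tiny 5x3 bitmaps below are hypothetical stand-ins for real images.

def to_pixels(rows):
    """Flatten a list of '0'/'1' strings into a list of ints."""
    return [int(ch) for row in rows for ch in row]

def train_centroids(examples):
    """Training step: average the pixel vectors seen for each label."""
    sums, counts = {}, {}
    for pixels, label in examples:
        acc = sums.setdefault(label, [0.0] * len(pixels))
        for i, p in enumerate(pixels):
            acc[i] += p
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, pixels):
    """Return the label whose average image is closest to the input."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], pixels))
    return min(centroids, key=distance)

# Labeled training data: (image, tag) pairs.
training_data = [
    (to_pixels(["111", "100", "111", "001", "111"]), "5"),
    (to_pixels(["111", "100", "110", "001", "111"]), "5"),
    (to_pixels(["010", "110", "010", "010", "111"]), "1"),
    (to_pixels(["010", "010", "010", "010", "010"]), "1"),
]
model = train_centroids(training_data)

# Two images the model has not seen during training.
unseen_five = to_pixels(["011", "100", "111", "001", "110"])
unseen_one = to_pixels(["010", "110", "010", "010", "010"])
print(predict(model, unseen_five), predict(model, unseen_one))
```

The key point is that the labels do the teaching: without the "5" and "1" tags, `train_centroids` would have nothing to average by.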
Unsupervised Learning: a type of machine learning where a model is trained on an unlabeled dataset with no known outcomes. The goal of unsupervised learning is to discover and learn patterns, structures, or relationships in the data without explicit labels or categories.
Example: If you have a dataset of customer purchases (item purchased, cost, and transaction date), but there are no explicit labels indicating categories of customers, a model could be used to gain insight into purchase patterns by grouping customers based on buying behavior. This example of unsupervised learning leverages a particular technique called clustering.
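The customer grouping described above can be sketched with k-means, one common clustering technique. This is a minimal sketch under assumptions: the two features per customer (average order cost and orders per month) are hypothetical summaries derived from purchase records, the numbers are made up for illustration, and the helper names are my own.

```python
# Unsupervised learning sketch: no labels, only customer behavior.
# Each tuple is a hypothetical (average order cost, orders per month).

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]

def kmeans(points, k, max_iterations=20):
    # Assumption: a simple deterministic start from the first k points.
    centroids = [points[i] for i in range(k)]
    assignments = assign(points, centroids)
    for _ in range(max_iterations):
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
        new_assignments = assign(points, centroids)
        if new_assignments == assignments:
            break  # converged: no customer changed group
        assignments = new_assignments
    return assignments, centroids

customers = [
    (12.0, 1), (15.0, 2), (10.0, 1),       # occasional, low-spend buyers
    (220.0, 9), (250.0, 11), (240.0, 10),  # frequent, high-spend buyers
]
groups, centers = kmeans(customers, k=2)
print(groups)
```

Note that the algorithm is never told which customers are "low-spend" or "high-spend"; the comments in the data are for the reader only. In practice one would usually rescale the features so that cost does not dominate the distance.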
Semi-Supervised Learning: a type of model training in which the machine learning model is trained using both labeled and unlabeled data. The labeled data provides some supervision, while the unlabeled data allows the algorithm to learn more about the structure of the data.
Example: In an image classification task, the model can be given a small subset of labeled images and a large number of unlabeled images. The labeled images provide the supervised signal, while the unlabeled images help the model learn the underlying structure of the data and improve classification accuracy.
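One simple way to combine the two kinds of data is self-training, also called pseudo-labeling: fit a model on the labeled examples, let it label the unlabeled examples it is confident about, and refit. The sketch below assumes this technique, since the example above does not name a specific method. The 2-D points are hypothetical image features, and the "cat"/"dog" labels, `min_margin` threshold, and helper names are illustrative choices.

```python
# Semi-supervised sketch via self-training (an assumed technique).
# Points are hypothetical 2-D feature vectors extracted from images.

def fit(labeled):
    """Nearest-centroid model: one average point per label."""
    groups = {}
    for point, label in labeled:
        groups.setdefault(label, []).append(point)
    return {
        label: tuple(sum(dim) / len(pts) for dim in zip(*pts))
        for label, pts in groups.items()
    }

def predict_with_margin(model, point):
    """Return (best label, gap between best and second-best distance)."""
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(center, point)), label)
        for label, center in model.items()
    )
    return ranked[0][1], ranked[1][0] - ranked[0][0]

def self_train(labeled, unlabeled, min_margin=5.0, rounds=5):
    labeled, remaining = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = fit(labeled)
        confident, unsure = [], []
        for point in remaining:
            label, margin = predict_with_margin(model, point)
            (confident if margin >= min_margin else unsure).append((point, label))
        if not confident:
            break  # nothing left that the model is sure about
        labeled.extend(confident)  # adopt confident guesses as labels
        remaining = [point for point, _ in unsure]
    return fit(labeled), labeled, remaining

labeled_images = [((1.0, 1.0), "cat"), ((5.0, 5.0), "dog")]
unlabeled_images = [
    (1.5, 1.2), (0.8, 1.9), (2.2, 2.0),
    (4.6, 5.3), (5.5, 4.4), (3.9, 4.1),
    (3.2, 2.9),  # ambiguous point near the boundary
]
model, final_labeled, still_unlabeled = self_train(labeled_images, unlabeled_images)
print(len(final_labeled), still_unlabeled)
```

The design choice worth noticing is the confidence threshold: the ambiguous point is left unlabeled rather than risk teaching the model a wrong answer, which is the main failure mode of self-training.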
A deeper dive into semi-supervised learning can be found here.
🤔 Why Levels of Supervision Matter
The form of supervision matters in machine learning model training because the amount and quality of labeled or unlabeled data used in training have a significant impact on a model’s accuracy and generalization ability. It is an important decision made by a machine learning engineer or an AI entrepreneur in the early days of solution design.
One would pick supervised learning when there is a well-defined prediction task and/or a sufficient amount of labeled data to train the model. The primary advantages of supervised learning include:
Predictive Accuracy: On a well-defined prediction task, supervised learning can often achieve higher accuracy than unsupervised methods because the model is trained directly against known outcomes.
Interpretability and Transparency: Supervised learning is often easier to interpret because the output relates directly to the input features, making it easier to understand how changes in the input data affect the output.
One would pick unsupervised learning when there is no well-defined prediction task and/or labeled data is limited or expensive to obtain. The primary advantages of unsupervised learning include:
No Need for Labeled Data: The lack of required labeled data makes unsupervised learning a more cost-effective alternative.
Discovery of Hidden Patterns/Relationships: Unsupervised learning can lead to new insights or knowledge about the input data through the discovery of patterns or relationships.
Flexibility: Unsupervised learning does not require prior knowledge or a specified problem and can thus be used for a wide range of problem domains or data types.
One would pick semi-supervised learning when there is a limited amount of labeled data and a large amount of unlabeled data. The primary advantages of semi-supervised learning include:
Performance: Semi-supervised learning leverages a large amount of unlabeled data to improve the accuracy and generalization ability of the model, while still using the labeled data to guide the learning process.
Lower Cost: Semi-supervised learning can be more cost-effective than supervised learning because labeling data can be expensive or time-consuming.
Data Quality: Semi-supervised learning can be particularly effective in scenarios where the labeled data is biased or incomplete, as the unlabeled data can help to overcome these limitations by providing a more diverse and representative sample of the population.
🛠 Uses of the Types of Supervision
Supervised Learning is best suited for use cases where the task involves predicting an output and the data is labeled, including:
Image Classification
Sentiment Analysis
Fraud Detection
Unsupervised Learning is best suited for use cases where there is no labeled data and the goal is to discover patterns or relationships in the data, including:
Time Series Analysis
Anomaly Detection
Dimensionality Reduction
Semi-Supervised Learning is best suited for use cases where the dataset mixes labeled and unlabeled data, including:
Text Classification
Speech Recognition
Object Detection
While there are valuable applications for all types of supervision, the growing availability of unlabeled data and the demand for more efficient and effective machine learning algorithms mean that semi-supervised learning in particular will likely play an increasingly important role across domains. Read more next week as we dive specifically into semi-supervised learning!