Supervised learning is the term used in ML (machine learning) to refer to learning systems (machine learning models) that are trained to produce appropriate output for a given input. The key word there is trained: in other words, we have at least some data where we know what the output should be for a given set of inputs to the machine learning model.
Let’s see an example. Imagine you have a set of data about apartments for rent in downtown Manhattan. You have information such as the size of the apartment in square feet, number of bedrooms, number of bathrooms, ZIP code, street address, and so on. You also know how much each apartment costs to rent, per month.
Using this information, you want to be able to predict what the rent will be for apartments that are not in your dataset. Perhaps you run a rental website and want to let your customers know what they should expect to pay to live in certain areas of the city or in certain types of apartments.
You split the data for each individual apartment into two pieces:
- Input: All the information you have about the apartment (size, bedrooms, ZIP code…) except for the monthly rent
- Output: The monthly rent for the apartment
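As a rough sketch of this split (the field names and numbers here are made up for illustration), each apartment record can be separated into its input features and its known output:

```python
# A toy apartment record; field names and values are hypothetical.
apartment = {
    "size_sqft": 650,
    "bedrooms": 1,
    "bathrooms": 1,
    "zip_code": "10013",
    "monthly_rent": 3800,
}

def split_record(record):
    """Separate one record into (input features, known output)."""
    features = {k: v for k, v in record.items() if k != "monthly_rent"}
    target = record["monthly_rent"]
    return features, target

X, y = split_record(apartment)
# X holds everything except the rent; y is the rent we want to predict.
```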
You can then train a machine learning model using this data. I won’t go into detail here about how that is done. In practice, you use existing software to feed inputs into your model, check the model’s output against the correct output (monthly rent), then make changes to your model until the outputs from the model start getting more accurate (i.e. closer to the known outputs from your training data).
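To make that feedback cycle concrete, here is a deliberately tiny training loop in plain Python. It is not how you would do this in practice (you would use an ML library), but it shows the same pattern: feed an input in, compare the model's output to the known output, and nudge the model toward accuracy. The data is toy data, and the one-parameter model (rent = rate × size) is a stand-in for a real model:

```python
# Toy training data: apartment sizes and their known monthly rents.
sizes = [500.0, 800.0, 1100.0, 650.0]     # square feet (made up)
rents = [3000.0, 4800.0, 6600.0, 3900.0]  # known outputs (made up)

rate = 0.0            # the model's single parameter: dollars per sq ft
learning_rate = 1e-7  # how strongly to adjust after each example

for _ in range(1000):
    for size, rent in zip(sizes, rents):
        predicted = rate * size          # feed input into the model
        error = predicted - rent         # compare to the known output
        rate -= learning_rate * error * size  # adjust the model

# After training, the model predicts rent for an unseen apartment size.
estimated_rent = rate * 700
```

Real models have many more parameters and real libraries handle the adjustment step for you, but the train-check-adjust loop is the same idea.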
This is the essence of supervised learning: you supervise your machine learning model by checking its output against known “correct” outputs, then making adjustments to the model until it starts to give the “right answers”. This is in contrast to unsupervised learning, where you may not know what the outputs “should be” for a given set of data. See the article “What Is Unsupervised Learning?” for more detail.
One important detail which can never be stressed enough is that you must test your machine learning model after it has been trained. Just because your model produces appropriate outputs on the training data does not mean the model will work in the real world! Somewhat like a college student who simply memorizes answers to the test, machine learning models can overfit their training data, meaning they “know” the correct output for each input you used in training, but this does not generalize to new data.
Machine learning professionals solve this problem by setting aside some data before they begin training, which is called the test data. Like the training data, correct outputs are known for each item of input data. Unlike the training data, the machine learning model has no access to this data during training. This data is fed to the machine learning model after training is complete. If the system makes good predictions on the test data, this helps to confirm that the model will work well in the “real world” on data which the model has not seen before.
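A minimal sketch of setting data aside (again with toy values; in practice you would shuffle the data first and typically use a library helper for the split):

```python
# Each item pairs input features [size_sqft, bedrooms] with the known rent.
data = [
    ([500, 1], 3000), ([800, 2], 4800), ([1100, 3], 6600),
    ([650, 1], 3900), ([900, 2], 5400), ([700, 1], 4200),
]

# Hold out the last third as test data. The model never sees these
# examples during training; they are used only to evaluate it afterward.
split = int(len(data) * 2 / 3)
train_data = data[:split]
test_data = data[split:]
```

Good performance on `test_data` is what gives you confidence the model generalizes, rather than having merely memorized `train_data`.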
Key points to remember:
- In supervised learning, you have access to at least some data where you know what the “right answer” (output) is for each input you are provided
- Supervised learning involves training your machine learning model, tuning it until it starts to produce correct outputs for each input that you feed in
- Supervised learning systems can’t be said to be trustworthy or accurate until you test them on example data that was not used during training
Don’t forget! Supervised learning is just one type of learning system available to us in the world of ML. There are many classes of problems where we do not know what the “correct” output should be for a given dataset. See the article “What Is Unsupervised Learning?” to learn more.