Additional Content
Does data always have to be labelled?
No, not always. A large share of machine-learning algorithms fall into one of two families: supervised learning algorithms, which learn from labelled data, and unsupervised learning algorithms, which work without labels.
When you want to classify a photo as a dog, cat or gorilla, you could feed the machine photos tagged as dog, cat or gorilla. When you want to grade an essay, you could feed it a lot of corrected essays, each labelled with its grade. In each case, we know what the output should look like: dog, cat, gorilla, A+, A, A-, D…
Given labelled data during training, the algorithm tries to find a function, or a mathematical recipe if you like, that maps each input to the right output. Often, this also means that the programmer tries out different algorithms to see which one comes up with the best matching function. As long as the data has labels, these labels act like a supervisor or a guide that verifies that the function selected by the algorithm does indeed work¹. If the function gives an output different from the label, the algorithm has to find a better one.
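To make the idea of labels acting as a supervisor concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy data (animals described only by body weight), the names `make_rule`, `count_errors` and `train`, and the three candidate thresholds. Real systems learn from far richer inputs, such as the pixels of a photo, and search over far more candidate functions.

```python
# Hypothetical labelled data: (body weight in kg, label given by a human).
labelled_animals = [
    (4.0, "cat"), (5.5, "cat"), (3.2, "cat"),
    (20.0, "dog"), (31.0, "dog"), (12.5, "dog"),
]

def make_rule(threshold):
    """Candidate function: anything lighter than the threshold is called a cat."""
    return lambda weight: "cat" if weight < threshold else "dog"

def count_errors(rule, data):
    """The labels supervise: count the outputs that disagree with them."""
    return sum(1 for weight, label in data if rule(weight) != label)

def train(data, candidate_thresholds):
    """Keep the candidate whose outputs disagree with the fewest labels."""
    return min(candidate_thresholds,
               key=lambda t: count_errors(make_rule(t), data))

best_threshold = train(labelled_animals, [2.0, 8.0, 25.0])
classify = make_rule(best_threshold)

print(best_threshold)   # 8.0
print(classify(4.8))    # cat
print(classify(18.0))   # dog
```

The thresholds 2.0 and 25.0 are rejected because their outputs contradict some of the labels, which is the "find a better one" step described above.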
But labelling data is a time-consuming and costly process, which often involves hiring human beings. Also, if we are just looking for patterns in the data and don’t have a clear idea of what pattern we will find, the output is not even known to us. Thus, the data cannot be labelled. This is where unsupervised algorithms come in.
Instead of trying to match input to output, these algorithms try to find regularities in the data that will help group the input into categories¹. Banks use unsupervised machine learning to detect fraudulent activity in credit card transactions. Since a huge number of transactions take place every minute, and we do not know in advance which patterns would mark a transaction as fraudulent, we rely on machine learning to find the patterns automatically. Dividing a given set of students into a fixed number of groups is another problem that is often tackled with unsupervised learning. So is looking for signs of terrorist activity in the cellular activity of a network.
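The student-grouping example can be sketched with a simple clustering procedure. The code below is a bare-bones, one-dimensional version of k-means, chosen here only as one common clustering technique; the text above does not name a specific algorithm. The function name `k_means_1d`, the test scores and the way the starting centres are picked are all assumptions made for illustration.

```python
def k_means_1d(values, k, iterations=20):
    """Split numbers into k groups without using any labels."""
    if not 2 <= k <= len(values):
        raise ValueError("k must be between 2 and the number of values")
    ordered = sorted(values)
    # Assumption: start with centres spread evenly across the sorted data.
    centres = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        # Step 1: put every value in the group of its nearest centre.
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        # Step 2: move each centre to the average of its group.
        new_centres = [sum(g) / len(g) if g else c
                       for g, c in zip(groups, centres)]
        if new_centres == centres:  # nothing moved, so the groups are stable
            break
        centres = new_centres
    return groups

# Hypothetical test scores for nine students; no one has labelled them.
scores = [35, 42, 38, 61, 65, 58, 88, 92, 85]
groups = k_means_1d(scores, k=3)
print(groups)  # [[35, 42, 38], [61, 65, 58], [88, 92, 85]]
```

Nobody told the program what the three groups mean. It only found that the scores fall into three bunches; deciding how to interpret or name those bunches is left to a human.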
¹ Kelleher, J.D., Tierney, B., Data Science, London, 2018.