Implement the k-nearest neighbors classifier effectively by understanding the significance of choosing the appropriate number of neighbors. Learn how training this supervised machine learning model enables accurate predictions for new data points.
Key Insights
- The k-nearest neighbors (KNN) classifier is a supervised machine learning algorithm that predicts classes by analyzing the majority vote among the closest data points, typically using an odd number like three or five to avoid ties.
- Selecting more neighbors generally increases model accuracy but also intensifies computational demands; three neighbors is a common starting point, though five has become increasingly popular due to improved computing power.
- Training the KNN model involves fitting the classifier with labeled data points (X, Y coordinates) and their corresponding classes (zero or one), enabling the model to predict the category of new, unseen data points effectively.
Let's start a k-neighbors classifier, our supervised machine learning model, based on the values we've got so far. This time we're going to save the zipped-up version of X and Y. We'll say data points equals a list made from zipping up X and Y. And we can look at data points here. And there they are.
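For reference, here's a minimal sketch of that step in Python, with made-up x and y values standing in for whatever was built earlier in the lesson:

```python
# Assumed example coordinates standing in for the x and y lists
# built earlier in the lesson.
x = [1, 2, 3, 7, 8, 9]
y = [2, 1, 3, 8, 9, 7]

# zip() pairs each x value with its y value; wrapping it in list()
# gives a list of (x, y) tuples we can reuse later.
data_points = list(zip(x, y))
print(data_points)  # [(1, 2), (2, 1), (3, 3), (7, 8), (8, 9), (9, 7)]
```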
This is the same as before, but now we've saved them because we're going to feed them to the k-nearest neighbors classifier. All right, first we create our k-nearest neighbors classifier. We'll say KNN model is our k-neighbors classifier.
And we set the number of neighbors to three. Three is a typical number, and five is also typical. This is how many neighbors to look at when you're trying to find the majority winner.
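In scikit-learn, which is the usual library for this, creating the classifier looks roughly like this (the variable name is just a placeholder):

```python
from sklearn.neighbors import KNeighborsClassifier

# n_neighbors=3 means each prediction is a majority vote among the
# three closest training points.
knn_model = KNeighborsClassifier(n_neighbors=3)
```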
Now, which is this one the most like? The more neighbors you include, the more accurate your model is going to be, but also the more time-intensive it will be to train your model. Three is a fine midpoint. People tend to use five more these days.
Again, that's because computers are better; they're always getting faster at this. It also depends on how complicated your data is. One is generally seen as way too low now.
Now, it has to be an odd number here because it has to be able to find a majority; it can't have a tie. If it considered four neighbors or two neighbors, then it could have a tie.
One is this class and one is that class, or two are this class and two are that class. And we want to get a definitive prediction, not a "well, I don't know" kind of prediction.
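As a side illustration (not part of the lesson's code), here's why an even neighbor count can deadlock while an odd count between two classes always produces a winner:

```python
from collections import Counter

# With four neighbors, the vote can split 2-2: no clear winner.
even_votes = Counter([0, 0, 1, 1])
print(even_votes.most_common())  # [(0, 2), (1, 2)] -- a tie

# With three neighbors and two classes, one class always has the majority.
odd_votes = Counter([0, 1, 1])
print(odd_votes.most_common(1))  # [(1, 2)] -- class 1 wins
```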
So that's why we use an odd number. All right, let's run that. And boom, we have a KNN model, but it's not trained yet.
The training, or fitting, of our model means, again, giving it an X and a Y. In this case, our X is our data points, the X, Y coordinates, and classes is the ultimate category each one belongs to, zero or one. Let's run that and I'll talk about it.
KNNModel.fit. We'll pass it the data points and the classes. And now what's happened is we got back a model, yay.
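A hedged sketch of that fitting step, continuing from the snippets above and assuming classes is the 0/1 label list created earlier in the lesson:

```python
# Assumed labels: one class per data point, 0 or 1.
classes = [0, 0, 0, 1, 1, 1]

# fit() stores the labeled points so the model can vote on new ones later.
knn_model.fit(data_points, classes)
```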
And it is now trained on the data, in a slightly different way than the other one. It didn't make a new algorithm; instead, it now has all the data it needs to run the k-nearest neighbors algorithm on any new points we test it on.
Let's test it, in fact. First, we're going to add a new data point. And then we're going to see how it does predicting it.
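A sketch of what that test could look like, continuing from the snippets above and using a made-up point; the actual point used in the lesson may differ:

```python
# A hypothetical new point we want to classify.
new_point = (8, 8)

# predict() expects a 2D array-like, so wrap the single point in a list.
prediction = knn_model.predict([new_point])
print(prediction)  # e.g. [1] -- the majority class among its three nearest neighbors
```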