Gain a practical understanding of polynomial regression and how it improves modeling accuracy when relationships between variables aren't linear. Learn effective techniques to avoid overfitting and maintain reliable predictions.
Key Insights
- Polynomial regression extends beyond linear regression by fitting curves to data that exhibit non-linear relationships, offering improved accuracy when dealing with fluctuating trends, such as hourly toll booth speeds.
- NumPy's polyfit and poly1d functions simplify polynomial fitting: polyfit computes the coefficients of the best-fit polynomial, and poly1d turns those coefficients into a callable prediction function for convenient visualization and analysis.
- Increasing polynomial complexity can enhance the fit but raises the risk of overfitting, where the model closely fits existing data points but loses predictive power for new, unseen data.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Let's take a look at a more complex relationship before we move on. We're going to do a polynomial regression. What does that mean? It means that instead of fitting just a straight line, like a regular linear regression, we're going to fit a curve.
Let's take a quick look at our data here. We're in notebook 1.5, so join me there if you haven't. For our x values, we have the hours of the day, or some hours of the day anyway, the hours for which we have data. And for our y values, we have the average toll booth speed of a car at each of those hours. You can see they vary wildly, and in the middle of the night, people are speeding too much.
Let's take a look at a scatterplot of that. You can see a best-fit line is not going to do so great here, because again, it's highest here, it dips down, and then it goes back up.
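A minimal sketch of that setup, assuming the notebook uses NumPy arrays and matplotlib; the speed values below are made-up placeholders, not the notebook's actual data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hours of the day for which we have data (note 4 and 17 are missing).
hours = np.array([0, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11,
                  12, 13, 14, 15, 16, 18, 19, 20, 21, 22, 23])
# Average toll booth speed at each of those hours (placeholder values).
speeds = np.array([95, 90, 88, 85, 60, 55, 52, 50, 52, 55, 58,
                   60, 62, 60, 58, 55, 60, 65, 70, 78, 85, 92])

plt.scatter(hours, speeds)
plt.xlabel("Hour of day")
plt.ylabel("Average speed")
plt.show()
```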
We're going to need to use some different tools in order to do that. If there's a linear relationship between two different variables, linear regression is perfect. However, it's not perfectly linear here.
It goes up, it goes down, and we can just see that this is not going to work. We will look at what a linear regression would even look like here, though. So the first thing we're going to do is take a slightly different approach.
Instead of making our own predict function, we're going to let NumPy do it for us. Here's how. We're going to say predict_speed equals a function, and it's the function that poly1d gives us. poly1d pairs with polyfit, which we used before to get our y = mx + b values. What poly1d does is take those coefficients and give you back a function that can give you y from whatever formula the fit produces. So we hand it the coefficients that np.polyfit returns; we're going to polyfit hours and speeds with a degree of one, a single slope. Now that I've got this predict_speed, let's try predicting with it.
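In code, that looks roughly like this (the variable names are my guesses at what's in the notebook):

```python
# polyfit returns the coefficients of the best-fit polynomial;
# a degree of 1 means a straight line, y = mx + b.
coefficients = np.polyfit(hours, speeds, 1)

# poly1d wraps those coefficients in a callable function:
# give it an x, and it returns the predicted y.
predict_speed = np.poly1d(coefficients)

print(predict_speed(1))  # predicted speed at 1 a.m.
print(predict_speed(5))  # predicted speed at 5 a.m.
```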
We've got this predict_speed, and again, it's a function that NumPy gives us back. If I say, for example, predict_speed at the hour of one, and I print it out, there it is. That's a very bad prediction. At the hour of one, it should be a much different value; it's actually down here. What if we predict the speed at five? Five was down at 60. Nope, now it's at 70. So again, it's not a great fit, and we can graph it to see why.
Let's take a look at it. We're going to make some hour ticks, and those are just going to be the x positions on our graph. We'll use NumPy's linspace function to give us 100 evenly spaced values from zero to 23. That gives us our ticks. And now we'll make some predictions: we'll run predict_speed, the function NumPy gave us for doing this regression based on our x, our y, and a degree of one, on those hour ticks. Give me a prediction for each hour. And now we can plot that.
We've already got our scatterplot, but let's make a line. The x will be the hour ticks, and the y will be the predictions.
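Sketched out, that plotting step might look like this:

```python
# 100 evenly spaced x values covering the whole day, 0 through 23.
hour_ticks = np.linspace(0, 23, 100)

# A prediction for each tick, using the degree-1 fit from above.
predictions = predict_speed(hour_ticks)

plt.scatter(hours, speeds)         # the original data
plt.plot(hour_ticks, predictions)  # the straight-line fit
plt.show()
```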
That is not great, right? There are very few points that are even near that line. That's because we don't have a linear relationship. This line is trying to be a best fit; it's trying to be the line that has the shortest distances overall to these dots, but it's missing wildly. It's off on so many of them. So this is not working for us here. Now, that's because it's only degree one: a straight line.
If we increase this number to two, it's now a quadratic curve. Let's take a look at what that would look like. Ah, it's getting a little better. It's a closer match. However, it's still not a perfect match, because we can see the data goes down, and then it goes up, and then it kind of goes up more, and then down again. So probably the best value here is to have it curve twice: no curves is degree one, one curve is degree two, two curves is degree three. That's pretty good. It's able to give us a regression that is much closer to the data.
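Changing the degree is just a change to polyfit's last argument; a degree of three allows the curve to bend twice:

```python
# Degree 3: a cubic, which can bend twice -- a much better match
# for data that starts high, dips down, and rises again.
predict_speed = np.poly1d(np.polyfit(hours, speeds, 3))

plt.scatter(hours, speeds)
plt.plot(hour_ticks, predict_speed(hour_ticks))
plt.show()
```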
Okay, and we can actually check these as well. Let's add some little scatter points for some hours that aren't in our data. If we look, four is missing, and 17 is missing. Let's plot those from their predictions. We'll say: also scatter onto this plot an x value of four, for hour four, with a y of whatever predict_speed gives us for four. And we'll make it a reasonably sizable dot, bigger than normal, and actually, let's make it a star.
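A sketch of those extra points; the marker size and star style are guesses at what's on screen:

```python
plt.scatter(hours, speeds)
plt.plot(hour_ticks, predict_speed(hour_ticks))

# Predicted speeds for the two hours missing from our data,
# drawn as large stars on top of the existing plot.
for hour in (4, 17):
    plt.scatter(hour, predict_speed(hour), s=200, marker="*")
plt.show()
```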
And we can see where the prediction would come in for four o'clock: by four a.m., people are starting to slow down. Maybe they're very tired at that point. So this is how we can do a linear regression, sorry, a polynomial regression. Now, we want to make sure that we're not doing too many curves, because at a certain point, it becomes more about describing the existing data than about predicting.
If I keep increasing the degree, the line gets closer and closer. But is it actually going to predict values? This is what's called overfitting: making the model fit the existing data exactly, to the point that it learns too much. It hugs its current data so closely that it loses its actual ability to predict things that aren't in that data. And you can end up with some pretty crazy curves. But yeah, I don't think that's how traffic actually works.
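To see overfitting for yourself, try an absurdly high degree; the 15 here is an arbitrary choice, and NumPy may warn that the fit is poorly conditioned:

```python
# A very high degree chases every data point, swinging wildly
# between them -- a close fit to the training data, but a poor
# predictor for anything new.
overfit = np.poly1d(np.polyfit(hours, speeds, 15))

plt.scatter(hours, speeds)
plt.plot(hour_ticks, overfit(hour_ticks))
plt.show()
```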
Very close to the dots, though. All right, folks. We'll see you in the next lecture, in notebook two.