Gain a practical understanding of polynomial regression and how it improves modeling accuracy when relationships between variables aren't linear. Learn effective techniques to avoid overfitting and maintain reliable predictions.
Key Insights
- Polynomial regression extends beyond linear regression by fitting curves to data that exhibit non-linear relationships, offering improved accuracy when dealing with fluctuating trends, such as hourly toll booth speeds.
- NumPy's polyfit and poly1d functions simplify polynomial fitting: polyfit computes the coefficients of the best-fit polynomial, and poly1d turns those coefficients into a callable prediction function for convenient visualization and analysis.
- Increasing polynomial complexity can enhance the fit but raises the risk of overfitting, where the model closely fits existing data points but loses predictive power for new, unseen data.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Let's take a look at a more complex relationship before we move on. We're going to do a polynomial regression. What does that mean? It means that instead of fitting just a straight line, like a regular linear regression, we're going to fit a curve.
Let's take a quick look at our data here. We're in notebook 1.5, so join me there if you haven't. For our x values, we have the hours of the day, or some hours of the day anyway, the hours for which we have data. And for our y values, we have the average toll booth speed of a car at each of those hours. You can see they vary wildly, and in the middle of the night, people are speeding too much.
Let's take a look at a scatterplot of that. You can see a best-fit line is not going to do so great here, because again, it's highest here, it dips down, and then it goes back up.
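A minimal sketch of that setup, assuming the notebook uses NumPy arrays and matplotlib; the speed values below are made-up placeholders, not the notebook's actual data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hours of the day for which we have data (note 4 and 17 are missing).
hours = np.array([0, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11,
                  12, 13, 14, 15, 16, 18, 19, 20, 21, 22, 23])
# Average toll booth speed at each of those hours (placeholder values).
speeds = np.array([95, 90, 88, 85, 60, 55, 52, 50, 52, 55, 58,
                   60, 62, 60, 58, 55, 60, 65, 70, 78, 85, 92])

plt.scatter(hours, speeds)
plt.xlabel("Hour of day")
plt.ylabel("Average speed")
plt.show()
```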
We're going to need to use some different tools in order to do that. If there's a linear relationship between two different variables, linear regression is perfect. However, it's not perfectly linear here.
It goes up, it goes down, and we can just see that this is not going to work. We will look at what a linear regression would even look like here, though. So the first thing we're going to do is take a slightly different approach.
Instead of making our own predict function, we're going to let NumPy do it for us. Here's how. We're going to say predict_speed equals a function, and it's the function that poly1d gives us. poly1d pairs with polyfit, which we used before to get our y = mx + b values. What poly1d does is take those coefficients and give you back a function that can give you y from whatever formula the fit produces. So we hand it the coefficients that np.polyfit returns; we're going to polyfit hours and speeds with a degree of one, a single slope. Now that I've got this predict_speed, let's try predicting with it.
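In code, that looks roughly like this (the variable names are my guesses at what's in the notebook):

```python
# polyfit returns the coefficients of the best-fit polynomial;
# a degree of 1 means a straight line, y = mx + b.
coefficients = np.polyfit(hours, speeds, 1)

# poly1d wraps those coefficients in a callable function:
# give it an x, and it returns the predicted y.
predict_speed = np.poly1d(coefficients)

print(predict_speed(1))  # predicted speed at 1 a.m.
print(predict_speed(5))  # predicted speed at 5 a.m.
```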
We've got this predict_speed, and again, it's a function that NumPy gives us back. If I say, for example, predict_speed at the hour of one, and I print it out, there it is. That's a very bad prediction. At the hour of one, it should be a much different value; it's actually down here. What if we predict the speed at five? Five was down at 60. Nope, now it's at 70. So again, it's not a great fit, and we can graph it to see why.
Let's take a look at it. We're going to make some hour ticks, and those are just going to be the x positions on our graph. We'll use NumPy's linspace function to give us 100 evenly spaced values from zero to 23. That gives us our ticks. And now we'll make some predictions: we'll run predict_speed, the function NumPy gave us for doing this regression based on our x, our y, and a degree of one, on those hour ticks. Give me a prediction for each hour. And now we can plot that.
We've already got our scatterplot, but let's make a line. The x will be the hour ticks, and the y will be the predictions.
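Sketched out, that plotting step might look like this:

```python
# 100 evenly spaced x values covering the whole day, 0 through 23.
hour_ticks = np.linspace(0, 23, 100)

# A prediction for each tick, using the degree-1 fit from above.
predictions = predict_speed(hour_ticks)

plt.scatter(hours, speeds)         # the original data
plt.plot(hour_ticks, predictions)  # the straight-line fit
plt.show()
```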
That is not great, right? There are very few points that are even near that line. That's because we don't have a linear relationship. This line is trying to be a best fit; it's trying to be the line that has the shortest distances overall to these dots, but it's missing wildly. It's off on so many of them. So this is not working for us here. Now, that's because it's only degree one: a straight line.
If we increase this number to two, it's now a quadratic curve. Let's take a look at what that would look like. Ah, it's getting a little better. It's a closer match. However, it's still not a perfect match, because we can see the data goes down, and then it goes up, and then it kind of goes up more, and then down again. So probably the best value here is to have it curve twice: no curves is degree one, one curve is degree two, two curves is degree three. That's pretty good. It's able to give us a regression that is much closer to the data.
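Changing the degree is just a change to polyfit's last argument; a degree of three allows the curve to bend twice:

```python
# Degree 3: a cubic, which can bend twice -- a much better match
# for data that starts high, dips down, and rises again.
predict_speed = np.poly1d(np.polyfit(hours, speeds, 3))

plt.scatter(hours, speeds)
plt.plot(hour_ticks, predict_speed(hour_ticks))
plt.show()
```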
Okay, and we can actually check these as well. Let's add some little scatter points for some hours that aren't in our data. If we look, four is missing, and 17 is missing. Let's plot those from their predictions. We'll say: also scatter onto this plot an x value of four, for hour four, with a y of whatever predict_speed gives us for four. And we'll make it a reasonably sizable dot, bigger than normal, and actually, let's make it a star.
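A sketch of those extra points; the marker size and star style are guesses at what's on screen:

```python
plt.scatter(hours, speeds)
plt.plot(hour_ticks, predict_speed(hour_ticks))

# Predicted speeds for the two hours missing from our data,
# drawn as large stars on top of the existing plot.
for hour in (4, 17):
    plt.scatter(hour, predict_speed(hour), s=200, marker="*")
plt.show()
```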
And we can see where the prediction would come in for four o'clock: by four a.m., people are starting to slow down. Maybe they're very tired at that point. So this is how we can do a linear regression, sorry, a polynomial regression. Now, we want to make sure that we're not doing too many curves, because at a certain point, it becomes more about describing the existing data than about predicting.
If I keep increasing the degree, the line gets closer and closer. But is it actually going to predict values? This is what's called overfitting: making the model fit the existing data exactly, to the point that it learns too much. It hugs its current data so closely that it loses its actual ability to predict things that aren't in that data. And you can end up with some pretty crazy curves. But yeah, I don't think that's how traffic actually works.
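To see overfitting for yourself, try an absurdly high degree; the 15 here is an arbitrary choice, and NumPy may warn that the fit is poorly conditioned:

```python
# A very high degree chases every data point, swinging wildly
# between them -- a close fit to the training data, but a poor
# predictor for anything new.
overfit = np.poly1d(np.polyfit(hours, speeds, 15))

plt.scatter(hours, speeds)
plt.plot(hour_ticks, overfit(hour_ticks))
plt.show()
```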
Very close to the dots, though. All right, folks. We'll see you in the next lecture, in notebook two.