Prepare a linear regression model to accurately predict car prices using key attributes such as fuel efficiency, engine size, and horsepower. Learn essential steps, from cleaning your dataset to structuring inputs and outputs for optimal model performance.
Key Insights
- Cleaned the car sales dataset by removing unnecessary columns ("sales in thousands") and dropping rows with missing values, reducing the dataset from 157 to 153 usable rows.
- Established the features dataframe (X), consisting solely of fuel efficiency, horsepower, and engine size, prepared explicitly for training the linear regression model.
- Defined the target variable (y) as the "price in thousands" column, providing clear outcome values that enable the model to learn relationships between car attributes and pricing.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Now that we've done some data analysis and used our domain knowledge to pick out which attributes matter, we're going to narrow things down. We're going to clean up the data, split it into training and testing data, standardize it, and do everything else we need to do. Getting the data into the right shape is a big part of letting the model do its best job. All right, so that's what we're going to do.
We're going to train this linear regression machine learning model to predict car prices based on three variables: fuel efficiency, engine size, and horsepower. Again, not sales in thousands; that does not seem to be an important factor. We'll save those three columns in a new data frame called X, capital X, which is just the standard name for your features data frame.
We'll also save the price column as y. That's just a series, a single column, and by convention it's a lowercase y. Those are going to be our answers, our prices. Once the data is cleaned up a little, we'll be able to split it.
The very first thing I want to do, though, is clean this data by removing the extra columns and rows. To get rid of the sales in thousands column, we're going to re-assign carsales to be just some of its columns: everything but sales in thousands.
So here's the old version of it. Notice I grabbed too many brackets. We still want two pairs, the outer pair to index the data frame and the inner pair for the list of column names, but there we go.
But we're going to get rid of sales in thousands. We're going to keep just fuel efficiency, horsepower, engine size, and price in thousands. Next, it's a good idea to drop the rows with values we can't use.
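As a minimal sketch of the column selection just described: the column names below (such as `Sales_in_thousands`) and the three rows of values are assumptions made up for illustration, since the transcript does not show the exact headers. The variable name `carsales` follows the transcript's spelling.

```python
import pandas as pd

# Toy stand-in for the car sales data. Column names and values are
# assumptions for illustration, not the course's actual dataset.
carsales = pd.DataFrame({
    "Sales_in_thousands": [16.9, 39.4, 14.1],
    "Fuel_efficiency": [28.0, 25.0, 22.0],
    "Horsepower": [140.0, 225.0, 210.0],
    "Engine_size": [1.8, 3.2, 3.5],
    "Price_in_thousands": [21.5, 28.4, 42.0],
})

# Outer brackets index the DataFrame; the inner brackets are a plain
# Python list of the column names to keep.
keep = ["Fuel_efficiency", "Horsepower", "Engine_size", "Price_in_thousands"]
carsales = carsales[keep]

print(carsales.columns.tolist())
```

Passing a list inside the indexing brackets returns a data frame containing only the listed columns, which is why two pairs of brackets appear.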
Let's take a look at what I mean here. If we say carsales.isna().sum(), it will sum up for us how many values in each column are not usable, meaning they are missing (NaN, not a number). There are just a few.
There are a couple of rows missing fuel efficiency, some missing horsepower and engine size (those might even be the same car), and two that don't have a price in thousands. Those are not usable. We want to drop those.
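The missing-value count can be sketched like this. The data frame is a made-up four-row example with assumed column names, so the counts it prints are not the counts from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data with a few deliberately missing values (assumed column names).
carsales = pd.DataFrame({
    "Fuel_efficiency": [28.0, np.nan, 26.0, 22.0],
    "Horsepower": [140.0, 225.0, np.nan, 210.0],
    "Engine_size": [1.8, 3.2, np.nan, 3.5],
    "Price_in_thousands": [21.5, 28.4, 30.0, np.nan],
})

# isna() marks each missing cell as True; sum() then counts the True
# values in each column.
missing_per_column = carsales.isna().sum()
print(missing_per_column)
```

Here the third row is missing both horsepower and engine size, which mirrors the "might even be the same car" situation: two missing values, but only one unusable row.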
Right now, if we look at carsales, it's 157 rows. We can drop the unusable rows, the ones with NA values. We'll say carsales.dropna(). Actually, we will say carsales equals carsales.dropna(), because dropna() returns a new data frame.
It doesn't change the original one. Then we'll look at carsales again. Now it's 153 rows.
We lost four cars, but our data is a lot cleaner. There are no unusable rows left. All right, let's take a look at the next step.
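The point about re-assignment can be checked with a small sketch. The five-row data frame and its column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data: two of the five rows contain a missing value.
carsales = pd.DataFrame({
    "Fuel_efficiency": [28.0, np.nan, 26.0, 22.0, 24.0],
    "Price_in_thousands": [21.5, 28.4, 30.0, np.nan, 19.9],
})
rows_before = len(carsales)

# Calling dropna() without keeping the result leaves carsales untouched.
carsales.dropna()
rows_without_assignment = len(carsales)

# Re-assigning keeps the cleaned copy.
carsales = carsales.dropna()
rows_after = len(carsales)

print(rows_before, rows_without_assignment, rows_after)
```

By default dropna() removes every row that has at least one missing value and returns the result as a new data frame.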
We removed sales in thousands, and then we dropped the unusable rows. It's important that we do it in this order. If we dropped unusable rows first, we might drop rows whose only problem is a missing value for sales in thousands, and we don't care about sales in thousands. We don't want to drop rows because of missing values in columns we're not even going to use.
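To see why the order matters, here is a small sketch on made-up data (assumed column names, four rows). One row is missing only its sales figure, so it survives only when the columns are narrowed down first.

```python
import numpy as np
import pandas as pd

# Toy data: row 1 is missing only the column we plan to discard,
# row 3 is missing a column we actually need.
carsales = pd.DataFrame({
    "Sales_in_thousands": [16.9, np.nan, 14.1, 8.6],
    "Fuel_efficiency": [28.0, 25.0, 26.0, np.nan],
    "Price_in_thousands": [21.5, 28.4, 30.0, 42.0],
})
wanted = ["Fuel_efficiency", "Price_in_thousands"]

# Dropping rows first also throws away row 1, even though its
# features and price are complete.
drop_then_select = carsales.dropna()[wanted]

# Selecting columns first keeps row 1 and drops only row 3.
select_then_drop = carsales[wanted].dropna()

print(len(drop_then_select), len(select_then_drop))
```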
So it's important to get the data down to the columns you want first and then drop the unusable rows. All right, now that we've got that, we're going to make our X value. These are the input values we want. My first thought was to make it just a copy of carsales, because that's already close to the format we want.
It's not going to be that, though, because X is going to be our features: fuel efficiency, horsepower, and engine size. It's going to be these three columns, but not the price, because we're talking about the inputs here. What are the inputs? Okay, so what I actually want is carsales with every column except price in thousands.
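A minimal sketch of building X, assuming the cleaned data frame has the four columns named below (the names and values are illustrative assumptions). It shows two equivalent ways to get "everything but the price"; the transcript does not say which one the instructor typed.

```python
import pandas as pd

# Toy stand-in for the already cleaned carsales data frame.
carsales = pd.DataFrame({
    "Fuel_efficiency": [28.0, 25.0, 22.0],
    "Horsepower": [140.0, 225.0, 210.0],
    "Engine_size": [1.8, 3.2, 3.5],
    "Price_in_thousands": [21.5, 28.4, 42.0],
})

# Option 1: list the feature columns explicitly.
X = carsales[["Fuel_efficiency", "Horsepower", "Engine_size"]]

# Option 2: drop the target column and keep whatever remains.
X_alt = carsales.drop(columns=["Price_in_thousands"])

print(X.shape)
```

Listing the features explicitly is a little more typing, but it keeps X unchanged if extra columns are later added to carsales.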
So fuel efficiency, horsepower, and engine size. That's our X, and if we look at X, we'll see that it's based on the version of carsales that has already had sales in thousands removed and has then had its NA rows dropped.
So that's why it's only 153 rows now, and it's just the data we're going to use as our inputs. All right, next we'll work on y. That's not our training data; y is our output, our label, our prediction, our "answer" in quotes. It's the answer for each row. Given the fuel efficiency, horsepower, and engine size for index zero, what was the answer? What was the price? Let's take a look.
Let's make it. All we need to do is say y equals carsales at the column price in thousands. Then we'll take a look at y, and it should be the exact same length, 153 rows. For each row, we're going to give the model both of these. Here's row zero.
It has these three numbers, and this is the answer you get with those three numbers. Now look at row one. See how it has a much higher horsepower and engine size and a lower fuel efficiency? Here's the price of that one. Through this the model will learn that from X it gets y, over and over, and it'll start to come up with a formula for how to predict that. In a moment, let's talk about splitting this up into testing and training data.
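The pairing of X and y described above can be sketched as follows. The column names and the three rows are assumptions for illustration; with the real data both objects would have 153 rows.

```python
import pandas as pd

# Toy stand-in for the cleaned carsales data frame (assumed names/values).
carsales = pd.DataFrame({
    "Fuel_efficiency": [28.0, 25.0, 22.0],
    "Horsepower": [140.0, 225.0, 210.0],
    "Engine_size": [1.8, 3.2, 3.5],
    "Price_in_thousands": [21.5, 28.4, 42.0],
})

X = carsales[["Fuel_efficiency", "Horsepower", "Engine_size"]]

# Single brackets with one column name return a Series, not a DataFrame.
y = carsales["Price_in_thousands"]

# Each row of X lines up with the entry of y at the same index.
for i in range(2):
    print(X.iloc[i].tolist(), "->", y.iloc[i])
```

Because X and y come from the same cleaned data frame, they share the same index, so row zero of X is matched with entry zero of y.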