Model Accuracy: Impact of Outliers and Dataset Size

Evaluate the impact of removing outliers on model performance and note the variability introduced by small datasets.

Enhance your model's accuracy by effectively identifying and managing outliers in your dataset. Gain insights into how dataset size and variability significantly influence the reliability of model predictions.

Key Insights

  • Removing outliers from a dataset slightly improved the linear regression model's accuracy, but the improvement was marginal and inconsistent across multiple test runs.
  • Small datasets of around 150 rows, when split into training and test sets, exhibited considerable variability, leading to inconsistent prediction accuracy ranging from as low as 44% to as high as 79%.
  • Future analysis should focus on larger datasets, as limited data contributes significantly to prediction variability and impacts model reliability.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

Let's try this model again and see if we get a better answer without the outlier. We may not. Let's give it a shot as part of exploring our data.

Let's create an x2 and a y2. First, x2 is our car sales data with the columns fuel efficiency, horsepower, and engine size, but not price in thousands, since that's the value we're trying to predict.

All right, and our y2 is our car sales' price in thousands column.
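In code, that setup might look like the following sketch. The car_sales DataFrame and the exact column names are assumptions based on the narration, so the actual notebook may spell them differently:

```python
# Features: everything except the target column (names assumed).
X2 = car_sales[["fuel_efficiency", "horsepower", "engine_size"]]

# Target: the price column we want to predict.
y2 = car_sales["price_in_thousands"]
```

All right, now let's split that data. We'll say x train.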

You know, this is a great time, a great opportunity to show that we don't always have this memorized, even though the order here is really important. What we're going to do is call train_test_split and pass in our x2, our y2, and a test size of 20%.

This returns a tuple, but I literally forget every time what the order is of train versus test, so it's really worth always making sure you get this right; you're not expected to memorize the structure of the return value. So let's look it up: it's x train, x test, y train, y test.
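Spelled out, with variable names assumed to match the notebook, the call looks like this:

```python
from sklearn.model_selection import train_test_split

# The return order is fixed: X train, X test, y train, y test.
X_train, X_test, y_train, y_test = train_test_split(
    X2, y2, test_size=0.2  # hold out 20% of the rows for testing
)
```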


So: x train, x test, y train, y test. We unpack those four values and name them accordingly, there we go. It was mad at me because I had a slight indentation there, and that's a problem because Python reads it as the start of an indented block.
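As a quick illustration of that gotcha, a stray indent on a line of its own is an error in Python:

```python
X_train, X_test, y_train, y_test = train_test_split(X2, y2, test_size=0.2)
    print(X_train)  # IndentationError: unexpected indent
```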

That's Python. Yeah, Python indentation, something you need to know. Okay, let's run these.

You can see I failed to actually run them. It's very important that you do that. Let's scale our x values now.

x_train equals StandardScaler's fit_transform of x_train, and likewise for x_test. Again, we're scaling things here so that they're all centered around a mean of zero and measured in standard deviations.
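Here's a sketch of that scaling step. One note: the standard practice is to fit the scaler on the training data only, then reuse those same statistics on the test data, and that's what's shown here:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Learn the mean and standard deviation from the training set...
X_train = scaler.fit_transform(X_train)

# ...then apply those same statistics to the test set, so no
# information about the test rows leaks into the scaling.
X_test = scaler.transform(X_test)
```

Okay, run that, and now what we're going to do is make the model.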

Let's call it model two, and it's a linear regression. Let's train the model, and... what did I do wrong here? Oh, we train it on our data, of course: our training data.

It needs some data to train on. It can't just be told to train on nothing.
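Creating and fitting the model is only a couple of lines; a minimal sketch, with the model_2 name assumed:

```python
from sklearn.linear_model import LinearRegression

model_2 = LinearRegression()
model_2.fit(X_train, y_train)  # learn coefficients from the training data
```

All right, now let's test it and see how it did without the outlier.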

It might have been the same. We'll skip right to: what's the score? Score two is model two's score method, and we give it the x test and the y test data.
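That scoring step might look like this:

```python
# score() predicts from X_test and compares the predictions
# against y_test; for LinearRegression this is the R-squared value.
score_2 = model_2.score(X_test, y_test)
print(score_2)
```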

It predicts from the x test data, measures the predictions against the y test data (for a linear regression, this score is technically the R-squared value), and... it's significantly worse. So, here's the funny thing: if I run this again, I'll get the same value.

However, there is a fair amount of randomness here. If I run everything from here down, we will actually get a different number, and that's because we're working with a small sample size: 150-something rows is not really that many. It's not much data in the grand scheme of things.

So when we split it up 80/20, which is what this line does, sometimes we get quite a lot of variance in how well the training data predicts the test data. How alike are these two pieces of data? The test set is only 30-ish rows drawn at random from the 150, so how representative are they of the training population? If I run this and all the code below it (there's a command for that: Runtime, run cell and below), now it's 79%, which is significantly better than we did before.

Now, there are other ways we could measure this. Now it's 60%. Now it's 69%, only very slightly above the original.

44%, another bad one. Right? So these numbers swing around quite a bit here. But the question is: is the model even better for having removed the outlier? And the answer is, like, a little.

I've run this 10,000 times in a loop to see what we get, and to see which of these models, model one or model two, scores better on average. Model two, without the outlier, scores slightly better, but only very slightly.
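That comparison might look something like the sketch below, assuming X1 and y1 come from the original data with the outlier still in it, and X2 and y2 from the cleaned version; all of these names are assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def average_score(X, y, runs=10_000):
    """Average the test-set score over many fresh random splits."""
    scores = []
    for _ in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
        scaler = StandardScaler()
        X_tr = scaler.fit_transform(X_tr)
        X_te = scaler.transform(X_te)
        scores.append(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))
    return np.mean(scores)

# With the outlier (X1, y1) vs. without it (X2, y2):
# print(average_score(X1, y1), average_score(X2, y2))
```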

Now, this highlights two things. One: removing that outlier helps. It might help more if there were really, really good reasons for removing the outlier, or it might actually hurt if there were bad reasons for removing it.

But that's one technique we have in our tool chest to improve our model's accuracy: removing outliers. Another very important one, which we'll be using pretty much from now on, is working with bigger datasets.

This high amount of variance comes from a real problem with this data, which is simply that there isn't enough of it to consistently get a good model out of it. We'll take a look in future lessons at much bigger datasets and see what we get from those. All right.

We'll move on to the next notebook. I hope that you got a lot out of training your first model, looking at its score, and thinking about ways we can improve it.

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.

