Refine predictive models by effectively removing outliers from datasets, focusing on automotive sales data. Learn to assess improvements in model accuracy through strategic data preprocessing techniques.
Key Insights
- Removing price outliers above $80,000 from the dataset eliminated two rows, improving the quality of the car sales data.
- Filtering out vehicles with engine sizes greater than seven reduced the dataset by one more entry, resulting in an optimized set of 150 rows.
- The process outlined involves redefining variables X and Y, re-splitting into training and test sets, and retraining the predictive model to assess the impact of outlier removal on accuracy and performance.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Let's go back a step and, instead of looking at X, look at car sales, our overall data. That has all of our values, and it's still at 153 rows. This is before we split it off into X and Y, because we want to keep only the rows where engine size and price in thousands are closer to the norm.
Let's remove those outliers. We're at 153 rows. What we'll say is car sales equals car sales where the price in thousands column is less than or equal to 80.
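A minimal sketch of that price filter in pandas follows. The column name `Price_in_thousands` and every value in the small table are assumptions made up for illustration; they are not the course's actual data.

```python
import pandas as pd

# Hypothetical stand-in for the car sales data (made-up values and column names).
car_sales = pd.DataFrame({
    "Price_in_thousands": [21.5, 28.4, 85.5, 42.0, 82.6, 80.0],
    "Engine_size": [1.8, 3.2, 5.0, 3.5, 5.5, 3.0],
})

rows_before = len(car_sales)

# Keep only the rows priced at or below 80 (thousand); reassign to the same name.
car_sales = car_sales[car_sales["Price_in_thousands"] <= 80]

# Compare row counts to see how many outliers the filter dropped.
rows_removed = rows_before - len(car_sales)
print(rows_before, len(car_sales), rows_removed)
```

Because the comparison is `<=`, a row priced at exactly 80 stays in the data. Rows with a missing price would be dropped as well, since a comparison against a missing value evaluates to false.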
All right, that cut out two rows, two outliers where the price was greater than 80. Let's do one more. This may remove the same outliers.
We might not actually see fewer rows. Let's see. Let's also filter car sales down to the rows where the engine size column is less than or equal to seven.
Yeah, that removed one more row. You can see down here, it changed from 151 to 150. All right, so we've removed a couple of outliers.
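The two filters applied in sequence might look like the sketch below. It also illustrates why the second filter may or may not shrink the data further: a row that fails both conditions is already gone after the first one. Column names and values are again hypothetical.

```python
import pandas as pd

# Made-up rows: the third fails both filters, the fifth fails only the engine filter.
car_sales = pd.DataFrame({
    "Price_in_thousands": [21.5, 28.4, 85.5, 42.0, 69.7],
    "Engine_size": [1.8, 3.2, 8.0, 3.5, 8.0],
})

# First filter: price at or below 80 (thousand).
car_sales = car_sales[car_sales["Price_in_thousands"] <= 80]
rows_after_price = len(car_sales)

# Second filter: engine size at or below 7.
car_sales = car_sales[car_sales["Engine_size"] <= 7]
rows_after_engine = len(car_sales)

# The row with price 85.5 and engine size 8.0 was only removed once.
print(rows_after_price, rows_after_engine)
```

Checking `len(car_sales)` or `car_sales.shape` after each step is a quick way to confirm how many rows each condition actually removed.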
Now our next step is to take that and redeclare our X and Y and our train and test, and then retrain our model, train a new model, and see how it compares. Let's take a look.
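As a rough sketch of that next step, the code below redeclares X and y from an already filtered table, re-splits, and fits a fresh model. The transcript does not name the model, the target column, or the split settings, so `LinearRegression`, the `Sales_in_thousands` target, `test_size=0.2`, and all of the data values here are assumptions chosen only for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical filtered data (made-up values; no prices above 80, no engines above 7).
car_sales = pd.DataFrame({
    "Engine_size": [1.8, 3.2, 3.5, 2.0, 2.5, 4.6, 3.0, 2.4, 5.7, 3.8],
    "Price_in_thousands": [21.5, 28.4, 42.0, 18.9, 22.0, 39.1, 26.5, 19.5, 45.7, 33.0],
    "Sales_in_thousands": [16.9, 39.4, 8.6, 91.6, 63.7, 14.1, 27.9, 70.2, 6.5, 20.4],
})

# Redeclare X and y from the filtered table (assumed feature and target columns).
X = car_sales[["Engine_size", "Price_in_thousands"]]
y = car_sales["Sales_in_thousands"]

# Re-split so the train and test sets no longer contain the removed rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a new model on the cleaned training data and score it on the test set.
model = LinearRegression()
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)
print(f"R^2 on the test set: {r2:.3f}")
```

Keeping `random_state` the same as in the earlier run makes the split repeatable, though the rows in each split will still differ once outliers are removed, so the before and after scores are only a rough comparison.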