Refining Data: Removing Outliers for Improved Model Training

Remove outliers from car sales data and retrain the model.

Refine predictive models by effectively removing outliers from datasets, focusing on automotive sales data. Learn to assess improvements in model accuracy through strategic data preprocessing techniques.

Key Insights

  • Removing price outliers above $80,000 from the dataset eliminated two rows, improving the quality of the car sales data.
  • Filtering out vehicles with engine sizes greater than seven reduced the dataset by one more entry, resulting in an optimized set of 150 rows.
  • The process outlined involves redefining variables X and Y, re-splitting into training and test sets, and retraining the predictive model to assess the impact of outlier removal on accuracy and performance.

Let's go back a step. Instead of looking at X, let's look at car sales, our overall dataset. That has all of our values and, again, still has 153 rows. This is before we split it off into X and Y, because we want to keep only the rows where engine size is closer to the norm and price in thousands is closer to the norm.

Let's remove those outliers. We're at 153 rows. What we'll say is: car sales equals car sales where the price in thousands column is less than or equal to 80.
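As a rough sketch, the filter described above might look like this in pandas. The tiny DataFrame is made-up stand-in data, and the column names `Price_in_thousands` and `Engine_size` are assumptions about how the real dataset is labeled.

```python
import pandas as pd

# Made-up stand-in for the car sales data; column names are assumed.
car_sales = pd.DataFrame({
    "Price_in_thousands": [21.5, 28.4, 85.5, 42.0, 82.6],
    "Engine_size": [1.8, 3.2, 5.0, 3.5, 8.0],
})

rows_before = len(car_sales)

# Keep only rows whose price is at or below 80 (thousand dollars).
# The boolean mask selects the rows to keep; everything else is dropped.
car_sales = car_sales[car_sales["Price_in_thousands"] <= 80]

rows_removed = rows_before - len(car_sales)
print(f"Rows removed by the price filter: {rows_removed}")
```

Reassigning the result back to `car_sales` is what makes the change stick; the filter on its own returns a new DataFrame and leaves the original untouched.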

All right, that cut out two rows, two outliers where the price was greater than 80. Let's do one more. This may remove the same outliers.

We might not actually see fewer rows. Let's see. Let's also keep only the car sales rows where the engine size column is less than or equal to seven.
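Here is a minimal sketch of why the second filter may or may not shrink the data further. The rows are invented for illustration and the column names are assumed: one large-engine car is already gone after the price filter, while another is only caught by the engine size filter.

```python
import pandas as pd

# Made-up rows; column names are assumed. Two cars have an engine size of 8.0:
# one is also over the price limit, the other is not.
car_sales = pd.DataFrame({
    "Price_in_thousands": [21.5, 85.5, 42.0, 82.6, 69.7],
    "Engine_size": [1.8, 5.0, 3.5, 8.0, 8.0],
})

# First filter: price at or below 80.
car_sales = car_sales[car_sales["Price_in_thousands"] <= 80]
rows_after_price = len(car_sales)

# Second filter: engine size at or below 7.
car_sales = car_sales[car_sales["Engine_size"] <= 7]
rows_after_engine = len(car_sales)

print(f"After price filter: {rows_after_price} rows")
print(f"After engine size filter: {rows_after_engine} rows")
```

Checking the row count after each step, as the lesson does, is a simple way to see how many rows each condition actually removed.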

Yeah, that removed one more row. You can see down here that the row count changed from 151 to 150. All right, so we've removed three outliers in total.

Now our next step is to take that filtered data, redeclare our X and Y and our train and test sets, and then train a new model and see how it compares. Let's take a look.
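The steps just described could be sketched roughly as follows. This section does not say which columns make up X and Y or which model is used, so the feature columns, the target, the scikit-learn `LinearRegression` model, and the randomly generated data below are all assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the real car sales dataset).
rng = np.random.default_rng(0)
n = 60
engine = rng.uniform(1.5, 6.0, size=n)
car_sales = pd.DataFrame({
    "Engine_size": engine,
    "Horsepower": 40 * engine + rng.normal(0, 10, size=n),
    "Price_in_thousands": 5 + 6 * engine + rng.normal(0, 2, size=n),
})

# Apply both outlier filters to the full DataFrame before splitting.
car_sales = car_sales[
    (car_sales["Price_in_thousands"] <= 80) & (car_sales["Engine_size"] <= 7)
]

# Redeclare X and y from the filtered data (assumed features and target).
X = car_sales[["Engine_size", "Horsepower"]]
y = car_sales["Price_in_thousands"]

# Re-split into training and test sets; random_state keeps the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a new model on the filtered data and score it on the held-out rows.
model = LinearRegression()
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)
print(f"R^2 on the test set: {r2:.3f}")
```

The important detail is the order: X and y have to be rebuilt from the filtered `car_sales`, otherwise the old variables would still contain the outlier rows. Comparing this score with the score of the earlier model is what shows whether removing the outliers helped.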

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.
