Correlation Matrices in Data Analysis with Pandas

Remove "sales in thousands" due to its weak correlation with other variables.

Uncover relationships within your data using correlation matrices, a powerful analytical tool provided by pandas. Learn how identifying correlations can refine your analysis and guide strategic decisions.

Key Insights

  • A correlation matrix is a pandas-generated tool that quantifies the relationship between variables on a scale from -1 (perfect negative correlation) to 1 (perfect positive correlation), helping analysts easily identify variable relationships.
  • Horsepower strongly correlates with both price and engine size, indicating that higher horsepower typically means larger engine size and higher price; conversely, fuel efficiency negatively correlates with price and engine size, indicating that more fuel-efficient vehicles tend to have smaller engines and lower prices.
  • Despite initial expectations, data analysis reveals that "sales in thousands" shows minimal correlation with price, horsepower, and engine size, suggesting this variable may not be valuable for predictive modeling.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

We're looking at this data and we've made some decisions as to what we think is important, but let's do a little data analysis. Let's figure out what is correlated with what. Now, we can check that with a correlation matrix.

A correlation matrix is a way to tell which values in a table, in a pandas data frame, are correlated with which other values. It gives you back a new data frame where each column is compared to every other column. If it says 1.0, it's perfectly correlated.

That's because it's comparing it to itself, right? Like the more the price goes up, the more the price goes up. Like, yeah, obviously that's the relationship one-to-one because it's the same thing. It's looking in a mirror.

The other values will range from one, meaning this other value actually is perfectly correlated, to negative one, meaning it's perfectly correlated, but in the opposite direction. So every time price goes up, sales go down by a proportional amount, right? But most of them will be in between.

If it's zero, then it's not correlated one way or the other. These two variables have nothing to do with each other, at least not in a straight-line sense. And the closer a value gets to one or to negative one, the more correlated they are.
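To make the scale concrete, here is a minimal sketch using made-up toy columns (none of these names or numbers come from the car sales data). It assumes pandas' default correlation method, Pearson, which measures straight-line relationships.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["double_x"] = 2 * toy["x"]        # rises in a straight line as x rises
toy["falling"] = 10 - 3 * toy["x"]    # falls in a straight line as x rises
toy["no_trend"] = [2, -1, 0, -1, 2]   # symmetric values, no straight-line trend with x

# Each column is compared to every other column (Pearson by default).
toy_corr = toy.corr()
print(toy_corr.round(2))
```

In the `x` row, `double_x` comes out at 1, `falling` at -1, and `no_trend` at 0. Note that `falling` drops by three for every one-step rise in `x`, so a value of -1 means a consistent opposite-direction line, not a one-for-one trade.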

It doesn't matter what direction they're correlated in. What matters is, is there a pattern here? All right, so this is a great way to see relationships, and it's something given to us right by pandas, because pandas is wonderful. All right, so we're going to say carSales.corr(). Give me a correlation matrix.
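As a rough sketch of that call, the snippet below builds a small stand-in for the carSales data frame. The column names and every number in it are assumptions made up for illustration, so the output will not match the matrix discussed in the lesson.

```python
import pandas as pd

# Made-up rows standing in for the course's car sales table.
# Column names are assumed, not taken from the real dataset.
carSales = pd.DataFrame({
    "Model": ["A", "B", "C", "D", "E", "F"],
    "Sales_in_thousands": [40.0, 12.5, 90.3, 8.1, 25.0, 60.2],
    "Price_in_thousands": [21.5, 42.0, 18.9, 55.0, 30.0, 24.0],
    "Engine_size": [1.8, 3.5, 2.0, 4.6, 3.0, 2.4],
    "Horsepower": [140, 225, 130, 300, 200, 155],
    "Fuel_efficiency": [31, 22, 33, 18, 24, 28],
})

# Keep only the numeric columns so the text column is left out
# of the comparison regardless of pandas version.
numeric = carSales.select_dtypes(include="number")

# One row and one column per numeric variable.
corr_matrix = numeric.corr()
print(corr_matrix.round(2))
```

The result is itself a data frame, so a single pair can be looked up with something like `corr_matrix.loc["Horsepower", "Engine_size"]`.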

Let's take a look at that. Okay, now this is a lot of numbers. We'll look at a way to visualize this in a moment.

But you can see, running down the diagonal here, all these 1.0s, right? Again, that's comparing horsepower to horsepower. They are perfectly correlated. But the other ones are more or less correlated.

Horsepower doesn't seem to affect how much overall sales you have from selling that car. If we look at horsepower, and we go over to engine size, they're pretty correlated, which makes sense. The bigger the engine, the more horsepower it's going to have.

It's also pretty well correlated, though, with the price. If you look at horsepower, it explains a lot of price. Surprisingly, fuel efficiency could be useful, but in the opposite direction.

The less fuel efficient something is, the more it costs: the higher the price in thousands. Engine size is negatively correlated too: the more fuel efficient something is, the smaller the engine.

And again, when these numbers approach one, that means there's a high correlation there. What about some of these ones that are very highly correlated, right? We can definitely see that horsepower is very related to the price, right? So that seems like it could be important. Again, this is one of those things where our domain knowledge has paid off to some degree.

Price in thousands, for example, that's something we figured would be related to horsepower, and it is. But price in thousands is less correlated with engine size. Horsepower seems to matter more.

Fuel efficiency, in the other direction, does seem to predict price in thousands, but again, less than the other ones. And sales in thousands is the weakest predictor of these. So looking at this data analysis, we might think sales in thousands doesn't really belong here.
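One way to do this comparison without reading the whole grid is to pull out the price column of the matrix and sort it by strength, ignoring sign. This is a sketch under assumptions: the column names and values are invented, so the ordering it prints is not the ordering from the lesson's data.

```python
import pandas as pd

# Made-up numbers with assumed column names, for illustration only.
cars = pd.DataFrame({
    "Sales_in_thousands": [40.0, 12.5, 90.3, 8.1, 25.0, 60.2],
    "Price_in_thousands": [21.5, 42.0, 18.9, 55.0, 30.0, 24.0],
    "Engine_size": [1.8, 3.5, 2.0, 4.6, 3.0, 2.4],
    "Horsepower": [140, 225, 130, 300, 200, 155],
    "Fuel_efficiency": [31, 22, 33, 18, 24, 28],
})

corr_matrix = cars.corr()

# Correlation of every other column with price (drop price vs. itself).
price_corr = corr_matrix["Price_in_thousands"].drop("Price_in_thousands")

# Order by absolute value so strong negative correlations rank
# alongside strong positive ones, but keep the original signs.
order = price_corr.abs().sort_values(ascending=False).index
ranked = price_corr.reindex(order)
print(ranked.round(2))
```

Sorting on the absolute value reflects the point that direction does not matter for strength: a value near -1 is as informative as a value near 1.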

And also, honestly, the domain knowledge makes sense there, right? Because this is actually a very complex relationship. If something costs a lot, does it mean that people are gonna spend more money on it overall? This is the price of one of them, price in thousands, and this is how much money the company is making from that model, right?

So maybe at a certain point, the price goes up too high and people just aren't buying them. And maybe if the price goes too low, the sales also go down because you're just not making that much money off of any one car, right? So it's not very well correlated, and it's like, oh, that makes sense.

But if our domain knowledge had told us that sales in thousands and price in thousands seem like they'd be very related, we can look at the actual data analysis here, and it's like, no, these things are not very highly correlated.

So I think what we're gonna do now is take a look at this another way, and I think we're gonna end up, based on this data analysis, saying, hey, sales in thousands is not gonna make the team. It's not gonna be an important thing to train our model on.
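If the column does get cut, the step itself is short. The sketch below assumes a column named `Sales_in_thousands` and uses invented values; the real data frame and column names may differ.

```python
import pandas as pd

# Made-up numbers with assumed column names, for illustration only.
cars = pd.DataFrame({
    "Sales_in_thousands": [40.0, 12.5, 90.3, 8.1],
    "Price_in_thousands": [21.5, 42.0, 18.9, 55.0],
    "Engine_size": [1.8, 3.5, 2.0, 4.6],
    "Horsepower": [140, 225, 130, 300],
    "Fuel_efficiency": [31, 22, 33, 18],
})

# drop() returns a new data frame; the original is left unchanged.
features = cars.drop(columns=["Sales_in_thousands"])
print(features.columns.tolist())
```

Keeping the original data frame intact makes it easy to bring the column back later if another look at the data suggests it is useful after all.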

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.
