Understand how to accurately calculate standard deviation using NumPy by distinguishing between sample and population data. Learn why adjusting degrees of freedom is crucial for meaningful statistical analysis.
Key Insights
- Use NumPy's np.std function to calculate standard deviation, and set its ddof (delta degrees of freedom) argument deliberately: 0 for population data, which is the default, and 1 for sample data.
- Recognize that when working with sample data, such as a subset of temperature or height measurements, setting ddof to 1 gives a less biased estimate of the population's standard deviation, resulting in slightly higher values (e.g., 14.4 instead of 14).
- Understand that the adjustment matters even more for variance: sample variance divides by n - 1 rather than n, so the proportional gap between sample and population values is larger than it is for standard deviation.
Let's take a look at how we can calculate standard deviation using NumPy. If we call NumPy's built-in np.std function and pass it a list, it will give us the standard deviation. Now, there is a big thing to consider here: the standard deviation of the degrees list is 14 degrees only if we're considering those readings to be the entire population of temperatures.
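As a minimal sketch of that first call, the snippet below passes a plain list to `np.std`. The readings are hypothetical values chosen so the arithmetic is easy to check by hand; they are not the lesson's degrees list, so the result is not 14.

```python
import numpy as np

# Hypothetical temperature readings, NOT the lesson's actual degrees list.
degrees = [2, 4, 4, 4, 5, 5, 7, 9]

# With no ddof argument, np.std divides by n, treating the list
# as the entire population.
population_std = np.std(degrees)
print(population_std)  # 2.0
```

Because the divisor here is n, this is only the right number if the list really is the whole population.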
In other words, every temperature that has ever been measured, which for temperatures doesn't really make sense. Any list of temperature readings is going to be a sample. With other kinds of data, though, we might look at the entire population and the standard deviation of that.
When we're looking at something like height, we can ask about the entire population: not just the height of a random thousand men in America, but the height of all men in America. If we're looking at a sample, as we are in this case, a small subset of the overall population, we need to adjust for that to get a better estimate of the true standard deviation. To do that, we use a concept called degrees of freedom, which we won't explain in full here.
You can look into that more if you want to dive into what exactly the offset is and where it comes from. The distinction is between population variance, the variance in the whole population, and the variance within one particular sample of it. In NumPy, degrees of freedom (the ddof argument) is set to zero by default. That gives the population value.
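For anyone curious what the offset does mechanically, here is a rough sketch that computes standard deviation by hand. The helper name `std_with_ddof` and the data are made up for illustration; the only point is that the sum of squared deviations is divided by n minus the offset rather than by n.

```python
import math
import numpy as np

def std_with_ddof(values, ddof=0):
    """Hypothetical helper: standard deviation with divisor n - ddof."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    # ddof=0 divides by n (population); ddof=1 divides by n - 1 (sample).
    return math.sqrt(squared_deviations / (n - ddof))

# Hypothetical readings, not the lesson's actual degrees list.
degrees = [2, 4, 4, 4, 5, 5, 7, 9]

# The hand-rolled version should agree with NumPy for both settings.
print(math.isclose(std_with_ddof(degrees, 0), np.std(degrees, ddof=0)))
print(math.isclose(std_with_ddof(degrees, 1), np.std(degrees, ddof=1)))
```

A smaller divisor gives a larger result, which is why the sample figure always comes out a little higher than the population figure.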
If we set it to one, that's for a sample. If we do that, we get a slightly higher number, 14.4. Now, this will matter particularly when we're looking at variance, where the gap between the sample value and the population value is proportionally even larger. But this is a key thing.
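The comparison can be sketched side by side with `np.std` and `np.var`, both of which accept a `ddof` argument. The list is again a hypothetical stand-in for the lesson's data, so the numbers differ from 14 and 14.4, but the pattern is the same.

```python
import numpy as np

# Hypothetical readings, not the lesson's actual degrees list.
degrees = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(degrees)

pop_std = np.std(degrees, ddof=0)     # population: divide by n
sample_std = np.std(degrees, ddof=1)  # sample: divide by n - 1

pop_var = np.var(degrees, ddof=0)
sample_var = np.var(degrees, ddof=1)

print("std:", pop_std, sample_std)
print("var:", pop_var, sample_var)

# Variance grows by a factor of n / (n - 1); standard deviation grows
# only by the square root of that factor.
print("var ratio:", sample_var / pop_var)
print("std ratio:", sample_std / pop_std)
```

With eight values, the variance ratio is 8/7 while the standard deviation ratio is the square root of 8/7, so the adjustment shows up more strongly in the variance. The gap shrinks as the sample gets larger.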
We can use NumPy to do this. We just always need to be mindful of degrees of freedom: are we looking at a sample of the population or the entire population as a whole?