Histograms: Understanding Uniform and Normal Distributions

Gain clarity on how histograms visualize data distributions, and understand the difference between uniform and normal distributions. Learn how sample size significantly influences the accuracy of histogram visualizations using practical Python examples.

Key Insights

Histograms visualize frequency distributions by placing data points into ranges or "bins." The article illustrates creating histograms using Python's NumPy and Matplotlib libraries.
Analyzing two datasets of random numbers (one with 500 samples and another with 100,000 samples) demonstrates how increasing sample size results in a smoother, more uniform distribution.
The article explains uniform distribution, where values are equally probable across the range, contrasting it with normal distribution, which clusters around a central value.

Briefly, before you move on, it's a good idea to comment out the code we just wrote. If you run this Python notebook again, you'll be annoyed if it keeps pausing in the middle to prompt you for percentile ages. So comment it out for now, just for your own sanity.

Now let's talk about histograms. Histograms show frequency distribution: how many items fall within a specific range? We give it some ranges, and we see how many items are in each.

This is going to start to look very familiar to you in a little bit. Each range is called a bin, and it's usually graphed as a column. A uniform distribution is one where every value in the range is equally likely, so random samples end up spread more or less evenly across it.

It's different from a normal distribution, and we'll see that as we go. So first, in order to graph one, let's make some uniform random values. Fortunately, there is a NumPy method for exactly this.

We'll use NumPy's random module, which has a bunch of different methods on it. We have to be more specific about which kind of random method we're running: it's the uniform distribution method. And what we pass it is the range.

I want values between zero and 10, and how many do we want? The answer is 500. Now these are not going to be just integers; they're going to be floats, decimal numbers between zero and 10. Let's save that into a variable.

Random 500 seems like a fine name: it's 500 random numbers with, again, a uniform distribution, meaning no number should be more likely than any other. So let's print out the length of that.
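As a minimal sketch of the step just described, assuming NumPy is imported under the usual alias np and that the variable is spelled random_500 (the narration only says "random 500", so the exact name is a guess):

```python
import numpy as np

# Draw 500 floats uniformly from the half-open interval [0, 10).
# The variable name random_500 is an assumption based on the narration.
random_500 = np.random.uniform(0, 10, 500)

print(len(random_500))  # 500
```

The values themselves change on every run, because no random seed is set here.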

No, not the length of 500, that's not what we're looking for. We want the length of random 500. There are, indeed, 500 things in there. Let's keep that print statement.

And let's print out a slice of 30 of them. We'll take random 500 and use Python's slice syntax to get, say, the first 30.

And here they are. And you can see they're fairly randomly distributed. There are low numbers, high numbers, and middle numbers.

If we get another slice, say from negative 30 on, which is the end of the list, we see a very similar pattern. All right, let's graph it. We've looked at our random numbers, and there are 500 of them.
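The length check and the two slices might look something like this. It's a sketch, with random_500 as a hypothetical variable name and the values regenerated here so the block runs on its own:

```python
import numpy as np

# Hypothetical variable name; 500 uniform floats in [0, 10).
random_500 = np.random.uniform(0, 10, 500)

first_30 = random_500[:30]   # the first 30 values
last_30 = random_500[-30:]   # the last 30 values, counted back from the end

print(len(random_500))
print(first_30)
print(last_30)
```

A negative start index counts from the end of the array, so [-30:] means "from the 30th-to-last element onward".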

Let's make a histogram of it. Now, a histogram, again, is going to put them all in little bins. And we've already imported pyplot as plt.

It has a histogram method on it, hist. We pass it an x value, meaning the data for the plot, which is our 500 random numbers. And we also say how many bins there should be. We'll distribute them into 10 equal-width bins, meaning roughly 0 to 1, 1 to 2, 2 to 3, and so on, up to 9 to 10.

And then, if you haven't worked with pyplot before, this part is fun: we can just say, pyplot, go show us this plot. And here's what we get. They're fairly evenly distributed.
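Here is a rough sketch of that plotting step. The variable name random_500 is assumed, and the range=(0, 10) argument is an addition that the narration doesn't mention. It pins the bin edges to exactly 0, 1, 2, and so on. With bins=10 alone, Matplotlib spreads the 10 bins between the smallest and largest sampled values, which is close to 0 and 10 but not exact.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical variable name; 500 uniform floats in [0, 10).
random_500 = np.random.uniform(0, 10, 500)

# bins=10 asks for 10 equal-width bins.
# range=(0, 10) is an assumption added here so the edges land on whole numbers.
counts, bin_edges, _ = plt.hist(x=random_500, bins=10, range=(0, 10))

plt.show()
```

plt.hist also returns the per-bin counts and the bin edges, which is handy if you want the numbers rather than estimates read off the chart.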

Looks like there are fewer in the 2 to 3 range than in, say, the 7 to 8 range. Each of these columns is a bin, and its height says how many values landed in it. So of the 500, a little fewer than 40 were in the 2 to 3 range, whereas for 7 to 8 there were over 60. So it's definitely not perfectly even, because again, it's still random, but it's not trying to cluster.
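If you'd rather read the bin counts as numbers instead of eyeballing column heights, one option is NumPy's own histogram function, which does the counting without drawing anything. This is a sketch and not part of the lesson's notebook. The variable name and the range=(0, 10) argument are assumptions, and your counts will differ from the ones quoted above because the data is random.

```python
import numpy as np

# Hypothetical variable name; 500 uniform floats in [0, 10).
random_500 = np.random.uniform(0, 10, 500)

# Count how many values fall into each of 10 bins with edges 0, 1, ..., 10.
counts, bin_edges = np.histogram(random_500, bins=10, range=(0, 10))

for low, high, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{low:.0f} to {high:.0f}: {count}")

# With a perfectly even split, every bin would hold the same number of values.
print("even split per bin:", len(random_500) / len(counts))
```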

It's not clustered the way something like age would be, where we would all be around the same age, roughly. There's big clustering by age: very few people are over 100, for example. Height is an even better example.

Now, that unevenness is due to there only being 500 values. Humans think 500 is a big number, but as a sample size it's not very many. To see the underlying uniform distribution more clearly, let's take a look at a higher number.

Let's do 100,000. Is that 20 times as much? 2,000 times as much? It's 200 times as much. Formulas are good; arithmetic, not so much. Let's make a random 100k variable.
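The ratio between the two sample sizes is quick to confirm in Python (the variable names here are just for illustration):

```python
small_sample = 500
large_sample = 100_000

# How many times larger is the second sample?
ratio = large_sample // small_sample
print(ratio)  # 200
```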

It will be the same kind of calculation. We're still going to get floats between zero and 10, but let's get 100,000 of them.

And let's make a histogram of that data with 10 bins, and then show it. All right, so given a larger sample size, we see much less variation among the bins.
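A sketch of the larger run, under the same assumptions as for the smaller one: the variable name random_100k is a guess at the spelling, and range=(0, 10) is added so the bin edges fall on whole numbers.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical variable name; 100,000 uniform floats in [0, 10).
random_100k = np.random.uniform(0, 10, 100_000)

# Same 10 bins as before, just with 200 times as many values.
counts, bin_edges, _ = plt.hist(x=random_100k, bins=10, range=(0, 10))

plt.show()
```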

They're all right around 10,000. Just eyeballing it, this one might be 9,800 or 9,900, something like that, and this one is probably a little over 10,000. Again, just eyeballing it. That's much less variation, proportionately, than we saw with 500 samples.
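To put a number on "proportionately", one rough measure is the gap between the fullest and emptiest bin, divided by the average bin count. The helper relative_spread below is hypothetical, written just for this illustration, and the fixed seed is an assumption added so that reruns print the same figures.

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed so reruns give the same numbers


def relative_spread(values, bins=10, value_range=(0, 10)):
    """Return (largest bin count - smallest bin count) / average bin count."""
    counts, _ = np.histogram(values, bins=bins, range=value_range)
    return (counts.max() - counts.min()) / counts.mean()


spread_500 = relative_spread(np.random.uniform(0, 10, 500))
spread_100k = relative_spread(np.random.uniform(0, 10, 100_000))

print(f"500 samples:     {spread_500:.1%}")
print(f"100,000 samples: {spread_100k:.1%}")
```

The second figure should come out far smaller than the first, which is the same thing the two charts show visually.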

The larger the sample size, the more it smooths out toward an even distribution. And we could get an even more even distribution if we just keep bumping that up. Let's take a look next at normal distributions.
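To see what "keep bumping that up" does, here is a small experiment that repeats the count at several sample sizes. The sizes, the fixed seed, and the choice of measure (standard deviation of the bin counts divided by their mean) are all assumptions made for this sketch.

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed so reruns give the same numbers

sample_sizes = [500, 5_000, 50_000, 500_000]  # illustrative choices
spreads = []

for n in sample_sizes:
    values = np.random.uniform(0, 10, n)
    counts, _ = np.histogram(values, bins=10, range=(0, 10))
    # Typical bin-to-bin variation, as a fraction of the average bin count.
    spread = counts.std() / counts.mean()
    spreads.append(spread)
    print(f"{n:>7,} samples: bin counts vary by about {spread:.2%} of the average")
```

As a rule of thumb from statistics, this kind of relative variation shrinks roughly in proportion to one over the square root of the sample size, so 100 times as many samples buys about 10 times less wobble.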


Colin Jaffe
