Understand the importance of data normalization and its practical application in machine learning. Learn how normalization helps standardize data sets, enhancing model accuracy and efficiency in Python.
Key Insights
- Normalize data to standardize different features onto a comparable scale, such as transforming age ranges (20–80) to a scale of 0 to 1 and debt amounts (0–200,000) to a similar scale for effective comparison in predictions.
- Implement pixel normalization in image data by converting pixel values from their original range (0–255) to a normalized range between 0 and 1, enhancing how machine learning models interpret and process image features.
- Utilize Python and NumPy's powerful vectorized operations to efficiently normalize large datasets, exemplified by the quick handling of 47 million numbers across 60,000 training images with 784 pixel values each.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Let's talk about normalizing data. Now, we previously did this with our StandardScaler object and its fit_transform method. What that did is put each feature on a standard-deviation scale and set the mean to be zero.
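As a refresher, here's roughly what that looked like; a minimal sketch using scikit-learn's StandardScaler, with a small made-up feature array (the variable names and values here are just for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: age and debt.
X = np.array([[25, 10_000],
              [50, 50_000],
              [70, 180_000]], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0, std 1

print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]
```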
That's useful when we're comparing values across the board; in other words, different values that live on very different scales. A good example of that is: hey, how likely is someone to repay a loan, given their age and how much debt they have? Now, if Bob is 50 and he's $50,000 in debt, then a network might say, well, this 50 doesn't seem like it's very important. It's a pretty small number. And the debt, on the other hand, looks like it's a thousand times more important. So we normalize the data.
If our sample population runs from 20 to 80 years old and Bob is 50, then his age, right in the middle, normalizes to 0.5 on a zero-to-one scale. And if debts in our population run from zero to 200,000 and Bob owes 50,000, he's a quarter of the way there (good luck, Bob), and his debt normalizes to 0.25. This is a slightly different scale and a slightly different idea, but not much different from the StandardScaler system of mean zero and unit standard deviation.
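That min-max idea is easy to express directly; here's a short sketch using Bob's numbers from the example above (the helper name min_max_normalize is just for illustration):

```python
def min_max_normalize(x, lo, hi):
    """Map a value from the range [lo, hi] onto [0, 1]."""
    return (x - lo) / (hi - lo)

# Bob's numbers from the example above.
print(min_max_normalize(50, 20, 80))          # age:  (50 - 20) / 60 = 0.5
print(min_max_normalize(50_000, 0, 200_000))  # debt: 50000 / 200000 = 0.25
```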
Now, here we actually have values that tend to repeat at particular levels. In other words, a lot of the pixels around the outside of each image are going to be zeros, black, because people generally draw the digits in the middle. So some of those pixels will be less important than others, but the pixels in the middle, where the digits are, are going to be pretty much equally important. We don't want the system to think, "this pixel always seems to have a high number, so that pixel must be important."
So instead, we're gonna keep black pixels at zero, as they are now, but instead of white pixels being 255, we're gonna make white pixels one. And anything in between, like a 122, would normalize to about 0.478, right? That's 122 divided by 255, the maximum pixel value. Dividing gives us a decimal number, a float between zero and one. We won't have to use StandardScaler. We'll just do it on our own. And yeah, it's pretty easy to do.
Here's our code for doing it. This is a fairly standard vectorized operation in Python. We're gonna take our training images, and when we say divide an array by 255, we're saying divide every element by 255 individually. Perform that math operation on every single item in the array. This is part of what makes Python, and particularly NumPy, very powerful. It's able to do this math even though it's dividing 47 million numbers: all the images in the training set, 60,000 training images, each one with 784 pixel values in it.
It's able to do this very quickly. Then we're gonna print out an arbitrary image, index 12,345 (one, two, three, four, five), and we'll take a certain row of its pixels, and even just certain pixels of that row. You'll see that they should all be between zero and one now. And again, blink and you miss it.
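In code, that's a single line of NumPy. Here's a minimal runnable sketch, assuming the training set lives in an array called train_images; the random data below just stands in for the real MNIST images so the snippet runs on its own:

```python
import numpy as np

# Stand-in for the real MNIST training set: 60,000 images of 784 pixels,
# stored as uint8 values from 0 to 255.
train_images = np.random.randint(0, 256, size=(60000, 784), dtype=np.uint8)

# One vectorized operation: divide all ~47 million values by 255 at once.
train_images = train_images / 255.0

# Spot-check an arbitrary image and a slice of its pixels.
print(train_images[12345][200:210])  # every value now falls in [0.0, 1.0]
```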
It divides 47 million numbers by 255, and it just did it. Wow, computers. All right, so here they all are.
You can see that the zeros are still zeros. And the values close to 1.0 were almost 255 originally; some of those might actually have been exact 255s, that's my guess, since we're dividing by 255. Or, yeah, I guess they were a little under 255. Either way, it's taking them all and scaling them down. So this one was a pretty low number originally, I don't know, maybe a 20, and this one's maybe a 50. I can't do that math in my head, nor fortunately do I have to. But we are getting the same numbers scaled down to between zero and one. And we'll do the same thing for the testing data as well.
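And that's the same one-liner again; a sketch under the same assumptions, with test_images standing in for the real test set (the usual MNIST split is 10,000 test images):

```python
import numpy as np

# Stand-in for the real MNIST test set: 10,000 images of 784 pixels.
test_images = np.random.randint(0, 256, size=(10000, 784), dtype=np.uint8)

# Same vectorized division as for the training set.
test_images = test_images / 255.0

print(test_images.min(), test_images.max())  # both within [0.0, 1.0]
```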
And run that. Now we have normalized versions of the testing images and training images. Let's talk about defining our model next.