Understanding the MNIST Dataset in TensorFlow

Gain clarity on handling complex data structures using TensorFlow's built-in MNIST dataset. Learn how to inspect and work with the nested tuples that hold the training and testing data for neural networks.

Key Insights

The MNIST dataset provided by TensorFlow's Keras library is structured as a tuple containing two tuples, one for training data with 60,000 images and labels, and another for testing data with 10,000 images and labels.
Each image in the dataset is represented as a 28-by-28 matrix, with corresponding labels indicating the numerical digit (0–9) they represent.
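The dataset is returned as tuples rather than lists, so its outer structure is immutable: items cannot be appended or reassigned by index. The image and label arrays inside those tuples are NumPy arrays, which remain editable, so the tuples protect the structure rather than every individual value.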


Let's take a look at our data and understand its shape. The MNIST dataset is not in a CSV file like a lot of the ones we did before. It's a famous dataset built into TensorFlow.

Every dataset module in Keras has a load_data() method for loading that particular dataset. We're also going to take a look at the shape of the data. Because it's a somewhat complex shape, we have to understand it to work with it.

So let's grab the module, which we're going to call the digits module. It comes from TensorFlow's Keras library, under datasets, of which it's got quite a lot.

If we look at these, there are quite a few pretty well-known datasets here. We're going to be working with the MNIST one.

And if we check the type of this digits module we just made, we'll see its type is module. Which just means this is a library. This is a package within TensorFlow.


It has a load_data() method. Let's grab the digits module, run its load_data() method, and save the result as digits data.

Okay. Let's check the data type of it.

This is the data, not the module itself. It is a tuple. What is a tuple? It's like a list, but immutable.
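As a minimal sketch of the steps so far, assuming TensorFlow is installed and the machine can download the dataset on first use, the code might look like the following. The variable names digits_module and digits_data are guesses based on what is said aloud, not confirmed names from the lesson.

```python
import types

import tensorflow as tf

# Name the MNIST submodule of Keras's datasets package.
# (digits_module is an assumed name, taken from the spoken description.)
digits_module = tf.keras.datasets.mnist
print(type(digits_module))  # a module

# load_data() downloads MNIST on first use and caches it locally afterward.
# (digits_data is likewise an assumed name.)
digits_data = digits_module.load_data()
print(type(digits_data))  # a tuple
```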

Meaning we can't append to it or change something that's at an index. Which makes sense. This is data, pure data.

We wouldn't want to say, like, oh, that seven is now an eight. Right? There's no reason we would change this data. So they give it to us as an immutable tuple.
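To see what "immutable" means in practice, here is a small sketch that uses a stand-in tuple instead of the real dataset. The stand-in and its placeholder strings are made up for illustration; only the tuple behavior is the point.

```python
# A stand-in with the same nesting as the loaded data (placeholder values).
sample = (("train images", "train labels"), ("test images", "test labels"))

assignment_error = None
try:
    # Tuples do not support assigning to an index.
    sample[0] = "something else"
except TypeError as err:
    assignment_error = str(err)
    print("Cannot reassign:", err)

# Tuples also have no append method, unlike lists.
has_append = hasattr(sample, "append")
print("Has append:", has_append)
```

This applies to the tuple containers themselves. The image and label arrays that load_data() places inside them are NumPy arrays, and their individual values can still be assigned to.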

Let's take a look at that tuple's shape. So it's a tuple, great. How big a tuple? Let's get the length of digits data.

Ah, this is not the last thing, so let's print it. Two. So only two things.

Let's label that printout as the length of the overall tuple: two. And if there are two things, they're at index zero and one, just like with a regular list.

Let's go in the order of what it says there and print the type of digits data at index zero, just to see what that is. It's a tuple.

So we've got a tuple of tuples. There are two items in this tuple, and the first one is a tuple. And, to spoil the surprise, the second one is also a tuple.

Okay, what's the length of that first tuple? And what's the length of the second one? They both have two items in them. So we have a tuple of tuples, and each of those inner tuples has two things in it.
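The type and length checks above can be collected into one small helper. This is only a sketch: describe is a hypothetical function, and stand_in is a tiny made-up structure with the same nesting as the real data, so it runs without downloading anything. Passing the result of load_data() instead should report the same nesting.

```python
def describe(data):
    """Return (type name, length) for a container and for each item in it."""
    top = (type(data).__name__, len(data))
    items = [(type(item).__name__, len(item)) for item in data]
    return top, items


# Made-up stand-in: a tuple of two tuples, each holding two things.
stand_in = (([0, 0, 0], [7, 2, 1]), ([0, 0], [4, 9]))

top, items = describe(stand_in)
print(top)    # ('tuple', 2)
print(items)  # [('tuple', 2), ('tuple', 2)]
```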

All right. So what are these actual things? Let's print them out and see in this next step.

The first item in our tuple, digits data at index zero, is our training data. The second one, index one, is our testing data. They're broken up into exactly 60,000 and 10,000 items, which is roughly an 80-20 split.

The training share is a little more than 80%, about 86%, and the testing share is a little less than 20%, about 14%, but it's good.
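The split is easy to check with a little arithmetic. The counts come from the lesson; the variable names are just for illustration.

```python
train_count = 60_000
test_count = 10_000
total = train_count + test_count

# Share of all images that land in each split.
train_share = train_count / total
test_share = test_count / total

print(f"train: {train_share:.1%}")  # train: 85.7%
print(f"test: {test_share:.1%}")    # test: 14.3%
```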

But the training and testing data are each split up into two different parts, and I think you could guess why. The first part of the training data is our X train: 60,000 28-by-28 matrices. We'll talk about what those are.

The second part is 60,000 answers. These 60,000 are the digits that the images represent: one, two, five, eight, whatever.

Then the testing data has our X test, 10,000 28-by-28 arrays, and 10,000 answers. Okay.
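One way to confirm those four pieces is to index into the nested tuple and read each array's shape. This is a sketch that assumes TensorFlow is installed and the dataset can be downloaded; digits_data is an assumed variable name, and the shapes dictionary exists only to label the printout.

```python
import tensorflow as tf

# Assumed name for the result of load_data().
digits_data = tf.keras.datasets.mnist.load_data()

# The first index picks training (0) or testing (1);
# the second picks images (0) or labels (1).
shapes = {
    "training images": digits_data[0][0].shape,
    "training labels": digits_data[0][1].shape,
    "testing images": digits_data[1][0].shape,
    "testing labels": digits_data[1][1].shape,
}

for name, shape in shapes.items():
    print(name, shape)
# training images (60000, 28, 28)
# training labels (60000,)
# testing images (10000, 28, 28)
# testing labels (10000,)
```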

If you had trouble with all that data within data, great, no problem. You're going to get good, if you aren't already, at analyzing the shape of data. And in our next step, we're going to make some variables that'll make this a little easier to follow.

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.
