Train-Test Split for Predictive Modeling in Python

Effectively splitting your dataset into training and testing subsets is crucial in machine learning. Learn how to correctly divide your data using standard naming conventions and scikit-learn's train_test_split function.

Key Insights

Split data into inputs (features, labeled X) and a target (price, labeled Y), then further divide each into training and testing sets, typically using an 80/20 split.
Use the standard naming conventions (X_train, X_test, Y_train, and Y_test) to clearly indicate the purpose of each subset and ensure compatibility with common machine learning practices.
Apply the train_test_split function from scikit-learn to randomly shuffle and partition your dataset, unpacking the returned values in the correct order (X_train, X_test, Y_train, Y_test) so that features and targets stay aligned.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

We've split up our data into X, which is our inputs, our features, and Y, which is our price in thousands. Now that we've got those, we need to talk about training and testing. We have our features and our target, but we also want to split things up into our training and testing data.

We're gonna use X_train for the 80% of the data that the model is trained on, and Y_train for the answers it's aiming for, the targets. We'll also use X_test and Y_test for the remaining 20%, and these are the standard names. If you name them anything else, you're doing it wrong, because that's not what those things are called.

So don't come up with fancy names for these. Use the standard names, and then other programmers will know what those values are. So let's talk about how we split those up.

We already have our X and Y, and it's 100% of the X and 100% of the Y. We're gonna split each of these up, and here's how we split them. Let's take a look at this image, which you'll see if you execute this block. The function we're gonna use is called train_test_split.

Here's our full dataset. We've already done this part: we've split it up into features, X, and target, just one column, Y. The features are our inputs, the characteristics of the car. The target is our prices, the end goal.
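As a rough sketch of that features/target split, assuming the data lives in a pandas DataFrame: the column names and values below are made up for illustration and are not the lesson's actual dataset.

```python
import pandas as pd

# Hypothetical mini version of the car data; names and values are illustrative only.
cars = pd.DataFrame({
    "fuel_efficiency": [28, 25, 22, 30, 19],
    "horsepower": [140, 225, 210, 150, 310],
    "engine_size": [1.8, 3.2, 3.0, 2.0, 4.2],
    "price_in_thousands": [21.5, 28.4, 33.9, 23.9, 62.0],
})

# Features (X): every input column. Target (Y): the single column we want to predict.
X = cars[["fuel_efficiency", "horsepower", "engine_size"]]
Y = cars["price_in_thousands"]

print(X.shape)  # (5, 3): five rows, three feature columns
print(Y.shape)  # (5,): one target value per row
```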

Now, we split the features up into X_train and X_test, and again, X_train is about 80% of it and X_test is about 20%. We split the target, this Y, into Y_train and Y_test. The reason is that we can then show the model the training rows and say: here are these inputs, and this should be the goal. Make a formula that goes from these inputs to this target. And then let's test it with this new data, this new X data. Can you get the correct Y data, the correct targets?

All right, so the actual code itself is fairly simple. It's maybe a little more complex than some of the others, but it's not that hard.

It's just that it gives us back several values at once, what I'd call a tuple. We're gonna call train_test_split, and again, that's a function from scikit-learn that's gonna split up X and Y for us. We call that function and give it X and Y, and finally, how big our test size should be. A typical value is 0.2, in other words, 20% of the data. So it's gonna take the X data and split it up into X_train and X_test, and the Y data and split it up into Y_train and Y_test. The order that we pass things in matters, as with any function call, and so does the order of what comes back. Actually, if I look at the type of this, it's a list.

So technically it's a list, not a tuple, but either way, we're gonna unpack it. We're gonna say X_train, X_test. Those are the first two values it returns. The next two are Y_train and Y_test. Again, these are the standard names for them, and it has to be in this order.
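Here's a minimal sketch of that call. The DataFrame is a made-up stand-in for the car data (153 rows, to match the 122 + 31 rows in this lesson), and its column names are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the lesson's car data: 153 rows of made-up values.
rows = range(153)
cars = pd.DataFrame({
    "fuel_efficiency": [20 + i % 15 for i in rows],
    "horsepower": [100 + (7 * i) % 200 for i in rows],
    "engine_size": [1.5 + 0.1 * (i % 30) for i in rows],
    "price_in_thousands": [15 + (3 * i) % 60 for i in rows],
})
X = cars[["fuel_efficiency", "horsepower", "engine_size"]]
Y = cars["price_in_thousands"]

# Pass X first, then Y, then the fraction of rows to hold out for testing.
parts = train_test_split(X, Y, test_size=0.2)
print(type(parts))  # <class 'list'>
print(len(parts))   # 4

# Unpack in the order the pieces come back: both X pieces, then both Y pieces.
X_train, X_test, Y_train, Y_test = parts
```

Unpacking straight from the call, as in X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2), does the same thing in one line.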

If we put the wrong values in the wrong places, we're gonna end up with Y data for our X and X data for our Y, and our model will be trained against the test data, which is incorrect. It'll be all kinds of wrong, so we wanna make sure we get those in the right order. And if we look at those, the lengths should match up. The X_train data, if we look at its length, is 122 rows, and it should be the same for our Y_train: 122 rows.

For our Y_test, 31 rows. Again, that's 20% of the data, and our X_test should also be 31 rows. Nope, that didn't work, because I didn't capitalize it.

There we go, 31 rows. Okay, great. You can also confirm what X_train is. I'll show you the head and the tail, the first five and the last five rows, and we can see it's just the columns we want, and the same for Y_train, just the column we want.
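These checks might look something like the sketch below. The data is the same kind of made-up 153-row stand-in, with assumed column names. With 153 rows and test_size=0.2, scikit-learn rounds the test count up to 31 and leaves 122 for training. The last part deliberately unpacks in the wrong order, as a hypothetical mistake, to show how a length check catches it.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the lesson's car data: 153 rows of made-up values.
rows = range(153)
cars = pd.DataFrame({
    "fuel_efficiency": [20 + i % 15 for i in rows],
    "horsepower": [100 + (7 * i) % 200 for i in rows],
    "engine_size": [1.5 + 0.1 * (i % 30) for i in rows],
    "price_in_thousands": [15 + (3 * i) % 60 for i in rows],
})
X = cars[["fuel_efficiency", "horsepower", "engine_size"]]
Y = cars["price_in_thousands"]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

# Each X piece should have exactly as many rows as its matching Y piece.
print(len(X_train), len(Y_train))  # 122 122
print(len(X_test), len(Y_test))    # 31 31

# head() and tail() show the first and last five rows by default.
print(X_train.head())
print(Y_train.tail())

# Deliberate mistake: the names are in the wrong order, so "bad_Y_train"
# actually receives the X test rows, and the lengths no longer match.
bad_X_train, bad_Y_train, bad_X_test, bad_Y_test = train_test_split(X, Y, test_size=0.2)
print(len(bad_X_train), len(bad_Y_train))  # 122 31
```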

Now, you'll notice these numbers on the left, the row numbers, are not in order anymore. That's because it doesn't just split the data up, it does it randomly. It shuffles it up.
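To see the shuffling for yourself, here's a small sketch on made-up data with assumed column names. The random_state and shuffle arguments are not used in this lesson. They are real train_test_split parameters, added here only to make the shuffle repeatable and to show the unshuffled behavior for contrast.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the lesson's car data: 153 rows of made-up values.
rows = range(153)
cars = pd.DataFrame({
    "fuel_efficiency": [20 + i % 15 for i in rows],
    "horsepower": [100 + (7 * i) % 200 for i in rows],
    "engine_size": [1.5 + 0.1 * (i % 30) for i in rows],
    "price_in_thousands": [15 + (3 * i) % 60 for i in rows],
})
X = cars[["fuel_efficiency", "horsepower", "engine_size"]]
Y = cars["price_in_thousands"]

# Default behavior: rows are shuffled before splitting.
# random_state (an addition for this sketch) makes the shuffle repeatable.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
print(X_train.index[:5].tolist())  # row labels in shuffled order

# For contrast: shuffle=False keeps the original row order,
# so the first 122 rows go to training and the last 31 to testing.
X_train_ordered, X_test_ordered, _, _ = train_test_split(
    X, Y, test_size=0.2, shuffle=False
)
print(X_train_ordered.index[:5].tolist())  # [0, 1, 2, 3, 4]
print(X_test_ordered.index[0])             # 122
```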

I'll show you what I mean. Let's show our X_train. It starts with rows 90, 134, 46.

Our Y_train, same numbers. Again, they're shuffled, but they got split up so that they line up. The first X, that set of fuel efficiency, horsepower, and engine size, goes with the first Y, row 90 of Y, this price.
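One way to confirm that the rows line up is to compare the row labels directly. This is a sketch on made-up data with assumed column names, so the specific row numbers will differ from the ones in the lesson.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the lesson's car data: 153 rows of made-up values.
rows = range(153)
cars = pd.DataFrame({
    "fuel_efficiency": [20 + i % 15 for i in rows],
    "horsepower": [100 + (7 * i) % 200 for i in rows],
    "engine_size": [1.5 + 0.1 * (i % 30) for i in rows],
    "price_in_thousands": [15 + (3 * i) % 60 for i in rows],
})
X = cars[["fuel_efficiency", "horsepower", "engine_size"]]
Y = cars["price_in_thousands"]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

# Same row labels in the same shuffled order for features and targets.
print(X_train.index.equals(Y_train.index))  # True
print(X_test.index.equals(Y_test.index))    # True

# The first training row's features and its price come from the same original row.
first_label = X_train.index[0]
print(X_train.iloc[0])
print(Y_train.iloc[0] == cars.loc[first_label, "price_in_thousands"])  # True
```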

So the model will look at the training data and know what the answer is for each row. And that's how you split up our testing and training data. One line of beautiful code.

Colin Jaffe
