Create accurate predictions using Python and Random Forest classifiers, then evaluate your model's effectiveness by submitting results to Kaggle. Learn the complete workflow from preparing prediction arrays to formatting CSV submissions for Kaggle's Titanic competition.
Key Insights
- Created a prediction array by applying model.predict on the test dataset, generating an array consisting of zeros and ones indicating passenger survival.
- Prepared a submission CSV file conforming strictly to Kaggle's format by retrieving the passenger IDs from the original Titanic test dataset and including predictions, explicitly setting index=False to exclude unwanted index columns.
- Submitted the CSV file to Kaggle's Titanic Machine Learning competition, achieving an accuracy score around 77–79%, providing students motivation to explore improvements and adjustments to further optimize the Random Forest classifier model.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Okay, let's pick up right where we left off. We're going to create a predictions array. Let's call it predictions.
And it will be the result of running model.predict on our X test. And it is 400-some zeros and ones. Not very helpful for us without context for whether we got these correct or not, but we're going to use that in our next step, which is to create a data frame that will have passenger ID and predictions.
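The step above can be sketched as follows. This is a minimal stand-in, not the course notebook: the features here are synthetic random data rather than the real Titanic columns, since the point is just the shape of the `model.predict` call on a fitted Random Forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the Titanic training/test features (illustrative only)
rng = np.random.default_rng(42)
X_train = rng.random((100, 4))
y_train = rng.integers(0, 2, 100)
X_test = rng.random((20, 4))

# Fit a Random Forest classifier, as in the earlier part of the lesson
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# predictions is a 1-D array of 0s and 1s (died / survived)
predictions = model.predict(X_test)
print(predictions)
```

In the real lesson, `X_test` would be the 400-some-row Titanic test set, so `predictions` would have one 0 or 1 per passenger.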
Now we're going to submit this to Kaggle, and it has to be in this exact format so that its algorithm can check it against the Y test answers and give us an accuracy score. We need our passenger ID. I foolishly overwrote the X test and got rid of the passenger ID, but we can get it back.
What we're going to do is simply read from the test CSV again and get that right back. So I'm going to create a Titanic test data frame, and it's reading the CSV from our base URL plus csv slash, I believe it's called titanic_test.csv. And we'll double check that. Yep, it's got the passenger ID that I got rid of in X test.
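Re-reading the file to recover a dropped column looks roughly like this. The exact base URL and file name are course-specific, so this sketch simulates the CSV in memory with `io.StringIO` instead; `PassengerId` is the real column name in Kaggle's Titanic test file.

```python
import io
import pandas as pd

# In the lesson this reads from Google Drive, something like:
#   titanic_test = pd.read_csv(base_url + "csv/titanic_test.csv")
# (path and file name are assumptions). Here we simulate the file in memory.
csv_text = "PassengerId,Pclass,Sex,Age\n892,3,male,34.5\n893,3,female,47.0\n"
titanic_test = pd.read_csv(io.StringIO(csv_text))

# The PassengerId column we dropped from X_test is available again
print(titanic_test["PassengerId"].tolist())
```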
Okay, great. Since we've got that, let's now make a data frame where, well, yeah, let's make a data frame called our Titanic submission data frame, maybe. Sure.
So Titanic submission, sure, is a new data frame, and it's got a passenger ID column that should equal the Titanic test data frame's passenger ID from up there. Then we're going to include a survived column, and it's going to equal our predictions from up above. These zeros and ones up here.
And then we can just check our submission. Looks pretty good. Passenger ID and zeros and ones.
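Building that two-column data frame can be sketched like so. The passenger IDs and predictions here are small stand-in values; the column names `PassengerId` and `Survived` are the ones Kaggle's Titanic competition actually requires.

```python
import numpy as np
import pandas as pd

# Stand-ins for the reloaded test file's IDs and the model's predictions
passenger_ids = pd.Series([892, 893, 894, 895])
predictions = np.array([0, 1, 0, 1])

# Exactly the two columns Kaggle wants, in this order
titanic_submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": predictions,
})
print(titanic_submission)
```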
Just those two columns is all that Kaggle wants. Now to save it as a CSV is a little bit of work, but not too bad. We want to save it to Google Drive in our case and then download it.
And we're going to make sure to set index of false. If we don't do that, then we'll get another column that'll be these indexes here. We don't want that.
We want only passenger ID and survived as columns in our CSV that we're uploading. We're going to say Titanic submission, not read CSV, but to CSV. And we're going to save it to our base URL on Google Drive plus csv slash Kaggle submission dot CSV.
And finally, index equals false so that we have only those two columns in it. Perfect. Okay.
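A quick sketch of why `index=False` matters. Without it, pandas writes the data frame's row index as an extra unnamed first column, which would break Kaggle's expected format; the file path here is a placeholder for the course's Google Drive path.

```python
import pandas as pd

titanic_submission = pd.DataFrame({
    "PassengerId": [892, 893],
    "Survived": [0, 1],
})

# In the lesson this writes to Google Drive, e.g.:
#   titanic_submission.to_csv(base_url + "csv/kaggle_submission.csv", index=False)
# Returning the CSV as a string here to compare the two behaviors:
with_index = titanic_submission.to_csv()             # extra leading index column
without_index = titanic_submission.to_csv(index=False)  # just the two columns
print(without_index)
```

The `index=False` version's header line is exactly `PassengerId,Survived`, which is what Kaggle's checker expects.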
Run that line. Now we're ready to submit that to Kaggle. It should be downloaded to your Google Drive.
Let's check it out. Here's my Kaggle submission dot CSV, but if you need to find it, it's in my drive. It should be in Python Machine Learning Bootcamp, CSV file, Kaggle submission dot CSV is what I just named it.
So I'm going to now download that. Right click on it. Yep.
Download. And yep, it's downloaded. Now I'm going to go to Kaggle and we're going to submit it.
If you don't have a Kaggle account, you will need one for this step, but also you should get one. You should get a Kaggle account. Kaggle is fantastic.
It's a big part of the machine learning community, and it's a great place to learn. What you're going to do is find the Titanic competition. You can search at the top; let me walk through that a little bit more.
You go to Competitions at the top and type in Titanic. And we have Titanic under competitions: Titanic - Machine Learning from Disaster. And what you're going to do is submit our CSV.
So you can go over to submissions. You go to submit prediction up here. And you can go to find the file where you downloaded it.
And now it'll run it and it will then give you a score. Should be around 79%. Here's the one I just did.
Ooh, down to 77%. I must have done something different. Okay, so that's a fine score.
It's a fine score. It's a great jumping-off point for, hey, how do I get my score better? And because this is a competition, I've found that it gives students a drive to learn more on their own.
How can I get my numbers higher? How can I get a better score? What makes the score better? What can I change? What can I tune? And it's a great start towards that self-improvement. So that's random forest classifiers. That's Kaggle.
And we'll continue with the next lesson.