Analyze the Titanic dataset to understand passenger details, survival factors, and data quality considerations. Learn effective approaches for handling missing values and preparing data for predictive modeling.
Key Insights
- The dataset includes information about 891 Titanic passengers, with survival status (1 for survived, 0 for died) serving as the target variable.
- Cabin data is missing in roughly three quarters of records, so it will be excluded from predictive modeling.
- Age, missing in nearly 20% of records, is considered valuable for prediction, while embarkation port data has minimal missing values that can be managed during preprocessing.
Let's take a look at this data and see what's happening with it. First off, we have information on 891 people aboard the Titanic. We have quite a few columns to dive into here.
PassengerId, then Survived, zero or one. This is our y, our target: did they survive or not? One means they survived, zero means they died.
Pclass is the passenger class for that particular passenger: first class, second class, or third class. Then their name, their sex, their age.
SibSp is siblings and spouses, as in people in the same generation. Parch is parents and children, people of different generations. We have the ticket, how much they paid in fare, what cabin they had, and their port of embarkation.
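To make the column walkthrough concrete, here is a minimal sketch of how you might inspect a table like this in pandas. The column names follow the widely used Kaggle Titanic training file, which is an assumption since the transcript only reads them aloud, and the three rows are made-up illustrative values, not real passengers.

```python
import pandas as pd

# Made-up illustrative rows; column names assume the Kaggle Titanic layout.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],          # target: 1 = survived, 0 = died
    "Pclass": [3, 1, 3],            # passenger class: 1, 2, or 3
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "SibSp": [1, 1, 0],             # siblings and spouses aboard
    "Parch": [0, 0, 0],             # parents and children aboard
    "Ticket": ["T-001", "T-002", "T-003"],
    "Fare": [7.25, 71.28, 7.93],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", "S"],    # port of embarkation code
})

# shape gives (rows, columns); on the full dataset the row count would be 891.
n_rows, n_cols = sample.shape
column_names = list(sample.columns)

print(f"{n_rows} rows, {n_cols} columns")
print(column_names)
```

On the real data you would load the file with pd.read_csv instead of building the frame by hand; the path depends on where your copy lives.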
There are three ports where they could have boarded the Titanic. Let's take a look at which of that data is good. We're going to call isna() on the Titanic data to see which values are not available.
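The missing-value check described here can be sketched as follows. The small frame is a hypothetical stand-in for the Titanic data, with invented values, so the counts it prints are not the real dataset's counts.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data, with a few values left out.
titanic_sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# isna() marks each missing cell as True; sum() then counts them per column.
missing_counts = titanic_sample.isna().sum()

print(missing_counts)
```

Chaining isna() with sum() works because True counts as 1 and False as 0, so each column's total is its number of missing values.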
And we have a couple that we need to deal with. Cabin, we are going to ignore. It's missing in most of the data.
In three quarters or so of the rows, it's gone. We're just not going to use it when we build the X that we train and test our model on. Age is only missing in something like 20% of rows.
And we'll deal with that. We want age. It could be a pretty good predictor.
And Embarked is only missing a couple of values. We'll deal with those as we go. But that's where we're starting with our data.
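As a rough sketch of the decision just described, you could look at the share of missing values per column and then drop Cabin before building the feature table. The frame below is a hypothetical four-row stand-in with invented values, and the name features is an assumption for illustration. Age and Embarked are left untouched here, since how to fill them is handled later.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the real dataset has 891 rows.
titanic_sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
})

# mean() of the boolean isna() mask gives the fraction missing per column.
missing_share = titanic_sample.isna().mean()
print(missing_share)

# Cabin is mostly empty, so leave it out of the feature table.
# drop() returns a new frame; the original is not modified.
features = titanic_sample.drop(columns=["Cabin"])
print(list(features.columns))
```

Dropping by name rather than by an automatic threshold keeps the choice explicit: Cabin goes, while Age stays even though it also has gaps.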
Next, we'll dive into how we can use that data and what we can do to clean it up.