Enhance your predictive analysis skills by mastering logistic regression techniques using key employee metrics. Learn how to effectively clean, scale, and split your data to build robust logistic regression models.
Key Insights
- The article outlines how to prepare data for logistic regression by selecting relevant columns such as "low," "medium," "high," "satisfaction level," "average monthly hours," and "number of promotions in the last five years."
- It emphasizes the importance of scaling numerical data to address variations in scale, demonstrating this by applying the Standard Scaler to training and test datasets.
- The article discusses a structured approach to splitting the dataset into training and testing sets, typically allocating 20% of the data for testing to effectively evaluate the logistic regression model's performance.
Now that we have our data in a pretty good format and we've done some data analysis, we can think about our domain knowledge. And again, this is the kind of thing that you're welcome, and in fact encouraged, to keep working on: thinking through which of these columns will help, trying out different numbers of them, massaging the data in any way you want, looking at outliers, and using any of the other tools that we'll look at. But here we want to speak instead to what's different with a logistic regression compared to a linear one.
So let's make that happen. We're going to use for our x the following columns. We're going to use low, medium, and high.
Then we're also going to use satisfaction level, average monthly hours (and that is the way the original column is spelled), and how many promotions they've had in the last five years. Each of those column names has to be a string. Y, on the other hand, will simply be, as always, a series of our label, our answer, which in this case is the left column: zero or one, stayed or left.
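The column selection described above can be sketched roughly as follows. This is a minimal sketch, not the course's actual notebook: the DataFrame `df` and its values are made up for illustration, and the exact column names (`satisfaction_level`, `average_montly_hours`, `promotion_last_5years`) are assumptions based on how the transcript describes them.

```python
import pandas as pd

# Tiny made-up stand-in for the employee data; the values are illustrative only.
df = pd.DataFrame({
    "low": [1, 0, 0, 1, 0, 1],
    "medium": [0, 1, 0, 0, 1, 0],
    "high": [0, 0, 1, 0, 0, 0],
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.37, 0.41],
    "average_montly_hours": [157, 262, 272, 223, 159, 153],  # assumed original spelling
    "promotion_last_5years": [0, 0, 0, 1, 0, 0],
    "left": [1, 0, 1, 0, 1, 0],
})

# Column names are passed as strings; these exact names are assumptions.
feature_columns = [
    "low",
    "medium",
    "high",
    "satisfaction_level",
    "average_montly_hours",
    "promotion_last_5years",
]

X = df[feature_columns]  # a DataFrame of inputs
y = df["left"]           # a Series holding the label (0 or 1)

print(X.shape, y.shape)
```

Selecting with a list of strings returns a DataFrame, while selecting with a single string returns a Series, which matches the "x is several columns, y is one column" split the transcript describes.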
If I run that, I've got my x and y separated now. Now I'm going to split those x and y into our training and testing data. We're going to do our train test split and pass into it our x, our y, and what our test size should be, usually 20%.
And then we're going to unpack the tuple that it gives us back into x train, x test, y train, and y test. All right. A lot of our data is still numerical and at quite different scales, right? You look at our average monthly hours here: 157, 159, but up to 200 plus, almost 300 in some cases.
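As a rough illustration of the train/test split step just described, here is a sketch using scikit-learn's `train_test_split` with 20% of the rows held out for testing. The synthetic data and the `random_state` value are assumptions added so the example runs on its own and gives repeatable results; they are not from the lesson.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 100 rows, two numeric features.
rng = np.random.default_rng(0)
n_rows = 100
X = pd.DataFrame({
    "satisfaction_level": rng.uniform(0, 1, n_rows),
    "average_montly_hours": rng.integers(96, 311, n_rows),
})
y = pd.Series(rng.integers(0, 2, n_rows), name="left")

# test_size=0.2 holds out 20% of the rows for testing.
# random_state is an added assumption to make the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```

The function returns four objects in a fixed order (training inputs, testing inputs, training labels, testing labels), which is why they can be unpacked in a single assignment.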
So there's quite a lot of variation. And you look at that versus promotion last five years, which could be zero, one, or two, some very small number. We want to scale them all so they're centered around the mean.
We'll use our standard scaler to do so. We'll say, give me a standard scaler. And then we'll say x train is actually the scaled version of x train.
And the same for x test, same for our test inputs. Okay. Now this line is almost exactly like it was last time, except instead of saying our model is a linear regression, it's a logistic regression.
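The scaling step can be sketched as below. The transcript does not show the exact calls, so this is one common pattern rather than necessarily what was typed in the lesson: fit the `StandardScaler` on the training inputs, then reuse the same fitted scaler on the test inputs. The small arrays are made-up values in the spirit of the monthly-hours and promotions columns.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up inputs (assumption): column 0 is like monthly hours, column 1 like promotions.
X_train = np.array([[157.0, 0.0], [262.0, 0.0], [272.0, 1.0], [223.0, 0.0]])
X_test = np.array([[159.0, 0.0], [153.0, 2.0]])

scaler = StandardScaler()

# Learn each column's mean and standard deviation from the training data,
# then replace X_train with its scaled version.
X_train = scaler.fit_transform(X_train)

# Apply the same training statistics to the test inputs.
X_test = scaler.transform(X_test)

# After scaling, each training column has mean ~0 and standard deviation ~1.
print(X_train.mean(axis=0))
print(X_train.std(axis=0))
```

Using `transform` rather than `fit_transform` on the test inputs keeps information from the test set out of the scaling statistics, so both sets are scaled on the same basis.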
And as our last bit before we get to, you know, evaluating it, let's train our model with model.fit on this data. Here's the x train. Here's the y train.
Try to learn a pattern from that, please, model. And I neglected to run this cell. You can see there's no check mark.
There we go. Run that, run that. All right.
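For reference, creating and training the model as described above might look something like this. It is a minimal sketch under stated assumptions: the training arrays are synthetic stand-ins for already-scaled inputs, and the labels come from a made-up rule purely so there is a pattern to learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, already-scaled stand-in inputs (assumption): 200 rows, 3 features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))

# Hypothetical labels: 1 when a made-up noisy linear score is positive, else 0.
score = X_train[:, 0] - 0.5 * X_train[:, 1] + rng.normal(scale=0.5, size=200)
y_train = (score > 0).astype(int)

# Same shape as the linear regression workflow, but with a classifier.
model = LogisticRegression()
model.fit(X_train, y_train)

# The fitted model predicts a class (0 or 1) and a probability for each class.
print(model.predict(X_train[:5]))
print(model.predict_proba(X_train[:5]))
```

The visible difference from a linear regression is in the output: `predict` returns a class label rather than a continuous number, and `predict_proba` exposes the estimated probability behind that label.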
Next we'll evaluate how our model did.