Logistic Regression with Data Scaling and Preparation

Enhance your predictive analysis skills by mastering logistic regression techniques using key employee metrics. Learn how to effectively clean, scale, and split your data to build robust logistic regression models.

Key Insights

The article outlines how to prepare data for logistic regression by selecting relevant columns such as "low," "medium," "high," "satisfaction level," "average monthly hours," and "number of promotions in the last five years."
It emphasizes the importance of scaling numerical data to address variations in scale, demonstrating this by applying the Standard Scaler to training and test datasets.
The article discusses a structured approach to splitting the dataset into training and testing sets, typically allocating 20% of the data for testing to effectively evaluate the logistic regression model's performance.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

Now that we have our data in a pretty good format and we've done some data analysis, we can think about our domain knowledge. This is the kind of thing you're welcome to keep working on, and in fact we encourage you to: thinking through which of these columns will help, trying out different numbers of them, massaging the data in any way you want, looking at outliers, and using any of the other tools that we'll look at. For now, though, we want to speak to what's different here with a logistic regression instead of a linear regression.

So let's make that happen. We're going to use for our X the following columns. We're going to use low, medium, and high.

Then we're also going to use satisfaction level, average monthly hours (typed the way the original column is spelled), and how many promotions they've had in the last five years. Each of those column names has to be a string. y, on the other hand, will simply be, as always, a series of our label, our answer, which in this case is the left column: zero or one, stayed or left.
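As a rough sketch of that column selection, here is how it might look in pandas. The tiny DataFrame is made up for illustration, and the exact column names (including the misspelled `average_montly_hours`) are assumptions about how the dataset labels these fields, so adjust them to match your own data.

```python
import pandas as pd

# Made-up stand-in for the employee data. The column names are assumptions;
# check df.columns on the real dataset before copying them.
df = pd.DataFrame({
    "low": [1, 0, 0, 1],
    "medium": [0, 1, 0, 0],
    "high": [0, 0, 1, 0],
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72],
    "average_montly_hours": [157, 262, 272, 223],
    "promotion_last_5years": [0, 0, 1, 0],
    "left": [1, 0, 1, 0],
})

# Every column name in the list must be a string.
feature_columns = [
    "low",
    "medium",
    "high",
    "satisfaction_level",
    "average_montly_hours",
    "promotion_last_5years",
]

X = df[feature_columns]  # inputs: a DataFrame with one column per feature
y = df["left"]           # label: a Series of zeros and ones

print(X.shape)
print(y.tolist())
```

Passing a list of names gives back a DataFrame for X, while a single name gives back a Series for y, which is the shape scikit-learn expects for inputs and labels.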

If I run that, I've got my X and y. Now I'm going to split those X and y into our training and testing data. We're going to call train_test_split and pass into it our X, our y, and what our test size should be, usually 20%.

And then we're going to unpack the values it gives us back into X_train, X_test, y_train, and y_test. All right. A lot of our data is still numerical and at sort of different scales, right? You look at our average monthly hours here: 157, 159, but up to 200-plus, almost 300 in some cases.
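The split described above could be sketched like this with scikit-learn's `train_test_split`. The random stand-in data and the `random_state` value are assumptions added so the example runs on its own and gives the same split each time; they are not part of the original walkthrough.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in data (assumption): 100 rows, two numeric features.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "satisfaction_level": rng.uniform(0, 1, 100),
    "average_montly_hours": rng.integers(100, 300, 100),
})
y = pd.Series(rng.integers(0, 2, 100), name="left")

# Hold out 20% of the rows for testing; the remaining 80% are for training.
# random_state is an assumption, used only to make the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```

Note that the 20% figure goes in `test_size`, so the model trains on the other 80% of the rows.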

So there's quite a lot of variation. Compare that with promotions in the last five years, which could be zero, one, or two, some very small number. We want to scale them all around the mean, so that every column ends up centered on its own mean with a comparable spread.

We'll use our standard scaler to do so. We'll say, give me a standard scaler. And then we'll say X_train is actually the scaled version of X_train.

And the same for X_test, our test inputs.

Now, this next line is almost exactly like it was last time, except instead of saying our model is a linear regression, it's a logistic regression.
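Here is a minimal sketch of the scaling step described above, using scikit-learn's `StandardScaler` on a few made-up rows. One assumption to flag: this version fits the scaler on the training data only and then reuses those statistics for the test data, which is a common practice but is not spelled out in the walkthrough.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows (assumption): monthly hours next to promotion counts,
# two columns on very different scales.
X_train = np.array([[157.0, 0.0], [262.0, 0.0], [272.0, 1.0], [223.0, 2.0]])
X_test = np.array([[159.0, 0.0], [247.0, 1.0]])

scaler = StandardScaler()

# Learn each column's mean and standard deviation from the training data,
# then replace X_train with its scaled version.
X_train = scaler.fit_transform(X_train)

# Apply the same training-set statistics to the test inputs.
X_test = scaler.transform(X_test)

print(X_train.mean(axis=0))
print(X_train.std(axis=0))
```

After scaling, each training column has a mean of roughly zero and a standard deviation of roughly one, so hours and promotion counts are on a comparable footing.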

And as our last bit before we get to evaluating it, let's train our model with model.fit on this data. Here's X_train. Here's y_train. Try to learn a pattern from that, please, model.

I neglected to run one of the earlier cells, and you can see there's no check mark next to it. So run that one, then run this one. All right.
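Creating and training the model, as described above, might look like the sketch below. The training data here is random stand-in data, and using `LogisticRegression` with its default settings is an assumption, since the walkthrough does not say which arguments, if any, are passed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random stand-in for already-scaled training inputs (assumption):
# 80 rows and 6 feature columns.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 6))

# Made-up labels loosely tied to the first feature, so there is a
# pattern for the model to find.
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=80) > 0).astype(int)

# Same shape of code as a linear regression, with a different model class.
model = LogisticRegression()
model.fit(X_train, y_train)

print(model.classes_)
print(model.coef_.shape)
```

The fitted model stores one coefficient per input column, and `model.classes_` lists the label values it learned to tell apart.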

Next we'll evaluate how our model did.

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.
