Advanced DataFrame Indexing and String Manipulation

Pandas offers versatile alternatives to traditional indexing methods with at and iat, simplifying data editing tasks. Learn how to effectively filter and manipulate strings within datasets using partial matching and replacements.

Key Insights

Explore at and iat as effective alternatives to the traditional loc and iloc indexing methods in Pandas, allowing users to modify DataFrame values accurately and efficiently based on row and column locations.
Use the Pandas .str.contains() method for partial string matching, improving search results by capturing variations like "seafood salad" or "shrimp salad," and utilize the parameter case=False to ignore case sensitivity and broaden the search scope.
Leverage the .str.replace() method to perform efficient string replacements within DataFrame columns, correcting errors or changing terminology easily while ensuring best practices by using direct selection with .loc to avoid common Pandas warnings.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

This is a lesson preview only. For the full lesson, purchase the course here.

Now there's an alternative to lock and ilock called at and iat. So one more time we're going to change the name of the salad, but we're going to use iat. Fooddf.at, which is analogous to lock.

We're going to go into row eight—excuse me, at is analogous to, yeah, at is analogous to lock. We're going to come into row eight, column 'item', and set it equal to seafood salad. There you go.

Seafood salad. Now iat, the 'i' being integer, is analogous to ilock. So let's go back to shrimp salad, and we'll say fooddf.iat, which works just like ilock, 8,0—right, the integer location of that first item column—setting it equal to shrimp salad, and it's back to shrimp salad.

So that is an alternative to lock and ilock, at and iat. Just due to personal preference and long habit, I prefer lock and ilock, and I think you see lock and ilock more often in documentation tutorials when you're googling stuff.

Now what if you're looking for a string match? I want every single result. I'm not using greater than or equal to here. I'm not using greater than or less than, or greater than or equal to or less than or equal to like we did for the numeric filters when we were getting the price greater than 15 and so on, right? Let's go look at that example.

Python for Data Science Bootcamp: Live & Hands-on, In NYC or Online, Learn From Experts, Free Retake, Small Class Sizes, 1-on-1 Bonus Training. Named a Top Bootcamp by Forbes, Fortune, & Time Out. Noble Desktop. Learn More.

When you're doing a filter, that Boolean comparison is the current price. It's going to iterate through and it's going to take every price and see if it's greater than or equal to 15, and if it is—if that return is true—it's going to accumulate it in the result. But we're not using double equal, right? We could, looking for something that's exactly 500 calories or exactly $20 or something.

Obviously, you get fewer results when you do an exact match. We're going to be doing it. We're going to do a double-equal comparison, but we're going to look for salads.

Now, if you look in our output, we have one, two, three, four, five salads. Granted, there are different kinds of salads, but there's still five items with the word salad in it. So let's get all the salad rows via an exact match.

We're going to make a new DF. Let's call it salad DF. That's going to be your food DF, and our Boolean comparison is going to be the food DF item equal to salad.

And let's run and see how many salads we get. And we get zero salads. And the reason we got zero salads is because even though you have five salads, none of the salads are exactly called salad.

They're all something salad. So plain old salad doesn't catch any of them because you're going for an exact match. So what we really need to do, and what you would typically do when you're looking for a string match, is a partial match with string contains.

So you don't just double equal. You use .str. There's a string property of the column, and you go into the string and then look for contains—call the contains method and feed in the word that you're looking for.

It's kind of like an SQL query with LIKE where, you know, when you search Google for something like bell curve, you're not saying, “Show me the web pages that only have the word bell curve exactly.” No, it contains bell curve. Otherwise search would be useless.

Filter would be useless, ineffective. We just want the items that contain the word salad. We're going to say to get all the salads, use str.contains('salad'), and to make it not case sensitive—because programming is case sensitive—pass the optional argument case=False.

We're going to say salad DF again equals food DF. We're going to do a little Boolean comparison.

Food df item .str.contains('salad', case=False)

And now that we're saying we don't need an exact match, how many salads are we going to get? Unfortunately, we still get no salads. The reason for that is because in the dataset, in the data frame, salad is capitalized in all cases, whereas we're searching for lowercase salad.

So you don't have to care about the case sensitivity. We pass in that extra parameter case=False. Run it again and now you are good with your five salads.

And we've already got our big kahuna burger. Let's take that out. So how many burgers do we have? Let's go back to the original.

You've got one burger, two burgers, three: hamburger, bison burger, and big kahuna burger. So let's get all the burgers.

Challenge. Get all the burgers into a new data frame called burgers DF. Pause the recording and come back when you're ready for me to show you the solution.

Hopefully you'll get it on your own. It's pretty much like salads except you're looking for burger. All right.

Here's the solution. We're going to say burgers DF = food DF, and we're going to pass in our little Boolean comparison, food DF on the item column.

We're going to say string contains burger, and we're careful to say case=False. And there you go.

There's your three burgers. Now what about string replace? This is another method you can call on strings. We have string contains.

So let's do just a little thing. Let's deliberately mess up the spelling of something. We're going to set a column equal to itself but with the string replace method called on it.

You remember string replace: find cat, replace with dog; find O-R, O-U-R for British spelling, replace with O-R. Let's do a little review of the normal string replace.

Review of regular Python string.replace. And, this being the data frame—the pandas version—a little bit different but kind of similar, right? Just replace. We'll say statement = "Cats are better than dogs." No, we don't want to say that.

"Cats are the best. Cats rule." We're going to change that to dogs.

We'll say statement = statement.replace('cat', 'dog'). Even if it doesn't have an S, it will still change cat to dog.

Then we'll print the statement. There you go. So that's a straight-up simple Python regular string method called replace, which is built in here to pandas.

So what we want to do is find salad, mess up the spelling, and then see if we can fix it again. We call that on the column. We'll say fooddf['item'], set the item column equal to itself via item.str.replace.

So you're in the item on every single row, and you're going to look for salad and replace it with ensalada. How about case=False? What doesn't it like? Not found.

Okay. Food df. Oh, let's see.

Food df. We're going to the food df.

We're going to the item, we're going to the string, and we're replacing the word salad with something else. We're saying case=False.

Oh, you don't wrap it in here because it's just one column to one column. Whoa. Set a copy on a slice.

Okay. So here's what it wants. It doesn't like that.

It works. Ensalada. Okay.

All the salads are ensaladas now, but it doesn't really like it. Now we're going to fix it back how it was. We're going to go ensalada back to salad.

We'll go capital S, and we don't want this error message. We're going to say fooddf.loc[:, 'item']. I think that would satisfy it.

All rows, that column. There it is. That worked.

Back to salad. We did a direct selection without ilock—just targeted the column—and it didn't really like that.

It warned and said to use lock. So fine. Lock being we want:

Colon, all rows, comma item column. Set equal to itself with string replace, ensalada changing to the word salad.

So that should probably be lock colon—boom. Lock colon—boom. There we go.

And we'll replace it with ensalada.

Key Insights

Brian McClain

How to Learn Python