Master precise HTML parsing techniques by learning to extract attribute values and select elements nested within others using Python's Beautiful Soup library. Enhance your web scraping skills by effectively navigating HTML structures and handling nested elements with clarity.
Key Insights
- Use Beautiful Soup's .find_all() method to locate specific HTML elements within other elements, such as extracting all a tags contained inside blockquote elements.
- Access attribute values directly from HTML elements, demonstrated by retrieving the 'name' attribute from targeted a tags, enabling precise selection of HTML attributes.
- Manage the nested lists that result from parsing HTML by flattening them with Python techniques such as .extend() or list concatenation, which makes the data simpler to work with.
Let's show you a couple more complex queries you may need. One is getting attribute values from HTML elements, and the other is finding elements that are inside other elements. We'll demonstrate both by looking at these a tags with their name attribute, like this: 1.1.1, 1.1.2.
What if we want that 1.1.1? Not name equals, and not the text from the tag, but literally the value here, 1.1.1, or the value 1.1.2, or really all of them. If we want that, we're going to need to access the attribute. We also found that when we searched for all a tags, we got some extra ones. Let's see if I can find them.
They're in here somewhere. Here's the a tag for the Shakespeare homepage, and this one for Love's Labour's Lost.
These are the links up at the top, and ideally we don't want those. We want to be more specific.
In this case, we want the value of the name attribute for every a tag that's in a blockquote. Not these ones up here; they don't even have name values.
We'll get an error if we try to access name on them. Instead, we'll say I want only the ones that are in blockquotes. So here's how we do that.
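To make the error mentioned above concrete, here is a minimal sketch. The HTML string is a hypothetical, simplified stand-in for the play's page, and the variable names are illustrative. It assumes the bs4 package is installed and uses the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page (not the real markup).
html = """
<a href="/index.html">Shakespeare homepage</a>
<blockquote>
<a name="1.1.1">Let fame, that all hunt after in their lives,</a>
</blockquote>
"""

soup = BeautifulSoup(html, "html.parser")
all_links = soup.find_all("a")

nav_link = all_links[0]   # navigation link: has href, but no name
line_link = all_links[1]  # line anchor inside the blockquote

print(line_link["name"])     # 1.1.1
print(nav_link.get("name"))  # None: .get() returns None instead of raising

# Square-bracket access on a missing attribute raises KeyError.
try:
    nav_link["name"]
except KeyError:
    missing_name_raises = True
else:
    missing_name_raises = False

print(missing_name_raises)   # True
```

Restricting the search to blockquotes, as the walkthrough goes on to do, avoids the problem without needing .get().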
First, get all the blockquotes. We could say blockquotes equals soup.find_all("blockquote"). Now that we've got that, we want to find all a tags that are inside those blockquotes.
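A minimal sketch of that first step, assuming a small made-up HTML sample (the placeholder line text and the variable names are illustrative, not taken from the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one navigation link and two speeches.
html = """
<a href="/index.html">Shakespeare homepage</a>
<blockquote>
<a name="1.1.1">Let fame, that all hunt after in their lives,</a>
<a name="1.1.2">placeholder text for line two</a>
</blockquote>
<blockquote>
<a name="1.1.3">placeholder text for line three</a>
</blockquote>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like collection of Tag objects.
blockquotes = soup.find_all("blockquote")

print(len(blockquotes))               # 2
print(type(blockquotes[0]).__name__)  # Tag
```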
We could say a_tags equals, and instead of soup.find_all, we're going to start with the blockquotes. Every single element that soup gives you back has its own query methods; that is, each blockquote has a method for finding all things inside it. And in fact, if we need to find something inside the a tags that are in the blockquotes, we can call .find_all on one of those a tags too. So blockquotes.find_all("a") looks like it will find all a tags within these blockquotes, except it won't quite.
That's because find_all is a method on one blockquote, not on all of them. All of them together is a list.
And a list doesn't have a .find_all on it, but every element in the list, every blockquote, does. So this won't quite work.
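The difference between the list and its elements can be sketched as follows. The HTML sample is hypothetical, and the flag name list_has_find_all is only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with two speeches.
html = """
<blockquote>
<a name="1.1.1">placeholder text for line one</a>
<a name="1.1.2">placeholder text for line two</a>
</blockquote>
<blockquote>
<a name="1.1.3">placeholder text for line three</a>
</blockquote>
"""

soup = BeautifulSoup(html, "html.parser")
blockquotes = soup.find_all("blockquote")

# Calling find_all on the whole result list fails with AttributeError.
try:
    blockquotes.find_all("a")
except AttributeError:
    list_has_find_all = False
else:
    list_has_find_all = True

print(list_has_find_all)  # False

# Calling it on a single element of the list works.
first_block_links = blockquotes[0].find_all("a")
print(len(first_block_links))  # 2
```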
Instead we need to loop through the blockquotes. I'm actually going to simplify this and make an empty list called names.
I'm pretty sure there's a way to do this with some overly complex list comprehensions, but at that point it's clearer to write a loop. We're going to loop: for every blockquote in blockquotes.
Now that we're in one blockquote, it has a .find_all on it. So let's say a_tags equals blockquote, singular, .find_all("a"). And you see I'm getting autocomplete here.
I was not getting it before, because lists don't have .find_all on them. Now that I've got that, I could do another loop, but I think at this point a list comprehension will work just fine.
So I'm going to say names.append, which is not quite right but close enough, and pass it a new list comprehension. That's going to be for every tag in a_tags, and for every single one I want to do something with that tag.
What I want is its name attribute, and that works just like accessing a key-value pair, as if tag were a dictionary; for attributes, it behaves like one. Any time I do a find_all, I get back tags, and with a single .find I get back one tag, and each has its own find methods and other methods and properties, like get_text. Instead of get_text, though, I want the name value.
And again, that's this right here. I don't want "Let fame, that all hunt after in their lives." I want the name.
I want the value for the name attribute, and I get it just as I would from any other dictionary: tag["name"]. This isn't going to quite work, but we're getting pretty close.
Let's look at names. Hmm, we got some kind of error here. Oh, yeah, I said the earlier line won't quite work, but I neglected to comment it out.
Let's try that again. All right, so we've almost got it. We're pretty close here.
This comes down to understanding Python lists. We have lists within our big list. Here is a list of all the lines that go with speech one, then another list of all the lines that go with speech two, another list for speech three, and so on.
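The loop with append described above might look like the sketch below. It runs against a hypothetical two-speech sample, so the output shows the nested shape rather than the real page's data.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a navigation link plus two speeches.
html = """
<a href="/index.html">Shakespeare homepage</a>
<blockquote>
<a name="1.1.1">Let fame, that all hunt after in their lives,</a>
<a name="1.1.2">placeholder text for line two</a>
</blockquote>
<blockquote>
<a name="1.1.3">placeholder text for line three</a>
</blockquote>
"""

soup = BeautifulSoup(html, "html.parser")
blockquotes = soup.find_all("blockquote")

names = []
for blockquote in blockquotes:
    # Only a tags inside this one blockquote; the navigation link is skipped.
    a_tags = blockquote.find_all("a")
    # append adds the whole comprehension result as ONE element.
    names.append([tag["name"] for tag in a_tags])

print(names)  # [['1.1.1', '1.1.2'], ['1.1.3']]
```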
What we actually want is to flatten the list, that is, get rid of the lists within a list. We can do that by changing the step where we put a new list into the names list.
Instead of append, we say names.extend, which adds each element of the new list to names. In effect it concatenates those lists together. All right, if I run that now, we have a flat list with all the line numbers in it.
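Swapping append for extend in the same loop gives the flat result. This is a sketch on a hypothetical sample, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with two speeches.
html = """
<blockquote>
<a name="1.1.1">placeholder text for line one</a>
<a name="1.1.2">placeholder text for line two</a>
</blockquote>
<blockquote>
<a name="1.1.3">placeholder text for line three</a>
</blockquote>
"""

soup = BeautifulSoup(html, "html.parser")

names = []
for blockquote in soup.find_all("blockquote"):
    a_tags = blockquote.find_all("a")
    # extend adds each item of the new list individually, so names stays flat.
    names.extend([tag["name"] for tag in a_tags])

print(names)  # ['1.1.1', '1.1.2', '1.1.3']
```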
We also could have done that another way. I went too high up. I believe this will work:
names equals names plus the new list, and we'll concatenate the lists that way. Let's try that. Yep, same result.
Use that if it's clearer to you than .extend. Either one will work.
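The two flattening approaches can be compared side by side in plain Python. The per_speech data below is made up to mimic the nested result, and the variable names are illustrative.

```python
# Hypothetical nested result: one inner list of line names per speech.
per_speech = [["1.1.1", "1.1.2"], ["1.1.3"]]

with_extend = []
with_plus = []
for speech_names in per_speech:
    with_extend.extend(speech_names)     # modifies the list in place
    with_plus = with_plus + speech_names  # builds a new list and rebinds the name

print(with_plus)                 # ['1.1.1', '1.1.2', '1.1.3']
print(with_extend == with_plus)  # True
```

One design note: the plus form creates a new list on every pass through the loop, while extend grows the existing one. For a page of this size the difference is unlikely to matter.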
Okay, so again we're showing you two things. Here's how we get an attribute, and here's how we find all things that are in another thing:
find all a tags that are in this particular blockquote. Keep in mind, as always, that .find_all and .find are on each individual element, never on the list itself. blockquotes is the list we get back here.
It doesn't have a .find_all on it, but every element in it does. That's a bit of a gotcha: a list versus what's on each element of the list. We're going to use all of these techniques in our next big project.
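Earlier the walkthrough mentions that a list comprehension could replace the loop. One possible version, offered as an alternative sketch rather than the approach used in the lesson, is a single comprehension with two for clauses. The HTML sample is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a navigation link plus two speeches.
html = """
<a href="/index.html">Shakespeare homepage</a>
<blockquote>
<a name="1.1.1">placeholder text for line one</a>
<a name="1.1.2">placeholder text for line two</a>
</blockquote>
<blockquote>
<a name="1.1.3">placeholder text for line three</a>
</blockquote>
"""

soup = BeautifulSoup(html, "html.parser")

# Outer clause walks the blockquotes; inner clause walks the a tags in each one.
names = [
    tag["name"]
    for blockquote in soup.find_all("blockquote")
    for tag in blockquote.find_all("a")
]

print(names)  # ['1.1.1', '1.1.2', '1.1.3']
```

The result is already flat, so no extend or concatenation step is needed. Whether this reads more clearly than the explicit loop is a matter of taste.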