Learn to parse HTML content using Python's BeautifulSoup library. Gain practical skills to extract specific data from webpages programmatically.
Key Insights
- Use Python's BeautifulSoup library to parse HTML content returned by requests.get(), enabling extraction and querying of webpage elements such as headings (H3 tags).
- After parsing HTML with BeautifulSoup, leverage built-in methods like find_all() and get_text() to retrieve specific page content like act and scene names from Shakespearean texts.
- Accessing webpage elements programmatically reduces manual effort and increases accuracy compared to manually copying and pasting content, especially when gathering numerous data points.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Make sure to run this code block that reassigns the URL, otherwise your URL will still be our API URL from up here. Let's run that, and then let's hit up that URL. We'll use the same basic code.
We'll run requests.get on that URL, and that will give us back a response. And let's make sure that the response status code is equal to 200, which is the server saying, yep, that's the page you wanted, here it is, everything went great. If it's not 200, then we want to print something out, an error, to let people know that happened.
Something like "Request for data failed." But if I run this, we should get no printing. Great, because the status code was 200.
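As a minimal sketch of this step, assuming `url` has already been reassigned to the Shakespeare page in the earlier cell, the request and status check might look like this:

```python
import requests

# Assumes `url` was reassigned to the Shakespeare page in the earlier code block
response = requests.get(url)

# Let people know if we didn't get a 200 (OK) back from the server
if response.status_code != 200:
    print("Request for data failed")
```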
Alright, so we don't get back JSON. Last time we said, hey, give me response.json. And what that did is it converted the JSON into a regular Python dictionary. Instead, if we print out response.content, we'll see it's an extremely long string that's all the HTML of the page.
Very, very long one-line string. You can see here there's some good old "have sworn for three years' term to live with me," some good Shakespearean language here. You know, "still and contemplative in living art."
Yeah, that sounds like some Shakespeare stuff. Alright, so that's response.content. It's not JSON. It's not data we can convert to a dictionary.
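To see that for yourself, printing the response body (a small sketch, continuing from the request above) shows the raw HTML rather than JSON:

```python
# The body is raw HTML, not JSON we could convert to a dictionary
print(response.content)
```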
Instead, it's something we're going to need to work with as HTML. And to do that, we're going to use our library, BeautifulSoup. And to BeautifulSoup, we pass our response.content, that thing we just printed out, all the HTML.
And we're going to say, parse that as HTML. Give me an HTML parser for that content. And that's what we actually get is a parser, something that we can then look at and say, okay, I want this bit.
I want that bit. I want all of those bits. Give me all the bits.
So what this gives us back is a parsable, a queryable object we can use to filter down to the data we actually want. So this is typically called soup, just in recognition that BeautifulSoup is the name of the library. Okay, now that we've got that, let's see if we can get, let me just run that, make sure we haven't made a mistake.
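A minimal sketch of that parsing step, passing the HTML we just fetched to BeautifulSoup along with an HTML parser:

```python
from bs4 import BeautifulSoup

# Parse the raw HTML into a queryable object, conventionally named `soup`
soup = BeautifulSoup(response.content, "html.parser")
```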
No errors, great. Let's find all H3s. We can say soup.find_all. And it has this method built in because BeautifulSoup was created just for this purpose, to be able to parse a bunch of HTML.
Find all H3s. And if you're familiar with JavaScript's way of accessing the DOM, it's very, very similar. All right, so we say find all H3s.
Now, if I save that somewhere, maybe I save it as H3s. Let's print out H3s. Let's see if that's what we think it is.
All right, it is a list, a Python list, but it's not actually the text. It looks like the text. It's the printed out version of it.
You know, BeautifulSoup is doing some work to give us something that prints out nicely as text, but it's not actually text and you can tell because there's no quote marks. These are not strings. If I print out H3s at zero, I get back that, but what about the type of that? It's a BS4 element tag.
BS4, as in BeautifulSoup version 4, and it's an element Tag. It's not what we think it is. It's got all kinds of methods and abilities on it.
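Roughly, that step looks like this; printing the type of the first element shows it's a Tag, not a string:

```python
# Grab every <h3> element on the page
h3s = soup.find_all("h3")

print(h3s)           # prints like a list of headings, but these aren't strings
print(type(h3s[0]))  # <class 'bs4.element.Tag'>
```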
So one of those is get_text. Let's say I've got this H3s at zero, the very first one on this list. I can call its get_text method, and that'll be the last thing in the cell, so it should output the result.
It's the words "act one." Now it's a string. Now it's actual text.
So these aren't just a bunch of strings, these H3s. They're objects, not text, with methods on them like get_text.
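A small sketch of that call; the exact heading text depends on the page, but the type comes back as a plain string:

```python
# get_text() pulls out just the text content of the first <h3> Tag
first_heading = h3s[0].get_text()

print(first_heading)        # the act heading, e.g. something like "ACT I"
print(type(first_heading))  # <class 'str'>
```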
If we want to get the text from all of them, make a new list that is just the texts. That's a list comprehension if you know your Python. We could say H3 texts equals.
Now, let's give it a better name than that. Let's give it, these are scene names, or if we want to be really explicit, act and scene names. It's a new list where we call h3.get_text() for every h3 in the h3s list.
If we take a look at act and scene names now, yep, there it is. It's just now the actual strings for all the act and the scene names as a list. Now, the type of each of these is string.
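That list comprehension, as a sketch, using act_and_scene_names as the variable name from the narration:

```python
# List comprehension: call get_text() on every <h3> Tag to get plain strings
act_and_scene_names = [h3.get_text() for h3 in h3s]

print(act_and_scene_names)
print(type(act_and_scene_names[0]))  # <class 'str'>
```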
All right, so that was our first data scraping. Congratulations. Let's take a look at one more example.
We could find: what is Act 1, Scene 1, line 19, that is, 1.1.19? Here's how we're going to do that. Let's take a look at this here. Okay, let me inspect again.
1.1.19, "that war against your own affections." And we can see this. We're grabbing this text, "that war against your own affections."
But if we're accessing data programmatically, we don't have to care whether we could copy and paste this by hand. If we wanted to get all of the act and scene names, we could spend a bunch of time copying and pasting them all. It would take a long time to find them all.
We might even miss some. That's the great thing about doing this programmatically, about doing something like, hey, get me all the act and scene names. I've got some code that can do that.
Even when we're doing something that's a little more trivial here, a little less real world. Getting the act and scene names, sure. But getting, you know, that war against your own affections, I could just copy and paste this.
But we're trying out some new concepts, and this concept is, what if we want just one thing? The way we're going to do that is we're going to find the A tag with that name. If you want something more specific like that, we can absolutely do it. And in the next video, we'll show you how.
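The next video covers this in detail, but as a rough, hypothetical sketch of the idea (the anchor value "1.1.19" is an assumption about how the page labels its lines):

```python
# Hypothetical: find the single <a> tag whose name attribute marks that line
line_tag = soup.find("a", attrs={"name": "1.1.19"})

if line_tag is not None:
    print(line_tag.get_text())
```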