Understanding HTML Structure for Effective Web Scraping

Gain practical insight into web scraping by analyzing the HTML structure of Shakespeare's complete works online. Learn essential HTML basics to efficiently extract structured data from web content.

Key Insights

The article introduces basic HTML knowledge, emphasizing the importance of understanding HTML structure (such as H3 headings and other tags) for effective web scraping.
It uses the MIT-hosted Complete Works of William Shakespeare webpage, a public domain resource established in 1993, as a practical example for analyzing content and HTML structure.
Inspecting HTML elements (such as heading tags and speech tags) through browser tools allows users to strategically identify and extract specific data like act and scene titles from structured web pages.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

To understand the shape of a page, how its pieces fit together, which piece we need, and how to access it, we need to understand HTML. I'm not going to spend a whole course on HTML. There are whole courses on HTML.

Instead, I'm gonna give you the briefest introduction, enough to understand HTML, enough to be dangerous, enough to get going with this. So here we have a URL that we'll be scraping some data from. If you click on it, you get the full script of Love's Labour's Lost by Shakespeare, which is very easy to scrape.

That is a public domain work. Shakespeare has been dead for several hundred years. Now, I think that link actually doesn't work.

This is a very old page. I did want to show you what page this is. It's the Complete Works of William Shakespeare.

The web's first edition of the Complete Works of William Shakespeare. This site has offered Shakespeare's plays and poetry to the internet community since 1993. Wow.


And if you look at the news items, they go back all the way to December '93. The site's creator was trying to avoid finishing his paper on Othello and created this server instead. So, you know, a very typical MIT student's way of putting off English work is to instead create a server to put up the full works of that author.

Very MIT. So this page has been around a long time. They won't mind if we scrape this public domain data from it.

And that's why we've chosen it. When we look at this data, we've got a sort of visible structure to it. We have Act One, then Scene One, and you can see that these act and scene labels are the same size as each other.

Then we have some text for that scene. And then we have a person's name, a little bit smaller than the act and scene labels, but maybe a bit bigger than the main text.

And then a full speech by Ferdinand, and then another name, slightly bigger. And then a speech by Longaville, another name, et cetera. And this structure continues all the way through to the next scene, which is somewhere down here.

But Shakespeare plays are not short. Okay, here we go. Scene Two, the same, right? Again, slightly bigger text showing that we're in a new section, a new scene.

Okay, so there's a structure here, and that structure is mirrored in, and actually built from, the HTML. So in your browser, in any browser, you can right click, and by right click, I mean, usually, click on the lower right of your trackpad.

If you have a mouse, it's actually the right mouse button. Control click might do it for you on some computers, or Alt click on others. Either way, you can pull up a contextual menu, and clicking on any of this text on our page will give you an Inspect option.

When you inspect, you should get a bunch of code-y looking stuff here that shows you the HTML. And in fact, when you put your cursor over the stuff over here to the right, it highlights what that goes with on the actual page, right? So here is what's called an H3. You can see this text: H3, Act One, H3.

That's this text right here. And again, if I put my cursor over this act one here, it highlights it over there on the page. And the same for the next, another H3.

We said this was the same size. H3 is a size of text essentially. It's a heading level three for scene one.

And then we go down and we have this speech tag, and in there a blockquote tag, and so on. So this is splitting off our page into areas that have structure to them. That's what HTML is.
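To make that nesting concrete, here is a minimal sketch in Python using only the standard library's html.parser module. The SAMPLE_HTML string is an assumption: a simplified fragment modeled on what the inspector shows in the lecture, not a copy of the live page. The TagOutline class name is also just illustrative.

```python
from html.parser import HTMLParser

# Assumed, simplified fragment modeled on the structure described above.
SAMPLE_HTML = """
<h3>ACT I</h3>
<h3>SCENE I. The king of Navarre's park.</h3>
<a name="speech1"><b>FERDINAND</b></a>
<blockquote>
<a name="1.1.1">Let fame, that all hunt after in their lives,</a>
</blockquote>
"""


class TagOutline(HTMLParser):
    """Record each opening tag, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        attr_text = "".join(f' {key}="{value}"' for key, value in attrs)
        self.lines.append("  " * self.depth + f"<{tag}{attr_text}>")
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


outline = TagOutline()
outline.feed(SAMPLE_HTML)
print("\n".join(outline.lines))
```

The printed outline is a rough text version of what the browser's inspector panel shows: two H3s at the top level, a name tag wrapping the speaker, and a blockquote holding the lines of the speech.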

It's a markup language. It's for marking up your text and saying, here's this part, it looks like this. Here's this other part of the text, it's gonna look like this.

When we put this text in the H3s here, we're saying Act One should be a bigger heading. When we put Scene One, the King of Navarre's Park, here in H3s, we're saying, hey, that's also a big heading. When we put this A tag with name equals speech here, we're saying a slightly bigger heading, but not as big as the other ones, right? We have these very specific bits of language around the actual text.

It's not text. The word H3 doesn't show up there, right? It's text about the text. It's information about the text.

We have that and within that we have actual text like scene one, the King of Navarre's Park. So that's what HTML is. And when you create a page, an HTML page, you use this HTML code like H3 to set aside and say, mark off this area, it is of this type.
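The difference between the markup and the text it wraps can be sketched in a few lines of Python. This is an illustration only, using the standard library's html.parser; the MarkupSplitter class name is hypothetical and the one-line HTML string is a stand-in for the scene heading discussed above.

```python
from html.parser import HTMLParser


class MarkupSplitter(HTMLParser):
    """Keep tag names and visible text in two separate lists."""

    def __init__(self):
        super().__init__()
        self.tags = []  # the "text about the text"
        self.text = []  # the text a reader actually sees

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        # Skip whitespace-only chunks between tags.
        if data.strip():
            self.text.append(data.strip())


splitter = MarkupSplitter()
splitter.feed("<h3>SCENE I. The king of Navarre's park.</h3>")
print(splitter.tags)
print(splitter.text)
```

The tag name ends up in one list and the scene title in the other, which is the point: the word H3 is never part of what the reader sees on the page.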

We can utilize that when scraping to say, hey, if I want all the act and scene names, well, those are all in H3s. That's not a rule of the internet. That's just a rule of the person who made this page.

We can look at it, we can examine the structure here and say, okay, if we want the list of acts and scenes, we want that data, well, we need to look to H3s for this one, for this kind of page. And that's the process we're going to overall be looking at as we work through these data scraping problems. Let's get started.
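As a rough sketch of that idea, the Python below pulls the text out of every H3 in a small sample. It uses only the standard library; a third-party parser such as Beautiful Soup is a common alternative, but nothing here depends on one. The SAMPLE_HTML fragment and the HeadingCollector name are assumptions for illustration, and on the real page you would first download the HTML and feed that in instead.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text found inside every tag of one kind (h3 by default)."""

    def __init__(self, target="h3"):
        super().__init__()
        self.target = target
        self.inside = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == self.target:
            self.inside = True
            self.headings.append("")  # start a new heading

    def handle_endtag(self, tag):
        if tag == self.target:
            self.inside = False

    def handle_data(self, data):
        # Only keep text that appears between the opening and closing target tag.
        if self.inside:
            self.headings[-1] += data


# Assumed, simplified fragment modeled on the page described in the lecture.
SAMPLE_HTML = """
<h3>ACT I</h3>
<h3>SCENE I. The king of Navarre's park.</h3>
<a name="speech1"><b>FERDINAND</b></a>
<blockquote>Let fame, that all hunt after in their lives,</blockquote>
<h3>SCENE II. The same.</h3>
"""

collector = HeadingCollector()
collector.feed(SAMPLE_HTML)
headings = [heading.strip() for heading in collector.headings]
print(headings)
```

The speaker name and the speech are skipped because they are not inside an H3. That only works because this page's author put act and scene names in H3s; a different site would need a different target.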


Colin Jaffe
