Learn how to scrape multiple pages by looping through URLs using Python and BeautifulSoup. Turn thousands of data points into structured datasets by extracting and converting key information.
Key Insights
- Loop through multiple pages (from 1 to 50) by dynamically generating URLs using Python's range function and f-strings to systematically scrape titles and prices.
- Extract book titles by finding all H3 tags on each page, then retrieving the "title" attribute from nested anchor tags, and append these titles to a list for structured data collection.
- Retrieve prices in numerical format by selecting paragraph elements with class "price_color," stripping currency symbols, and converting values into floats, resulting in a DataFrame containing 1,000 rows of book titles and prices.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
For our grand finale, let's reset our titles and prices and loop through all the pages. If we look at this page here, we're actually at books.toscrape.com/catalogue/page-1.html. And if I go to next, well, now I'm at books.toscrape.com/catalogue/page-2.html. So what I want to do is loop from 1 to 50, the pagination max we found in the last step, and for each page number, make a new BeautifulSoup, scrape that page, and add its titles and prices to our lists.
Here, let's make that happen. For page_num — resisting the urge to add those other-programming-language parentheses; not in Python, please. For page number in the range from 1 to pagination max, the value we got in the earlier step: 50 in our case.
And then we also need to add plus 1, because range is exclusive at the end: if we say range(1, 51), it gives us the numbers 1 through 50. Okay, for every page number, what we want to do is build a URL.
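That off-by-one detail is worth seeing on its own. A minimal sketch (the variable name `pagination_max` is an assumption based on the lesson):

```python
# Hypothetical variable from the lesson: the last page number found earlier.
pagination_max = 50

# range() is exclusive at the end, so the + 1 makes this cover pages 1 through 50.
page_numbers = list(range(1, pagination_max + 1))
```

Without the `+ 1`, the loop would silently skip the final page.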
I'm going to make an f-string, meaning I'll be able to insert Python values into the string. I'll copy and paste the URL here, and this part at the end is going to be a placeholder for some value.
That value is the page number right up here in the loop. First it'll be page 1, then page 2, page 3, all the way up to page 50. All right, so now that we've got that URL, we can say response = requests.get(url).
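The URL-building step can be sketched as a list comprehension (the `/catalogue/page-N.html` pattern is how books.toscrape.com structures its paginated listing):

```python
# Build one URL per page with an f-string, inserting each page number
# into the site's /catalogue/page-<n>.html pattern.
urls = [f"https://books.toscrape.com/catalogue/page-{n}.html" for n in range(1, 51)]
```

In the lesson itself, the f-string lives inside the loop body and each URL is requested as it is built; the comprehension here just shows all 50 at once.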
And our soup is BeautifulSoup(response.content, "html.parser"). All right, we're off and running. This will make a request for each page, 50 separate requests in all. Now we just need to generalize what we did earlier. I'm going to say titles equals titles plus.
There are a bunch of different ways you could do a loop within a loop. You could do a list comprehension, which is what I'm going to do, or you could use .extend instead of this plus. I'm going to do it this way. All right: titles equals titles plus h3.find the a tag.
Oh, but that's going to be for the h3s, and I actually haven't queried the h3s yet, so this isn't going to work yet. I'm going to say h3s = soup.find_all("h3").
Now, h3.find("a") for every h3 in h3s: I find the a tag within each h3, and then I get its title attribute. Yep.
So find the a within each h3 and give me its title, for every h3 in the h3s list, and keep appending those to the end of our titles list.
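The title-extraction step can be sketched against a tiny hand-written fragment of markup, so it runs without any network requests (the two sample titles below are placeholders standing in for scraped data, not real output from the site):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for one page of book listings: each book sits in an <h3>
# whose nested <a> carries the full title in its "title" attribute.
html = """
<h3><a href="a-light-in-the-attic/index.html" title="A Light in the Attic">A Light in ...</a></h3>
<h3><a href="tipping-the-velvet/index.html" title="Tipping the Velvet">Tipping the ...</a></h3>
"""

soup = BeautifulSoup(html, "html.parser")
h3s = soup.find_all("h3")

titles = []
# The lesson's pattern: grow the list with a comprehension that finds the
# <a> inside each <h3> and grabs its "title" attribute.
titles = titles + [h3.find("a")["title"] for h3 in h3s]
```

Inside the real loop, `titles = titles + [...]` runs once per page, so the list grows by 20 titles on each of the 50 requests.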
Great. Let's do the same thing for prices. Prices might be a little tougher, since we want to get them into numerical format, which we do.
I'll figure out what to do with each price in a second, but first I'm going to say I want to loop through the prices. All right.
I, again, have not actually queried the prices yet. Those prices are in p tags, if I remember. We don't have to leave it up to memory, which is good, so let's double-check.
I'm going to call them price elements. I'd actually say price tags, but that sounds like it means something different.
price_elements = soup.find_all("p", class_="price_color"). All right. Those are our price elements.
So: something, for element in price_elements. For every element in price_elements, I want that element's text. That's get_text: element.get_text().
Oh, but there's going to be more, because now I've got the text but I want to strip out the pound symbol, and I'm going to need to go back and copy and paste that character because I don't have it handy.
Okay. Strip that out, then convert it to a float. To read this in a more English way: append to prices, for every element in price_elements,
that element's text, with its pound symbol stripped, passed through the float function to make it a number. That looks pretty good. When this loop is done, we should have updated prices and titles lists.
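The price step can be sketched the same way, against a small inline sample (the two price values are illustrative, not scraped):

```python
from bs4 import BeautifulSoup

# Sample markup matching the site's price elements:
# <p class="price_color">£51.77</p>
html = '<p class="price_color">£51.77</p><p class="price_color">£53.74</p>'

soup = BeautifulSoup(html, "html.parser")
price_elements = soup.find_all("p", class_="price_color")

prices = []
# Get each element's text, strip the pound symbol, and convert to a float.
prices = prices + [float(el.get_text().replace("£", "")) for el in price_elements]
```

The order matters: strip the `£` first, because `float("£51.77")` would raise a ValueError.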
Let's run this. It's taking a while, because it's making 50 separate requests as it scrapes.
Now let's put the results in a DataFrame and see what we've got. books is now a new DataFrame where title (actually a better, more data-y name than titles) is our titles list,
and price is our prices. Let's see it. All right.
And down here are our last five rows at the end. We got 1,000 of them: 1,000 rows by two columns,
title and price, for all 1,000 books on the entire site. Woof!
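The final DataFrame step looks like this, assuming pandas is installed (the two sample rows stand in for the full 1,000-item scrape):

```python
import pandas as pd

# Placeholder data standing in for the lists built by the scraping loop.
titles = ["A Light in the Attic", "Tipping the Velvet"]
prices = [51.77, 53.74]

# Singular column names ("title", "price") read better than plural ones
# once each row describes a single book.
books = pd.DataFrame({"title": titles, "price": prices})
```

With the real scrape, `books.shape` would come out as `(1000, 2)`, and `books.tail()` shows those last five rows.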