The Ultimate Guide To Web Scraping - Node.js & Python
Aug 13, 2024

In this article, we will cover web scraping for beginners. Web scraping is a technique used to extract data from websites. It is a valuable tool for data scientists, researchers, and developers who need to collect data from the web for analysis or other purposes. In this guide, we will cover the basics of web scraping, including how it works, the tools and libraries you can use, and some best practices to keep in mind.
We will also demonstrate how to scrape a website using both Node.js and Python, which have some really useful libraries for web scraping. For Node.js, we'll use a library called Puppeteer, a powerful tool that allows you to automate tasks in a headless browser, such as clicking buttons, filling out forms, and extracting data from web pages. For Python, we'll use Beautiful Soup, a popular library for extracting data from HTML and XML files. Beautiful Soup provides simple methods for navigating, searching, and modifying a parse tree, making it incredibly easy to scrape information from web pages. You can find the code for both scrapers on GitHub.
This guide is sponsored by Bright Data. They offer all kinds of tools for scraping, including powerful APIs, proxies, a scraping browser, and more.
What is Web Scraping?
Web scraping is the process of extracting data from websites. While this can be done manually, it is often more efficient to automate the process with a program. Web scraping is commonly used to collect data for research, analysis, or other purposes. For example, you might want to collect data on job listings from a job board or gather product information from an e-commerce website. Web scraping allows you to automate this data collection process, saving time and effort. While APIs are the best way to get data from websites, not all websites offer APIs, so web scraping can be a good alternative. However, this brings us to our next point.
Legal & Ethical Considerations
Web scraping can be a controversial topic since some websites do not allow scraping of their content. Some allow it with certain stipulations, while others do not allow it at all. It's important to be aware of the legal and ethical considerations when scraping websites and to make sure you have permission to scrape the website you are targeting. In this guide, we will be scraping a website called books.toscrape.com for educational purposes. This website is specifically designed for scraping and is open to the public.
History of Web Scraping
Web scraping has been around for a long time. The very first tool used for web scraping, aside from the browser itself, was created in 1993 and was called the World Wide Web Wanderer. It was a bot that would navigate from link to link and index the content of the pages it visited. Although it wasn't a scraper in the traditional sense, it laid the groundwork for automated web data collection and search engine crawling.
In 2004, Beautiful Soup was released, one of the first libraries built specifically for this kind of work. It's a Python library that is still used to this day and allows developers to extract data from HTML and XML files. Since then, web scraping has become a common practice for collecting data from the web.
Web scrapers have gained more popularity recently due to the increasing amount of data available on the web.
Why Use Web Scraping?
Here are some common use cases for web scraping:
- Data Collection: Collecting data from websites for research or analysis. This could include data on products, jobs, real estate properties, news articles, or other information.
- Price Monitoring: Monitoring prices on e-commerce websites to track changes over time. This can be used to find the best deals for yourself or to track competitors' prices for a business.
- Content Aggregation: Aggregating content from multiple websites to create a new website or service. For example, building a news aggregator that collects news articles from multiple sources and displays them on a single website.
- Lead Generation: Collecting contact information from websites for sales or marketing purposes. For example, scraping contact information from a directory website to generate leads for a business.
- Market Research: Collecting data on competitors or industry trends from websites.
- Social Media Monitoring: Collecting data from social media websites to track trends or sentiment.
How Does Web Scraping Work?
I'll start with a general summary of how web scraping works, and then we'll delve deeper into the steps. Ultimately, I'll show you how to do it with code.
Web scraping works by sending a request to a website, downloading the HTML content of the page, and then extracting the data you are interested in. There are a few different ways to extract data from a web page, depending on the structure of the page and the type of data you are looking for. Some common methods include using regular expressions, parsing the HTML content with a library like Beautiful Soup, or using a tool like Puppeteer to interact with the page in what we call a headless browser. These tools are powerful because they allow you to interact with the page as if you were a user—clicking buttons, filling out forms, and extracting data from the page. Many tools used for web scraping are also used for testing web applications, as they allow you to automate tasks in a headless browser. You can then do what you want with that data, such as saving it to a file or database.
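To make that concrete, here is a minimal sketch of the request-and-extract flow in Node.js. It is an illustration only, assuming Node 18+ (which has fetch built in) and using a naive regular expression; later in this guide we will use proper libraries instead:
// Minimal sketch: download a page's HTML and pull out its <title> with a regex.
// Real scrapers should use an HTML parser (Cheerio, Beautiful Soup, etc.)
// rather than regular expressions.
const run = async () => {
  const response = await fetch('https://books.toscrape.com'); // send the request
  const html = await response.text(); // download the HTML content

  const match = html.match(/<title>([^<]*)<\/title>/i); // naive extraction
  console.log(match ? match[1].trim() : 'No title found');
};

run();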
Tools & Libraries for Web Scraping
There are many tools and libraries available for web scraping, depending on the language you are using and the complexity of the scraping task. Many tools are used for both testing your own web applications and scraping data from other websites.
Here are some popular tools and libraries for web scraping:
- Puppeteer: Puppeteer is a Node.js library that provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It is a powerful tool for automating tasks in a headless browser, such as clicking buttons, filling out forms, and extracting data from web pages.
- Cheerio: Cheerio is a fast, flexible, and lean implementation of core jQuery designed specifically for the server. It is a popular tool for web scraping in Node.js.
- Playwright: Playwright is a Node.js library that provides a high-level API to automate tasks in a headless browser. It is similar to Puppeteer but supports multiple browsers, including Chrome, Firefox, and WebKit.
- BeautifulSoup: BeautifulSoup is a Python library for parsing HTML and XML documents. It provides a simple API for navigating and searching the HTML content of a web page, making it easy to extract data from web pages.
- Scrapy: Scrapy is a Python framework for web scraping. It provides a powerful API for extracting data from websites, handling pagination, and following links to scrape multiple pages.
- Selenium: Selenium is a popular tool for automating web browsers. It provides a WebDriver API that allows you to interact with a web page in a headless browser. Selenium is often used for testing web applications but can also be used for web scraping.
There are many other tools and libraries available for web scraping, depending on your language and requirements. Even HTTP libraries like Axios or Requests can be used for simple scraping tasks.
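To illustrate that last point, here is a hedged sketch of a simple scraper built with Axios and Cheerio (assuming you have run npm install axios cheerio; the selector is borrowed from the books.toscrape.com markup we use later in this guide):
// Sketch: fetch static HTML with Axios and query it with Cheerio.
// This only sees server-rendered HTML; it will not run the page's JavaScript.
import axios from 'axios';
import * as cheerio from 'cheerio';

const run = async () => {
  const { data: html } = await axios.get('https://books.toscrape.com');
  const $ = cheerio.load(html);

  // Log the title attribute of each book link
  $('.product_pod h3 a').each((i, el) => {
    console.log($(el).attr('title'));
  });
};

run();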
Basic Steps to Web Scraping
Here are the basic steps to web scraping:
- Identify the Website: The first step is to identify the website you want to scrape. Make sure you have permission to scrape the website, as scraping without permission can be illegal. As I mentioned, there are some legal and ethical considerations when it comes to web scraping, so be sure you are aware of the rules and regulations in your country.
- Inspect the Page: Once you have identified the website, you will need to inspect the page to determine the structure of the page and the data you want to extract. You can do this by right-clicking on the page and selecting "Inspect" in your browser or using a tool like Chrome DevTools. Examine the HTML structure of the page to identify the data you are interested in. This includes looking at the class names, IDs, and other attributes of the elements on the page.
- Write the Code: Once you have identified the data you want to extract, you can write the code to scrape the website. This will involve sending a request to the website, downloading the HTML content of the page, and extracting the data you are interested in. This usually works similarly to how you select elements in JavaScript or jQuery. Cheerio is actually a subset of jQuery, so it's very similar. You may have to handle pagination and other tasks that I'll cover later.
- Run the Code & Extract Data: Once you have written the code, you can run it to scrape the website and extract the data you are interested in. You can then save the data to a file (JSON, CSV, XML, etc.) or database for further analysis.
Additional Steps & Considerations
- Handle Pagination: If the data you want to scrape is spread across multiple pages, you will need to handle pagination to scrape all the data. You can do this by following the links to the next page and scraping each page in turn.
- Handle Dynamic Content: If the website uses JavaScript to load content dynamically, you may need to use a tool like Puppeteer to interact with the page in a headless browser.
- Handle Errors: Web scraping can be error-prone, so you will need to handle errors in your code to ensure that your scraper runs smoothly.
- Respect Robots.txt: Many websites have a robots.txt file that specifies which pages can be scraped and which should not be. These are typically used to tell search engines whether to crawl the page or not. Make sure to respect the robots.txt file of the website you are scraping.
- Rate Limiting: To avoid overloading the website you are scraping, you should add rate limiting to your scraper to limit the number of requests you send to the website (a small sketch follows this list).
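To give a feel for rate limiting (with some basic error handling thrown in), here is a minimal, hedged sketch in Node.js. The one-second delay and the URLs are arbitrary assumptions; adjust them to whatever the site's terms and robots.txt allow:
// Sketch: space out requests with a small delay so the target site is not hammered.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const run = async () => {
  const urls = [
    'https://books.toscrape.com/catalogue/page-1.html',
    'https://books.toscrape.com/catalogue/page-2.html',
  ];

  for (const url of urls) {
    try {
      const response = await fetch(url); // Node 18+ has fetch built in
      console.log(url, response.status);
    } catch (err) {
      console.error(`Request failed for ${url}:`, err.message);
    }
    await sleep(1000); // wait 1 second between requests
  }
};

run();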
Scraping with Puppeteer
In this guide, we will be using Puppeteer to scrape a website. Puppeteer is a powerful tool that allows you to automate tasks in a headless browser and provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol.
We will be scraping a website that is actually designed for scraping: books.toscrape.com. It is set up like a real e-commerce website but is open to the public and designed for educational purposes. It uses pagination, which is something you will often encounter when scraping websites. So we will learn how to handle pagination in this guide.
Installing Puppeteer
To use Puppeteer, you need to have Node.js installed on your machine. If you don't, go to the Node.js website and download and install it.
Open up a terminal and run the following:
npm init -y
npm install puppeteer
This will create a package.json file and install Puppeteer in your project.
Using ES Modules
I prefer to use the ES Modules syntax in my Node.js projects, which is the modern way to write JavaScript code. To use ES Modules in Node.js, you need to add "type": "module" to your package.json file.
{
  "type": "module"
  // ...
}
Writing Your First Scraper
Create a new file called scrape.js and add the following code:
import puppeteer from 'puppeteer';
const run = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const url = 'https://books.toscrape.com';
  await page.goto(url);
  const title = await page.title();
  console.log(`Page Title: ${title}`);
  await browser.close();
};

run();
Puppeteer uses async/await syntax. In many examples, you will see an unnamed IIFE (Immediately Invoked Function Expression) that wraps the async function. I like to define the async function separately and then call it at the end of the file.
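For reference, the unnamed IIFE version of the same script looks like this; it is functionally identical, just anonymous and invoked immediately:
import puppeteer from 'puppeteer';

// Same logic as above, wrapped in an Immediately Invoked Function Expression
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://books.toscrape.com');
  console.log(`Page Title: ${await page.title()}`);
  await browser.close();
})();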
The run function is an async function that launches a new browser instance, creates a new page, navigates to a website, gets the title of the page, and then closes the browser.
So you see what we did here? We launched a headless browser, navigated to a website, and extracted the title of the page. This is a simple example, but you can do pretty much anything that you can do in a browser with Puppeteer.
Scraping The Books On The First Page
Let's create a script that will get all of the books on the first page and format them into a JSON array with the title, price, stock, rating, and link.
Let's first get rid of the two lines that get and log the title of the page.
page.evaluate
Puppeteer provides a method called page.evaluate that allows you to run JavaScript code in the context of the page. This is useful for extracting data from the page, as you can use JavaScript to select elements on the page and extract the data you are interested in. We simply use methods like querySelector to select elements. So if you are familiar with JavaScript, you will feel right at home.
Let's add the following code to the run function:
const books = await page.evaluate(() => {});
This is where we will write the JavaScript code to extract the data from the page. We will use the document object to select elements on the page and extract the data we are interested in.
Examining The Page
Remember our second step in the basic steps to web scraping? We need to inspect the page to determine the structure of the page and the data we want to extract. Open the browser dev tools and inspect the page. You will see that each book is contained in an article element with a class of product_pod.
So within the page.evaluate method, we can use document.querySelectorAll to select all of the book elements on the page.
const bookElements = document.querySelectorAll('.product_pod');
We can't console.log within the page.evaluate method because this code is running in the browser context, not in the Node.js context. So if you try to log bookElements there, you won't see anything in your terminal; the output goes to the headless browser's console, not to Node.js. But you can return the data and log it outside of the page.evaluate method.
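As an aside, if you do want to see browser-side logs while debugging, Puppeteer can forward them to your terminal. A small sketch, added right after creating the page (the 'browser:' prefix is just a label for this example):
// Forward console messages from the headless browser to the Node.js terminal
page.on('console', (msg) => console.log('browser:', msg.text()));
Back to our scraper. Returning the data and logging it in the Node.js context looks like this: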
const books = await page.evaluate(() => {
  const bookElements = document.querySelectorAll('.product_pod');
  return bookElements;
});

console.log(books);
What this will give you is a NodeList, which is an array-like structure containing the selected book elements on the page. Right now, each entry will show up as an empty object {} because DOM elements can't be serialized and sent back to Node.js, and we have not extracted any plain data yet. There are 20 books on the page, so you will see 20 entries, indexed 0 through 19.

A NodeList is not really useful to us, but we can convert it to an array using the Array.from method. Then we can map over the array and extract the data we are interested in.
const books = await page.evaluate(() => {
  const bookElements = document.querySelectorAll('.product_pod');
  return Array.from(bookElements).map((book) => {
    return book;
  });
});

console.log(books);
This will give you an array of empty objects, but now we have access to each book element and we can extract the data we are interested in.
Extracting The Data
I want to get the title, price, stock, rating, and link for each book. Let's start with the title.
Title
If we examine the book element, we can see that the title is contained in an a element within an h3 element. The link does not have a class, but it does have a title attribute. We can use that.
const books = await page.evaluate(() => {
  const bookElements = document.querySelectorAll('.product_pod');
  return Array.from(bookElements).map((book) => {
    const title = book.querySelector('h3 a').getAttribute('title');
    return title;
  });
});

console.log(books);
This will give you an array of book titles.
Price
Now let's extract the price. Again, examining the book element, we can see that the price is contained in a p element with a class of price_color. We can use that and the textContent property to get the price.
const price = book.querySelector('.price_color').textContent;
Stock
Now let's extract the stock. The stock is contained in a p element with the classes instock availability. We can use a ternary operator to mark the book as in stock if that element is present, and out of stock otherwise.
const stock = book.querySelector('.instock.availability')
  ? 'In stock'
  : 'Out of stock';
Rating
Now let's extract the rating. The rating is contained in a p element with a class of star-rating. It also has a second class with the number of stars, so for instance, if a book has three stars, it will have the classes star-rating Three. We can use the following code to extract the rating.
const rating = book.querySelector('.star-rating').className.split(' ')[1];
We are using the split method to split the class name into an array and then getting the second item in the array, which is the number of stars.
Link
Finally, let's extract the link. The link is contained in the a element within the h3 element. We can use the href attribute to get the link.
const link = book.querySelector('h3 a').getAttribute('href');
Putting It All Together
We just need to return an object with the title, price, stock, rating, and link for each book. Here is the full code:
import puppeteer from 'puppeteer';

const run = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const url = 'https://books.toscrape.com';
  await page.goto(url);

  // Extract book information
  const books = await page.evaluate(() => {
    const bookElements = document.querySelectorAll('.product_pod');
    return Array.from(bookElements).map((book) => {
      const title = book.querySelector('h3 a').getAttribute('title');
      const price = book.querySelector('.price_color').textContent;
      const stock = book.querySelector('.instock.availability')
        ? 'In stock'
        : 'Out of stock';
      const rating = book.querySelector('.star-rating').className.split(' ')[1];
      const link = book.querySelector('h3 a').getAttribute('href');

      return {
        title,
        price,
        stock,
        rating,
        link,
      };
    });
  });

  // Log the collected data to the console
  console.log(books);

  await browser.close();
};

run();
Now if you run the script, you will get an array of objects with the title, price, stock, rating, and link for each book on the page.
Saving The Data
You probably want to save the data somewhere. You can save it to a JSON file, a CSV file, or a database. In this example, we will save the data to a JSON file.
Let's first import the fs module to write the data to a file.
import puppeteer from 'puppeteer';
import fs from 'fs'; // Import the fs module
Now, just replace the console.log at the bottom with the following code:
// Save the collected data to a JSON file
fs.writeFileSync('books.json', JSON.stringify(books, null, 2));
// Log the collected data to the console
console.log('Data saved to books.json');
We are using the writeFileSync method to write the data to a file called books.json. JSON.stringify converts the data to a JSON string; we pass null for the replacer (which can be used to filter the properties of the object, but we don't need it here) and 2 for the space parameter, which formats the JSON string with 2 spaces for readability.
Now if you run the script, you will see the data saved to a file called books.json.
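If you would rather have a CSV file, here is a minimal sketch along the same lines, reusing the books array from above (the escaping is deliberately simple and only meant as an illustration):
// Convert the array of book objects to CSV rows and save them.
// Each value is wrapped in quotes and any inner quotes are doubled.
const header = 'title,price,stock,rating,link';
const rows = books.map((book) =>
  [book.title, book.price, book.stock, book.rating, book.link]
    .map((value) => `"${String(value).replace(/"/g, '""')}"`)
    .join(',')
);

fs.writeFileSync('books.csv', [header, ...rows].join('\n'));
console.log('Data saved to books.csv');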
Congrats! You have successfully scraped a website using Puppeteer.
Handling Pagination
Usually, when you are scraping a website, you will need to handle pagination to scrape all of the data. Otherwise, you just get the data from the first page.
Adding Variables
First, let's add some variables to keep track of the current page and the total number of pages.
Add the following variables right below the const page = await browser.newPage(); line:
const allBooks = [];
let currentPage = 1; // Start from page 1
const maxPages = 10; // Number of pages to scrape
This script is configured to scrape 10 pages, but you can adjust it to scrape more pages if needed. Be aware that scraping more pages will increase the time required to complete the task and consume more resources.
Looping Through The Pages
Now, let's add a loop to scrape all of the pages. We will use a while loop to keep scraping pages until we reach the maximum number of pages.
We need to wrap it around everything like this:
while (currentPage <= maxPages) {
  const url = 'https://books.toscrape.com';
  await page.goto(url);

  // Extract book information
  const books = await page.evaluate(() => {
    const bookElements = document.querySelectorAll('.product_pod');
    return Array.from(bookElements).map((book) => {
      const title = book.querySelector('h3 a').getAttribute('title');
      const price = book.querySelector('.price_color').textContent;
      const stock = book.querySelector('.instock.availability')
        ? 'In stock'
        : 'Out of stock';
      const rating = book.querySelector('.star-rating').className.split(' ')[1];
      const link = book.querySelector('h3 a').getAttribute('href');

      return {
        title,
        price,
        stock,
        rating,
        link,
      };
    });
  });
}
The URL
Since we are dealing with pagination, we need to update the URL to include the page number. Click on the next page and you will see that the URL changes to https://books.toscrape.com/catalogue/page-2.html. So let's update the URL to include the current page number.
const url = `https://books.toscrape.com/catalogue/page-${currentPage}.html`;
Add The Books To The Array
Now we need to add the books to the allBooks array. We can use the push method (with the spread operator) to add the books to the array. Go to where the page.evaluate call ends and add the following code:
allBooks.push(...books);
console.log(`Books on page ${currentPage}:`, books);
This will add the books from the current page to the allBooks array and log the books to the console. Logging the books to the console is optional, but it can be useful for debugging.
Increment The Page
Finally, we need to increment the currentPage variable at the end of the loop. Add the following code right before the closing brace of the while loop:
currentPage++;
When we write to the file, we will write the allBooks array instead of the books array.
fs.writeFileSync('books.json', JSON.stringify(allBooks, null, 2));
Putting It All Together
This is the final script:
import puppeteer from 'puppeteer';
import fs from 'fs';

const run = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const allBooks = [];
  let currentPage = 1; // Start from page 1
  const maxPages = 10; // Number of pages to scrape

  while (currentPage <= maxPages) {
    const url = `https://books.toscrape.com/catalogue/page-${currentPage}.html`;
    await page.goto(url);

    // Extract book information
    const books = await page.evaluate(() => {
      const bookElements = document.querySelectorAll('.product_pod');
      return Array.from(bookElements).map((book) => {
        const title = book.querySelector('h3 a').getAttribute('title');
        const price = book.querySelector('.price_color').textContent;
        const stock = book.querySelector('.instock.availability')
          ? 'In stock'
          : 'Out of stock';
        const rating = book
          .querySelector('.star-rating')
          .className.split(' ')[1];
        const link = book.querySelector('h3 a').getAttribute('href');

        return {
          title,
          price,
          stock,
          rating,
          link,
        };
      });
    });

    allBooks.push(...books);
    console.log(`Books on page ${currentPage}:`, books);

    currentPage++; // Move to the next page
  }

  // Save the collected data to a JSON file
  fs.writeFileSync('books.json', JSON.stringify(allBooks, null, 2));

  // Log the collected data to the console
  console.log('Data saved to books.json');

  await browser.close();
};

run();
Now if you run the script, you will get an array of objects with the title, price, stock, rating, and link for each book on 10 pages.
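As an optional refinement, instead of hard-coding maxPages you could stop when the pager no longer offers a next link. A hedged sketch of the idea, assuming the li.next a selector that books.toscrape.com appears to use for its pager; you would pair it with a more generous loop condition:
// At the end of each loop iteration, check whether a "next" link exists.
const nextLink = await page.$('li.next a'); // resolves to null when absent
if (!nextLink) {
  break; // reached the last page, exit the loop
}
currentPage++; // otherwise move on to the next page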
Scraping In Python
Now we are going to do the same thing using Python and a library called Beautiful Soup. I assume that you have Python 3 installed.
Create A Virtual Environment
Let's start by creating a virtual environment:
python3 -m venv env
Now activate the environment:
source env/bin/activate
Now let's install Beautiful Soup and the Requests package for making HTTP requests:
pip install requests beautifulsoup4
Now let's create a new file called scraper.py and open it in your text editor or IDE.
Make The Request
First, we are going to import the libraries and make the initial request:
import requests
from bs4 import BeautifulSoup
import json

def fetch_books(page_number):
    url = f"https://books.toscrape.com/catalogue/page-{page_number}.html"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup.prettify())

def main():
    fetch_books(1)

if __name__ == "__main__":
    main()
We first import the requests and BeautifulSoup libraries as well as json, which we will use later to write the scraped data to a JSON file. The main function is the entry point; right now it just calls the fetch_books function, passing in 1 for the first page. Inside fetch_books, we build the URL with an f-string that includes the page number, make a GET request, and pass the response text into the BeautifulSoup constructor using the html.parser parser. Then we just print the soup object, which is the parsed HTML content of the page.
Extracting The Data
Let's put the result into a variable and loop over it and extract the data we are interested in:
import requests
from bs4 import BeautifulSoup

def fetch_books(page_number):
    url = f"https://books.toscrape.com/catalogue/page-{page_number}.html"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    books = []
    book_elements = soup.find_all('article', class_='product_pod')

    for book in book_elements:
        title = book.find('h3').find('a')['title']
        price = book.find('p', class_='price_color').text
        stock = 'In stock' if 'In stock' in book.find('p', class_='instock availability').text else 'Out of stock'
        rating = book.find('p', class_='star-rating')['class'][1]
        link = book.find('h3').find('a')['href']

        books.append({
            'title': title,
            'price': price,
            'stock': stock,
            'rating': rating,
            'link': f"https://books.toscrape.com/catalogue/{link}"
        })

    return books

def main():
    fetch_books(1)

if __name__ == "__main__":
    main()
We use the find_all method to find all of the book elements on the page. Then we loop over each book element and extract the title, price, stock, rating, and link. We use the find method to locate the elements we are interested in and then extract the data using the text property and the ['attribute'] syntax.
Saving The Data
Now let's loop over the pages and save the data to a JSON file:
import requests
from bs4 import BeautifulSoup
import json

def fetch_books(page_number):
    url = f"https://books.toscrape.com/catalogue/page-{page_number}.html"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    books = []
    book_elements = soup.find_all('article', class_='product_pod')

    for book in book_elements:
        title = book.find('h3').find('a')['title']
        price = book.find('p', class_='price_color').text
        stock = 'In stock' if 'In stock' in book.find('p', class_='instock availability').text else 'Out of stock'
        rating = book.find('p', class_='star-rating')['class'][1]
        link = book.find('h3').find('a')['href']

        books.append({
            'title': title,
            'price': price,
            'stock': stock,
            'rating': rating,
            'link': f"https://books.toscrape.com/catalogue/{link}"
        })

    return books

def main():
    all_books = []
    max_pages = 10  # Number of pages to scrape

    for current_page in range(1, max_pages + 1):
        books_on_page = fetch_books(current_page)
        all_books.extend(books_on_page)
        print(f"Books on page {current_page}: {books_on_page}")

    # Save the collected data to a JSON file
    with open('books.json', 'w') as f:
        json.dump(all_books, f, indent=2)

    # Log the collected data to the console
    print('Data saved to books.json')

if __name__ == '__main__':
    main()
We set the max_pages variable to 10, which is the number of pages we want to scrape. We then loop over the pages and call the fetch_books function for each page. We extend the all_books list with the books on the current page. We then save the data to a JSON file using the json.dump method.
Run the file and you should see the data saved to a file called books.json.
Conclusion
We covered the basics of web scraping, and you learned how to scrape a website that uses pagination in both Node.js and Python, along with some really useful libraries. I hope you enjoyed this tutorial and that it helps you start your scraping journey.