w3resource

Python Project - Basic Web Scraper Solutions and Explanations


Basic Web Scraper:

Learn web scraping by extracting data from a website using libraries like BeautifulSoup.

Input values:
None (Automated process to extract data from a specified website).

Output value:
Data extracted from the website using libraries like BeautifulSoup.

Example:

Input values:
None
Output value:
List all the h1 tags from https://en.wikipedia.org/wiki/Main_Page:
<h1 class="firstHeading mw-first-heading" id="firstHeading" style="display: none"><span class="mw-page-title-main">Main Page</span></h1>
<h1><span class="mw-headline" id="Welcome_to_Wikipedia">Welcome to <a href="/wiki/Wikipedia" title="Wikipedia">Wikipedia</a></span></h1>

Here are two different solutions for a basic web scraper in Python. The goal of the scraper is to extract data (such as all 'h1' tags) from a website using the 'requests' and 'BeautifulSoup' libraries.

Prerequisites:

To run these scripts, you'll need to have the following libraries installed:

  • requests: To send HTTP requests to the target website.
  • BeautifulSoup from bs4: To parse the HTML and extract data.

You can install these libraries using:

pip install requests beautifulsoup4
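
If you are unsure whether the installation succeeded, a quick check like the following (a sketch using only the standard library) prints the status of each package. Note that the import name for BeautifulSoup is 'bs4', not 'beautifulsoup4':

```python
# Sanity check: confirm both packages are importable before running the scrapers.
import importlib.util

for module in ("requests", "bs4"):
    found = importlib.util.find_spec(module) is not None
    print(f"{module}: {'installed' if found else 'missing'}")
```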

Solution 1: Basic Web Scraper using 'requests' and 'BeautifulSoup'

Code:

# Solution 1: Basic Web Scraper Using `requests` and `BeautifulSoup`
# Import necessary libraries
import requests  # Used to send HTTP requests
from bs4 import BeautifulSoup  # Used for parsing HTML content

# Function to extract data from the specified website
def scrape_h1_tags(url):
    # Send an HTTP GET request to the specified URL
    response = requests.get(url, timeout=10)  # timeout guards against a hanging connection

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the HTML content of the page using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all the h1 tags on the page
        h1_tags = soup.find_all('h1')

        # Print the extracted h1 tags
        print(f"List all the h1 tags from {url}:")
        for tag in h1_tags:
            print(tag)
    else:
        # Print an error message if the request was not successful
        print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

# Specify the URL to scrape
url = "https://en.wikipedia.org/wiki/Main_Page"

# Call the function to scrape h1 tags
scrape_h1_tags(url)

Output:

List all the h1 tags from https://en.wikipedia.org/wiki/Main_Page:
<h1 class="firstHeading mw-first-heading" id="firstHeading" style="display: none"><span class="mw-page-title-main">Main Page</span></h1>
<h1 id="Welcome_to_Wikipedia">Welcome to <a href="/wiki/Wikipedia" title="Wikipedia">Wikipedia</a></h1>

Explanation:

  • The script defines a function 'scrape_h1_tags()' that takes a URL as input and performs the following steps:
    • Sends an HTTP GET request to the URL using 'requests.get()'.
    • Checks if the request was successful by examining the status code.
    • Parses the HTML content of the page using 'BeautifulSoup'.
    • Finds all h1 tags on the page using 'soup.find_all('h1')'.
    • Prints the extracted 'h1' tags.
  • This solution is straightforward and works well for simple scraping tasks, but it's less modular and reusable for more complex scenarios.
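
If you only need the heading text rather than the full markup, BeautifulSoup's 'get_text()' strips the tags for you. The snippet below is a minimal sketch that parses a made-up static HTML string instead of fetching a page, so it runs without network access:

```python
from bs4 import BeautifulSoup

# A small static HTML snippet (hypothetical, for illustration only)
html = """
<html><body>
<h1 id="title">Main <span>Page</span></h1>
<h1>Welcome</h1>
<p>Not a heading</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text() concatenates the text of a tag and all its descendants
headings = [tag.get_text() for tag in soup.find_all("h1")]
print(headings)  # ['Main Page', 'Welcome']
```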

Solution 2: Using a Class-Based Approach for Reusability and Extensibility

Code:

# Solution 2: Using a Class-Based Approach for Reusability and Extensibility

import requests  # Used to send HTTP requests
from bs4 import BeautifulSoup  # Used for parsing HTML content

class WebScraper:
    """Class to handle web scraping operations"""

    def __init__(self, url):
        """Initialize the scraper with a URL"""
        self.url = url
        self.soup = None

    def fetch_content(self):
        """Fetch content from the website and initialize BeautifulSoup"""
        try:
            # Send an HTTP GET request to the specified URL
            response = requests.get(self.url, timeout=10)  # timeout guards against a hanging connection

            # Check if the request was successful
            if response.status_code == 200:
                # Initialize BeautifulSoup with the content
                self.soup = BeautifulSoup(response.text, 'html.parser')
                print(f"Successfully fetched content from {self.url}")
            else:
                print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
        except requests.RequestException as e:
            # Handle any exceptions that occur during the request
            print(f"An error occurred: {e}")

    def extract_h1_tags(self):
        """Extract and display all h1 tags from the page content"""
        if self.soup:
            # Find all the h1 tags on the page
            h1_tags = self.soup.find_all('h1')

            # Print the extracted h1 tags
            print(f"List all the h1 tags from {self.url}:")
            for tag in h1_tags:
                print(tag)
        else:
            print("No content fetched. Please call fetch_content() first.")

# Specify the URL to scrape
url = "https://en.wikipedia.org/wiki/Main_Page"

# Create an instance of the WebScraper class
scraper = WebScraper(url)

# Fetch content from the website
scraper.fetch_content()

# Extract and display h1 tags
scraper.extract_h1_tags()

Output:

Successfully fetched content from https://en.wikipedia.org/wiki/Main_Page
List all the h1 tags from https://en.wikipedia.org/wiki/Main_Page:
<h1 class="firstHeading mw-first-heading" id="firstHeading" style="display: none"><span class="mw-page-title-main">Main Page</span></h1>
<h1 id="Welcome_to_Wikipedia">Welcome to <a href="/wiki/Wikipedia" title="Wikipedia">Wikipedia</a></h1>

Explanation:

  • The script defines a 'WebScraper' class that encapsulates all web scraping functionality, making it more organized and easier to extend.
  • The '__init__' method initializes the class with the URL to be scraped.
  • The 'fetch_content' method sends an HTTP GET request, checks the response, and initializes 'BeautifulSoup' with the page content.
  • The 'extract_h1_tags' method extracts and prints all 'h1' tags from the page.
  • This approach allows for better reusability and extensibility, making it easier to add more features (e.g., extracting different tags, handling different URLs) in the future.
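
To illustrate the extensibility point, here is a minimal sketch of a generic 'extract_tags()' method that works for any tag name, not just 'h1'. The 'load_html()' helper is hypothetical — it stands in for 'fetch_content()' so the example runs offline on a static HTML string:

```python
from bs4 import BeautifulSoup

class TagScraper:
    """Sketch: reuse one parsed page to extract any tag name."""

    def __init__(self, url):
        self.url = url
        self.soup = None

    def load_html(self, html):
        # Hypothetical offline stand-in for fetch_content():
        # parse raw HTML directly instead of fetching it over HTTP.
        self.soup = BeautifulSoup(html, "html.parser")

    def extract_tags(self, tag_name):
        """Return all tags matching tag_name (empty list if no page is loaded)."""
        if self.soup is None:
            return []
        return self.soup.find_all(tag_name)

scraper = TagScraper("https://example.com")
scraper.load_html("<h1>Title</h1><h2>Sub</h2><h2>Other</h2>")
print([t.get_text() for t in scraper.extract_tags("h2")])  # ['Sub', 'Other']
```

Because parsing and extraction are separate steps, adding support for another tag (or a CSS selector) only requires a new method; the fetching logic stays untouched.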

Note:
Both solutions effectively scrape 'h1' tags from a specified website using 'requests' and 'BeautifulSoup'. Solution 1 is a functional, straightforward approach, while Solution 2 uses Object-Oriented Programming (OOP) principles for a more modular and maintainable design.

