
Build a URL Scraper in Python to Extract URLs from Webpages


URL Scraper:

Build a program that extracts URLs from a given webpage.

Input values:

User provides the URL of a webpage from which URLs need to be extracted.

Output value:

List of URLs extracted from the given webpage.

Example:

Input values:
Enter the URL of the webpage: https://www.example.com
Output value:
URLs extracted from https://www.example.com:
1. https://www.iana.org/domains/example

Solution 1: URL Scraper Using requests and BeautifulSoup

This solution uses the requests library to fetch the webpage content and BeautifulSoup from bs4 to parse the HTML and extract all absolute URLs found in anchor (<a>) tags.

Code:

import requests  # Import requests to make HTTP requests
from bs4 import BeautifulSoup  # Import BeautifulSoup for HTML parsing

def extract_urls(webpage_url):
    """Extracts all URLs from a given webpage."""
    try:
        # Send a GET request to the webpage
        response = requests.get(webpage_url)
        response.raise_for_status()  # Raise an error if the request was unsuccessful

        # Parse the webpage content using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Find all anchor tags with href attribute
        anchor_tags = soup.find_all('a', href=True)
        
        # Keep only absolute URLs (hrefs that start with "http")
        urls = [tag['href'] for tag in anchor_tags if tag['href'].startswith('http')]
        
        # Print the extracted URLs
        print(f"URLs extracted from {webpage_url}:")
        for idx, url in enumerate(urls, 1):
            print(f"{idx}. {url}")

    except requests.exceptions.RequestException as e:
        print(f"Error fetching webpage: {e}")

# Input: Get URL from user
webpage_url = input("Enter the URL of the webpage: ")
extract_urls(webpage_url)  # Call the function to extract URLs 

Output:

Enter the URL of the webpage: https://www.example.com
URLs extracted from https://www.example.com:
1. https://www.iana.org/domains/example

Explanation:

  • Imports requests for making HTTP requests to fetch webpage content.
  • Imports BeautifulSoup from bs4 for parsing HTML content.
  • extract_urls(webpage_url) function:
    • Sends a GET request to the provided URL.
    • Parses the response using BeautifulSoup to find all anchor tags (<a>).
    • Extracts URLs that start with "http" to keep only absolute URLs; relative links are dropped (a variant that resolves them is sketched after this explanation).
    • Prints the list of extracted URLs.
  • Error Handling:
    • Catches and prints any exceptions related to HTTP requests.
  • Input from User:
    • Takes a URL as input from the user and calls the extract_urls() function.
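
The startswith('http') filter in Solution 1 keeps only absolute URLs, so relative links such as "/about" are silently dropped. If you also want those, a minimal variant can resolve them against the page URL with urllib.parse.urljoin. This is only a sketch; the helper name extract_all_urls is illustrative and not part of the original solution.

import requests  # Fetch the webpage
from bs4 import BeautifulSoup  # Parse the HTML
from urllib.parse import urljoin  # Resolve relative hrefs against the page URL

def extract_all_urls(webpage_url):
    """Returns every link on the page, with relative hrefs made absolute."""
    response = requests.get(webpage_url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # urljoin() turns an href such as "/about" into "https://www.example.com/about"
    return [urljoin(webpage_url, tag['href']) for tag in soup.find_all('a', href=True)]

Returning the list instead of printing it also makes the function easier to reuse and test.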

Solution 2: URL Scraper Using urllib and re (Regular Expressions)

This solution uses the urllib library to fetch webpage content and regular expressions to extract URLs directly from the HTML.

Code:

import urllib.request  # Import urllib to handle HTTP requests
import urllib.error  # Import urllib.error so URLError can be caught explicitly
import re  # Import re for regular expression matching

def extract_urls(webpage_url):
    """Extracts all URLs from a given webpage."""
    try:
        # Open the URL and read the webpage content
        with urllib.request.urlopen(webpage_url) as response:
            html_content = response.read().decode('utf-8')  # Decode the content to a string format
        
        # Regular expression to find all URLs in the HTML content
        urls = re.findall(r'href=["\'](http[s]?://[^\s"\'<>]+)["\']', html_content)
        
        # Print the extracted URLs
        print(f"URLs extracted from {webpage_url}:")
        for idx, url in enumerate(urls, 1):
            print(f"{idx}. {url}")

    except urllib.error.URLError as e:
        print(f"Error fetching webpage: {e}")

# Input: Get URL from user
webpage_url = input("Enter the URL of the webpage: ")
extract_urls(webpage_url)  # Call the function to extract URLs

Output:

Enter the URL of the webpage: https://www.example.com
URLs extracted from https://www.example.com:
1. https://www.iana.org/domains/example

Explanation:

  • Imports urllib.request to fetch webpage content over HTTP and urllib.error for the URLError exception.
  • Imports re for using regular expressions to match patterns in the HTML content.
  • extract_urls(webpage_url) function:
    • Opens the URL using urllib.request.urlopen() and reads the webpage content.
    • Uses re.findall() with a regular expression to find all http/https URLs inside quoted href attributes (a small self-test of the pattern follows this list).
    • Prints the list of extracted URLs.
  • Error Handling:
    • Catches and prints any exceptions related to URL errors.
  • Input from User:
    • Takes a URL as input from the user and calls the extract_urls() function.
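
As a quick, self-contained check of what the regular expression matches, you can run the same pattern against a small HTML snippet. The sample_html string below is made up purely for illustration.

import re  # Regular expression matching

pattern = r'href=["\'](http[s]?://[^\s"\'<>]+)["\']'  # Same pattern as in Solution 2
sample_html = '<a href="https://www.iana.org/domains/example">More information...</a>'

# findall() returns only the captured group, i.e. the URL itself
print(re.findall(pattern, sample_html))
# Output: ['https://www.iana.org/domains/example']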

Summary:

Solution 1 (Requests and BeautifulSoup): Uses requests to fetch the page and BeautifulSoup to parse it with a proper HTML parser. This method is easier to read and maintain and handles complex or irregular HTML structures more reliably.

Solution 2 (urllib and Regular Expressions): Uses urllib and regular expressions, a lightweight, standard-library-only approach that works well for simple URL extraction. However, regular expressions do not actually parse HTML, so this approach can miss links that BeautifulSoup would find, as the sketch below shows.
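
As one illustration of that difference, unquoted href values are legal HTML but are not matched by the regular expression above, while BeautifulSoup still extracts the link. The html snippet below is contrived for the example.

import re  # Regular expression matching
from bs4 import BeautifulSoup  # HTML parsing

html = '<a href=https://www.iana.org/domains/example>More information...</a>'

# The regex requires a quote after href=, so it finds nothing here
print(re.findall(r'href=["\'](http[s]?://[^\s"\'<>]+)["\']', html))  # []

# BeautifulSoup parses the unquoted attribute value and still returns the URL
soup = BeautifulSoup(html, 'html.parser')
print([tag['href'] for tag in soup.find_all('a', href=True)])
# Output: ['https://www.iana.org/domains/example']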


