
Build a URL Scraper in Python to Extract URLs from Webpages


URL Scraper:

Build a program that extracts URLs from a given webpage.

Input values:

User provides the URL of a webpage from which URLs need to be extracted.

Output value:

List of URLs extracted from the given webpage.

Example:

Input values:
Enter the URL of the webpage: https://www.example.com
Output value:
URLs extracted from https://www.example.com:
1. https://www.iana.org/domains/example

Solution 1: URL Scraper Using requests and BeautifulSoup

This solution uses the requests library to fetch the webpage content and BeautifulSoup from bs4 to parse the HTML and extract all absolute URLs found in anchor (<a>) tags.

Code:

import requests  # Import requests to make HTTP requests
from bs4 import BeautifulSoup  # Import BeautifulSoup for HTML parsing

def extract_urls(webpage_url):
    """Extracts all URLs from a given webpage."""
    try:
        # Send a GET request to the webpage
        response = requests.get(webpage_url)
        response.raise_for_status()  # Raise an error if the request was unsuccessful

        # Parse the webpage content using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Find all anchor tags with href attribute
        anchor_tags = soup.find_all('a', href=True)
        
        # Keep only absolute URLs (hrefs that start with "http")
        urls = [tag['href'] for tag in anchor_tags if tag['href'].startswith('http')]
        
        # Print the extracted URLs
        print(f"URLs extracted from {webpage_url}:")
        for idx, url in enumerate(urls, 1):
            print(f"{idx}. {url}")

    except requests.exceptions.RequestException as e:
        print(f"Error fetching webpage: {e}")

# Input: Get URL from user
webpage_url = input("Enter the URL of the webpage: ")
extract_urls(webpage_url)  # Call the function to extract URLs 

Output:

Enter the URL of the webpage: https://www.example.com
URLs extracted from https://www.example.com:
1. https://www.iana.org/domains/example

Explanation:

  • Imports requests for making HTTP requests to fetch webpage content.
  • Imports BeautifulSoup from bs4 for parsing HTML content.
  • extract_urls(webpage_url) function:
    • Sends a GET request to the provided URL.
    • Parses the response using BeautifulSoup to find all anchor tags (<a>).
    • Extracts URLs that start with "http" to keep only absolute URLs; relative links are dropped (a variant that resolves them is sketched after this explanation).
    • Prints the list of extracted URLs.
  • Error Handling:
    • Catches and prints any exceptions related to HTTP requests.
  • Input from User:
    • Takes a URL as input from the user and calls the extract_urls() function.
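
The startswith('http') filter in Solution 1 keeps only absolute URLs, so relative links such as "/about" are silently dropped. If you also want those, a minimal variant can resolve them against the page URL with urllib.parse.urljoin. This is only a sketch; the helper name extract_all_urls is illustrative and not part of the original solution.

import requests  # Fetch the webpage
from bs4 import BeautifulSoup  # Parse the HTML
from urllib.parse import urljoin  # Resolve relative hrefs against the page URL

def extract_all_urls(webpage_url):
    """Returns every link on the page, with relative hrefs made absolute."""
    response = requests.get(webpage_url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # urljoin() turns an href such as "/about" into "https://www.example.com/about"
    return [urljoin(webpage_url, tag['href']) for tag in soup.find_all('a', href=True)]

Returning the list instead of printing it also makes the function easier to reuse and test.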

Solution 2: URL Scraper Using urllib and re (Regular Expressions)

This solution uses the urllib library to fetch webpage content and regular expressions to extract URLs directly from the HTML.

Code:

import urllib.request  # Import urllib to handle HTTP requests
import urllib.error  # Import urllib.error so URLError can be caught explicitly
import re  # Import re for regular expression matching

def extract_urls(webpage_url):
    """Extracts all URLs from a given webpage."""
    try:
        # Open the URL and read the webpage content
        with urllib.request.urlopen(webpage_url) as response:
            html_content = response.read().decode('utf-8')  # Decode the content to a string format
        
        # Regular expression to find all URLs in the HTML content
        urls = re.findall(r'href=["\'](http[s]?://[^\s"\'<>]+)["\']', html_content)
        
        # Print the extracted URLs
        print(f"URLs extracted from {webpage_url}:")
        for idx, url in enumerate(urls, 1):
            print(f"{idx}. {url}")

    except urllib.error.URLError as e:
        print(f"Error fetching webpage: {e}")

# Input: Get URL from user
webpage_url = input("Enter the URL of the webpage: ")
extract_urls(webpage_url)  # Call the function to extract URLs

Output:

Enter the URL of the webpage: https://www.example.com
URLs extracted from https://www.example.com:
1. https://www.iana.org/domains/example

Explanation:

  • Imports urllib.request to fetch webpage content over HTTP and urllib.error for the URLError exception.
  • Imports re for using regular expressions to match patterns in the HTML content.
  • extract_urls(webpage_url) function:
    • Opens the URL using urllib.request.urlopen() and reads the webpage content.
    • Uses re.findall() with a regular expression to find all http/https URLs inside quoted href attributes (a small self-test of the pattern follows this list).
    • Prints the list of extracted URLs.
  • Error Handling:
    • Catches and prints any exceptions related to URL errors.
  • Input from User:
    • Takes a URL as input from the user and calls the extract_urls() function.
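
As a quick, self-contained check of what the regular expression matches, you can run the same pattern against a small HTML snippet. The sample_html string below is made up purely for illustration.

import re  # Regular expression matching

pattern = r'href=["\'](http[s]?://[^\s"\'<>]+)["\']'  # Same pattern as in Solution 2
sample_html = '<a href="https://www.iana.org/domains/example">More information...</a>'

# findall() returns only the captured group, i.e. the URL itself
print(re.findall(pattern, sample_html))
# Output: ['https://www.iana.org/domains/example']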

Summary:

Solution 1 (Requests and BeautifulSoup): Uses requests to fetch the page and BeautifulSoup to parse it with a proper HTML parser. This method is easier to read and maintain and handles complex or irregular HTML structures more reliably.

Solution 2 (urllib and Regular Expressions): Uses urllib and regular expressions, a lightweight, standard-library-only approach that works well for simple URL extraction. However, regular expressions do not actually parse HTML, so this approach can miss links that BeautifulSoup would find, as the sketch below shows.
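
As one illustration of that difference, unquoted href values are legal HTML but are not matched by the regular expression above, while BeautifulSoup still extracts the link. The html snippet below is contrived for the example.

import re  # Regular expression matching
from bs4 import BeautifulSoup  # HTML parsing

html = '<a href=https://www.iana.org/domains/example>More information...</a>'

# The regex requires a quote after href=, so it finds nothing here
print(re.findall(r'href=["\'](http[s]?://[^\s"\'<>]+)["\']', html))  # []

# BeautifulSoup parses the unquoted attribute value and still returns the URL
soup = BeautifulSoup(html, 'html.parser')
print([tag['href'] for tag in soup.find_all('a', href=True)])
# Output: ['https://www.iana.org/domains/example']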


