Build a URL Scraper in Python to Extract URLs from Webpages
URL Scraper:
Build a program that extracts URLs from a given webpage.
Input values:
User provides the URL of a webpage from which URLs need to be extracted.
Output value:
List of URLs extracted from the given webpage.
Example:
Input values:
Enter the URL of the webpage: https://www.example.com
Output value:
URLs extracted from https://www.example.com:
1. https://www.iana.org/domains/example
Solution 1: URL Scraper Using requests and BeautifulSoup
This solution uses the requests library to fetch the webpage content and BeautifulSoup from bs4 to parse the HTML and extract all URLs.
Code:
import requests  # Import requests to make HTTP requests
from bs4 import BeautifulSoup  # Import BeautifulSoup for HTML parsing

def extract_urls(webpage_url):
    """Extracts all URLs from a given webpage."""
    try:
        # Send a GET request to the webpage
        response = requests.get(webpage_url)
        response.raise_for_status()  # Raise an error if the request was unsuccessful

        # Parse the webpage content using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all anchor tags with an href attribute
        anchor_tags = soup.find_all('a', href=True)

        # Extract absolute URLs from the anchor tags
        urls = [tag['href'] for tag in anchor_tags if tag['href'].startswith('http')]

        # Print the extracted URLs
        print(f"URLs extracted from {webpage_url}:")
        for idx, url in enumerate(urls, 1):
            print(f"{idx}. {url}")
    except requests.exceptions.RequestException as e:
        print(f"Error fetching webpage: {e}")

# Input: Get URL from user
webpage_url = input("Enter the URL of the webpage: ")
extract_urls(webpage_url)  # Call the function to extract URLs
Output:
Enter the URL of the webpage: https://www.example.com
URLs extracted from https://www.example.com:
1. https://www.iana.org/domains/example
Explanation:
- Imports requests for making HTTP requests to fetch webpage content.
- Imports BeautifulSoup from bs4 for parsing HTML content.
- extract_urls(webpage_url) function:
- Sends a GET request to the provided URL.
- Parses the response using BeautifulSoup to find all anchor tags (<a>).
- Extracts only URLs that start with "http", so relative links are dropped (a sketch after this list shows how they could be resolved as well).
- Prints the list of extracted URLs.
- Error Handling:
- Catches and prints any exceptions related to HTTP requests.
- Input from User:
- Takes a URL as input from the user and calls the extract_urls() function.
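Because of the startswith("http") filter, Solution 1 keeps only absolute links and silently drops relative ones such as /about. If those are wanted as well, one option is to resolve them against the page URL with urllib.parse.urljoin(). The following is a minimal sketch of that variation (the function name extract_all_urls is only illustrative, not part of the solution above):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # Resolve relative links against the page URL

def extract_all_urls(webpage_url):
    """Sketch: collect absolute and relative links by resolving them first."""
    response = requests.get(webpage_url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    urls = []
    for tag in soup.find_all('a', href=True):
        absolute = urljoin(webpage_url, tag['href'])  # '/about' -> 'https://www.example.com/about'
        if absolute.startswith(('http://', 'https://')):  # Skip mailto:, javascript:, etc.
            urls.append(absolute)
    return urls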
Solution 2: URL Scraper Using urllib and re (Regular Expressions)
This solution uses the urllib library to fetch webpage content and regular expressions to extract URLs directly from the HTML.
Code:
import urllib.request  # Import urllib.request to handle HTTP requests
import urllib.error  # Import urllib.error so URLError can be caught explicitly
import re  # Import re for regular expression matching

def extract_urls(webpage_url):
    """Extracts all URLs from a given webpage."""
    try:
        # Open the URL and read the webpage content
        with urllib.request.urlopen(webpage_url) as response:
            html_content = response.read().decode('utf-8')  # Decode the content to a string

        # Regular expression to find all absolute URLs in href attributes
        urls = re.findall(r'href=["\'](http[s]?://[^\s"\'<>]+)["\']', html_content)

        # Print the extracted URLs
        print(f"URLs extracted from {webpage_url}:")
        for idx, url in enumerate(urls, 1):
            print(f"{idx}. {url}")
    except urllib.error.URLError as e:
        print(f"Error fetching webpage: {e}")

# Input: Get URL from user
webpage_url = input("Enter the URL of the webpage: ")
extract_urls(webpage_url)  # Call the function to extract URLs
Output:
Enter the URL of the webpage: https://www.example.com
URLs extracted from https://www.example.com:
1. https://www.iana.org/domains/example
Explanation:
- Imports urllib.request to handle HTTP requests and fetch webpage content.
- Imports re for using regular expressions to match patterns in the HTML content.
- extract_urls(webpage_url) function:
- Opens the URL using urllib.request.urlopen() and reads the webpage content.
- Uses re.findall() with a regular expression to find all absolute URLs in the HTML content (the sketch after this list shows what the pattern does and does not match).
- Prints the list of extracted URLs.
- Error Handling:
- Catches and prints any exceptions related to URL errors.
- Input from User:
- Takes a URL as input from the user and calls the extract_urls() function.
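The regular expression matches only quoted href values that begin with http or https, so relative links and mailto: links are skipped. A quick way to see this is to run the same pattern against a small, made-up HTML snippet, with no network access required:

import re

# Sample HTML fragment, invented purely for illustration
sample_html = """
<a href="https://www.iana.org/domains/example">More information</a>
<a href='/about'>About</a>
<a href="mailto:admin@example.com">Contact</a>
"""

pattern = r'href=["\'](http[s]?://[^\s"\'<>]+)["\']'
print(re.findall(pattern, sample_html))
# Prints: ['https://www.iana.org/domains/example'] (relative and mailto links are skipped)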
Summary:
Solution 1 (requests and BeautifulSoup): Uses requests to fetch the page and BeautifulSoup to parse the HTML and extract URLs. This method is easier to read and maintain and copes better with complex or poorly formed HTML.
Solution 2 (urllib and Regular Expressions): Uses urllib and regular expressions, which is a lightweight approach that works well for simple URL extraction. However, it may not handle complex HTML structures as robustly as BeautifulSoup.
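Both solutions print a URL each time it appears on the page, so repeated links show up more than once. If unique results are preferred, a small helper such as the one sketched below (the name dedupe_urls is illustrative) can be applied to the urls list in either solution before printing; it removes duplicates while keeping the original order.

def dedupe_urls(urls):
    """Return the URLs with duplicates removed, preserving the original order."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:  # Keep only the first occurrence of each URL
            seen.add(url)
            unique.append(url)
    return unique

print(dedupe_urls(["https://a.example", "https://b.example", "https://a.example"]))
# Prints: ['https://a.example', 'https://b.example']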
It would be nice if you could share this link in any developer community or anywhere else where other developers may find this content. Thanks.
https://w3resource.com/projects/python/python-url-scraper-project.php