Python Project - Basic URL Crawler for Extracting URLs
Basic URL Crawler:
Develop a program that crawls a website and extracts URLs.
Input values:
- Starting URL: The URL from which the crawler will start.
- Depth (optional): The number of levels the crawler will follow links from the starting URL.
- Optional parameters:
  - Domain restriction: Whether to restrict crawling to the same domain as the starting URL.
  - File types to include or exclude (e.g., only HTML pages).
Output value:
- Extracted URLs: A list of URLs found during the crawling process.
- Status Messages:
- Progress updates.
- Error messages if the crawling fails (e.g., invalid URL, network issues).
Examples:
Example 1: Basic Crawling from a Starting URL
Input:
- Starting URL: http://example.com
Output:
- List of extracted URLs:
  http://example.com/page1
  http://example.com/page2
  http://example.com/about
  http://example.com/contact
Example Console Output:
Starting URL: http://example.com
Crawling depth: 1
Crawling http://example.com...
Found URL: http://example.com/page1
Found URL: http://example.com/page2
Found URL: http://example.com/about
Found URL: http://example.com/contact
Crawling completed.

Example 2: Crawling with Depth Restriction
Input:
- Starting URL: http://example.com
- Depth: 2
Output:
- List of extracted URLs:
  http://example.com/page1
  http://example.com/page2
  http://example.com/about
  http://example.com/contact
  http://example.com/page1/subpage1
  http://example.com/page2/subpage2
Example Console Output:
Starting URL: http://example.com
Crawling depth: 2
Crawling http://example.com...
Found URL: http://example.com/page1
Found URL: http://example.com/page2
Found URL: http://example.com/about
Found URL: http://example.com/contact
Crawling http://example.com/page1...
Found URL: http://example.com/page1/subpage1
Crawling http://example.com/page2...
Found URL: http://example.com/page2/subpage2
Crawling completed.

Example 3: Domain Restriction
Input:
- Starting URL: http://example.com
- Domain restriction: Yes
Output:
- List of extracted URLs (only from the same domain):
  http://example.com/page1
  http://example.com/page2
  http://example.com/about
  http://example.com/contact
Example Console Output:
Starting URL: http://example.com
Crawling depth: 1
Domain restriction: Yes
Crawling http://example.com...
Found URL: http://example.com/page1
Found URL: http://example.com/page2
Found URL: http://example.com/about
Found URL: http://example.com/contact
Crawling completed.
Here are two different solutions for building a basic URL crawler that crawls a website and extracts URLs. The first solution uses the requests and BeautifulSoup libraries to perform a simple crawl, while the second solution uses the Scrapy framework for more advanced crawling capabilities.
Prerequisites for Both Solutions:
Install Required Python Libraries:
pip install requests beautifulsoup4 scrapy
Solution 1: Basic URL Crawler Using requests and BeautifulSoup
This solution uses the requests library to fetch web pages and BeautifulSoup to parse the HTML content and extract URLs.
Code:
# Solution 1: Basic URL Crawler Using 'requests' and 'BeautifulSoup'
import requests  # Library for making HTTP requests
from bs4 import BeautifulSoup  # Library for parsing HTML content
from urllib.parse import urljoin, urlparse  # Functions to handle URL joining and parsing

def crawl_website(starting_url, depth=1, domain_restriction=True):
    """Crawl a website starting from a given URL to extract URLs."""
    # Set to store all discovered URLs
    visited_urls = set()

    # Helper function to recursively crawl the website
    def crawl(url, current_depth):
        """Recursively crawl the website up to the specified depth."""
        if current_depth > depth:  # Stop crawling if the maximum depth is reached
            return
        print(f"Crawling {url}...")
        try:
            # Send a GET request to the URL (with a timeout so a slow server can't hang the crawl)
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # Treat HTTP error statuses (4xx/5xx) as failures
            # Parse the content using BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')
            # Iterate over all <a> tags that have an href attribute
            for link in soup.find_all('a', href=True):
                # Resolve relative links to absolute URLs
                full_url = urljoin(url, link['href'])
                # Check domain restriction
                if domain_restriction and urlparse(full_url).netloc != urlparse(starting_url).netloc:
                    continue
                # Add the discovered URL to the set (skip URLs we have already seen)
                if full_url not in visited_urls:
                    print(f"Found URL: {full_url}")
                    visited_urls.add(full_url)
                    # Recursively crawl the newly discovered URL
                    crawl(full_url, current_depth + 1)
        except requests.RequestException as e:
            print(f"Error crawling {url}: {e}")

    # Start crawling from the starting URL
    crawl(starting_url, 1)
    print("Crawling completed.")
    return visited_urls

# Example usage
starting_url = "https://www.python.org"
crawled_urls = crawl_website(starting_url, depth=2, domain_restriction=True)
print("Extracted URLs:", crawled_urls)
Output:
Crawling https://www.python.org...
Found URL: https://www.python.org#content
Crawling https://www.python.org#content...
Found URL: https://www.python.org#python-network
Found URL: https://www.python.org/
Found URL: https://www.python.org/psf/
Found URL: https://www.python.org/jobs/
Found URL: https://www.python.org/community-landing/
Found URL: https://www.python.org#top
Found URL: https://www.python.org#site-map
Found URL: https://www.python.org
Found URL: https://www.python.org/community/irc/
Found URL: https://www.python.org/about/
Found URL: https://www.python.org/about/apps/
Found URL: https://www.python.org/about/quotes/
Found URL: https://www.python.org/about/gettingstarted/
Found URL: https://www.python.org/about/help/
Found URL: https://www.python.org/downloads/
Found URL: https://www.python.org/downloads/source/
Found URL: https://www.python.org/downloads/windows/
... (output truncated)
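Note that the output treats fragment-only links such as https://www.python.org#content as distinct URLs and crawls them separately, even though they point at the same page. A small sketch (not part of the solution above) of how urllib.parse.urldefrag could normalize such links before deduplication:

```python
from urllib.parse import urljoin, urldefrag

# A fragment-only href resolves to the base page plus a #fragment suffix
base = "https://www.python.org"
full_url = urljoin(base, "#content")
print(full_url)  # https://www.python.org#content

# urldefrag splits off the fragment, so both forms normalize to one URL
normalized, fragment = urldefrag(full_url)
print(normalized)  # https://www.python.org
print(fragment)    # content
```

Adding this normalization to the crawler's link-resolution step would prevent the same page from being fetched once per anchor link.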
Explanation:
- Function crawl_website:
- Takes a starting URL, depth, and domain restriction as inputs to control the crawling process.
- Uses a recursive helper function crawl to navigate through the website up to the specified depth.
- Recursive Crawling:
- For each URL, it sends a GET request to fetch the HTML content, uses BeautifulSoup to parse the HTML, and iterates over all <a> tags to find links.
- Resolves relative URLs to absolute URLs using urljoin.
- Adds each new URL to a set to avoid duplicates and recursively crawls it if within the domain and depth limits.
- Domain Restriction and Error Handling:
- Checks if the domain restriction is enabled to restrict crawling to the same domain.
- Handles errors using requests.RequestException to manage network issues.
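The URL resolution and domain check described above can be seen in isolation. A minimal sketch (the example links are hypothetical) of how urljoin resolves relative hrefs and how comparing urlparse(...).netloc values enforces the same-domain restriction:

```python
from urllib.parse import urljoin, urlparse

start = "http://example.com"
start_domain = urlparse(start).netloc  # 'example.com'

# Relative links are resolved against the page they were found on;
# absolute links pass through urljoin unchanged
links = ["/about", "contact.html", "https://other-site.org/page"]
for href in links:
    full_url = urljoin(start + "/", href)
    same_domain = urlparse(full_url).netloc == start_domain
    print(full_url, "-> crawl" if same_domain else "-> skip")
```

Here the first two links resolve to http://example.com/about and http://example.com/contact.html and pass the check, while the off-domain link is skipped.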
https://w3resource.com/projects/python/python-basic-url-crawler-project.php