Python Project - Basic URL Crawler for Extracting URLs
Basic URL Crawler:
Develop a program that crawls a website and extracts URLs.
Input values:
- Starting URL: The URL from which the crawler will start.
- Depth (optional): The number of levels the crawler will follow links from the starting URL.
- Optional Parameters:
- Domain restriction: Whether to restrict crawling to the same domain as the starting URL.
- File types to include or exclude (e.g., only HTML pages).
Output value:
- Extracted URLs: A list of URLs found during the crawling process.
- Status Messages:
- Progress updates.
- Error messages if the crawling fails (e.g., invalid URL, network issues).
Examples:

Example 1: Basic Crawling from a Starting URL
Input:
- Starting URL: http://example.com
Output:
- List of extracted URLs:
  http://example.com/page1
  http://example.com/page2
  http://example.com/about
  http://example.com/contact
Example Console Output:
Starting URL: http://example.com
Crawling depth: 1
Crawling http://example.com...
Found URL: http://example.com/page1
Found URL: http://example.com/page2
Found URL: http://example.com/about
Found URL: http://example.com/contact
Crawling completed.

Example 2: Crawling with Depth Restriction
Input:
- Starting URL: http://example.com
- Depth: 2
Output:
- List of extracted URLs:
  http://example.com/page1
  http://example.com/page2
  http://example.com/about
  http://example.com/contact
  http://example.com/page1/subpage1
  http://example.com/page2/subpage2
Example Console Output:
Starting URL: http://example.com
Crawling depth: 2
Crawling http://example.com...
Found URL: http://example.com/page1
Found URL: http://example.com/page2
Found URL: http://example.com/about
Found URL: http://example.com/contact
Crawling http://example.com/page1...
Found URL: http://example.com/page1/subpage1
Crawling http://example.com/page2...
Found URL: http://example.com/page2/subpage2
Crawling completed.

Example 3: Domain Restriction
Input:
- Starting URL: http://example.com
- Domain restriction: Yes
Output:
- List of extracted URLs (only from the same domain):
  http://example.com/page1
  http://example.com/page2
  http://example.com/about
  http://example.com/contact
Example Console Output:
Starting URL: http://example.com
Crawling depth: 1
Domain restriction: Yes
Crawling http://example.com...
Found URL: http://example.com/page1
Found URL: http://example.com/page2
Found URL: http://example.com/about
Found URL: http://example.com/contact
Crawling completed.
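The console traces above map naturally onto a small command-line interface. Below is a minimal sketch of how the specified inputs (starting URL, depth, domain restriction) could be passed to the crawler; the argument names, the module name crawler, and the crawl_website signature are assumptions based on the solution presented later in this section, not a required interface.

import argparse

# Hypothetical command-line wrapper; crawl_website is assumed to be the
# function defined in the solution below, imported from an assumed module.
from crawler import crawl_website

def main():
    parser = argparse.ArgumentParser(description="Basic URL crawler")
    parser.add_argument("starting_url", help="URL from which the crawler starts")
    parser.add_argument("--depth", type=int, default=1,
                        help="Number of link levels to follow (default: 1)")
    parser.add_argument("--same-domain", action="store_true",
                        help="Restrict crawling to the starting URL's domain")
    args = parser.parse_args()

    print(f"Starting URL: {args.starting_url}")
    print(f"Crawling depth: {args.depth}")
    if args.same_domain:
        print("Domain restriction: Yes")

    urls = crawl_website(args.starting_url, depth=args.depth,
                         domain_restriction=args.same_domain)
    print(f"Extracted {len(urls)} URLs.")

if __name__ == "__main__":
    main()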
Here are two different solutions for building a basic URL crawler that crawls a website and extracts URLs. The first solution uses the requests and BeautifulSoup libraries to perform a simple crawl, while the second solution uses the Scrapy framework for more advanced crawling capabilities.
Prerequisites for Both Solutions:
Install Required Python Libraries:
pip install requests beautifulsoup4 scrapy
Solution 1: Basic URL Crawler Using requests and BeautifulSoup
This solution uses the requests library to fetch web pages and BeautifulSoup to parse the HTML content and extract URLs.
Code:
# Solution 1: Basic URL Crawler Using 'requests' and 'BeautifulSoup'
import requests # Library for making HTTP requests
from bs4 import BeautifulSoup # Library for parsing HTML content
from urllib.parse import urljoin, urlparse # Functions to handle URL joining and parsing
def crawl_website(starting_url, depth=1, domain_restriction=True):
    """Crawl a website starting from a given URL to extract URLs."""
    # Set to store all discovered URLs
    visited_urls = set()

    # Helper function to recursively crawl the website
    def crawl(url, current_depth):
        """Recursively crawl the website up to the specified depth."""
        if current_depth > depth:  # Stop crawling if the maximum depth is reached
            return
        print(f"Crawling {url}...")
        try:
            # Send a GET request to the URL
            response = requests.get(url)
            # Parse the content using BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')
            # Iterate over all <a> tags to find URLs
            for link in soup.find_all('a', href=True):
                # Resolve the full URL
                full_url = urljoin(url, link['href'])
                # Check domain restriction
                if domain_restriction and urlparse(full_url).netloc != urlparse(starting_url).netloc:
                    continue
                # Add the discovered URL to the set
                if full_url not in visited_urls:
                    print(f"Found URL: {full_url}")
                    visited_urls.add(full_url)
                    # Recursively crawl the discovered URL
                    crawl(full_url, current_depth + 1)
        except requests.RequestException as e:
            print(f"Error crawling {url}: {e}")

    # Start crawling from the starting URL
    crawl(starting_url, 1)
    print("Crawling completed.")
    return visited_urls
# Example usage
starting_url = "https://www.python.org"
crawled_urls = crawl_website(starting_url, depth=2, domain_restriction=True)
print("Extracted URLs:", crawled_urls)
Output:
Crawling https://www.python.org...
Found URL: https://www.python.org#content
Crawling https://www.python.org#content...
Found URL: https://www.python.org#python-network
Found URL: https://www.python.org/
Found URL: https://www.python.org/psf/
Found URL: https://www.python.org/jobs/
Found URL: https://www.python.org/community-landing/
Found URL: https://www.python.org#top
Found URL: https://www.python.org#site-map
Found URL: https://www.python.org
Found URL: https://www.python.org/community/irc/
Found URL: https://www.python.org/about/
Found URL: https://www.python.org/about/apps/
Found URL: https://www.python.org/about/quotes/
Found URL: https://www.python.org/about/gettingstarted/
Found URL: https://www.python.org/about/help/
Found URL: https://www.python.org/downloads/
Found URL: https://www.python.org/downloads/source/
Found URL: https://www.python.org/downloads/windows/
... (output truncated)
Explanation:
- Function crawl_website:
- Takes a starting URL, depth, and domain restriction as inputs to control the crawling process.
- Uses a recursive helper function crawl to navigate through the website up to the specified depth.
- Recursive Crawling:
- For each URL, it sends a GET request to fetch the HTML content, uses BeautifulSoup to parse the HTML, and iterates over all <a> tags to find links.
- Resolves relative URLs to absolute URLs using urljoin (illustrated in the sketch after this list).
- Adds each new URL to a set to avoid duplicates and recursively crawls it if within the domain and depth limits.
- Domain Restriction and Error Handling:
- Checks if the domain restriction is enabled to restrict crawling to the same domain.
- Handles errors using requests.RequestException to manage network issues.
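As a small illustration of the URL handling described above, the sketch below shows how urljoin resolves relative links against the page they were found on and how the netloc field from urlparse supports the same-domain check; the example URLs are purely illustrative.

from urllib.parse import urljoin, urlparse

# Resolving relative links against the page they were found on
page_url = "http://example.com/docs/index.html"   # hypothetical page
print(urljoin(page_url, "page1.html"))            # -> http://example.com/docs/page1.html
print(urljoin(page_url, "/about"))                # -> http://example.com/about
print(urljoin(page_url, "https://other.org/x"))   # -> https://other.org/x (absolute links pass through)

# Same-domain check used for the domain restriction
start_netloc = urlparse("http://example.com").netloc
for candidate in ["http://example.com/contact", "https://other.org/x"]:
    same_domain = urlparse(candidate).netloc == start_netloc
    print(candidate, "-> same domain" if same_domain else "-> skipped")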