w3resource

Python Project: Extract Information from URLs


URL Analyzer: Build a program that analyzes and extracts information from a given URL.

Input values:

The user provides a URL to analyze.

Output value:

Information extracted from the URL, together with the analysis results.

Example:

Input values:
URL to analyze: https://www.example.com/about-us
Output value:
Analysis results:
- Domain: example.com
- Protocol: HTTPS
- Path: /about-us
- Query parameters: None
- HTTP status: 200 OK
- Page title: About Us - Example
- Meta description: Learn more about our company and our mission.

Input values:
URL to analyze: https://www.example.com/products?category=electronics
Output value:
Analysis results:
- Domain: example.com
- Protocol: HTTPS
- Path: /products
- Query parameters: category=electronics
- HTTP status: 200 OK
- Page title: Products - Example
- Meta description: Browse our wide selection of electronics products.

Input values:
URL to analyze: https://www.example.com/non-existent-page
Output value:
Analysis results:
- Domain: example.com
- Protocol: HTTPS
- Path: /non-existent-page
- Query parameters: None
- HTTP status: 404 Not Found
- Error message: The requested page does not exist.
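
The URL-derived fields in these examples (domain, protocol, path, query parameters) can be computed without any network access. The following is a minimal sketch using the standard library's urllib.parse. The helper name split_url and the choice to strip a leading "www." (so the result matches the "example.com" shown above) are assumptions made for illustration, not part of the project statement.

```python
from urllib.parse import urlparse, parse_qs

def split_url(url):
    """Return the URL-derived fields as a dict (hypothetical helper)."""
    parts = urlparse(url)
    host = parts.hostname or ""
    # Assumption: drop a leading "www." so the domain matches the examples above
    if host.startswith("www."):
        host = host[4:]
    return {
        "domain": host,
        "protocol": parts.scheme.upper(),
        "path": parts.path,
        "query": parse_qs(parts.query),  # e.g. {'category': ['electronics']}
    }

print(split_url("https://www.example.com/products?category=electronics"))
```

Note that parse_qs maps each parameter name to a list of values, because a name may appear more than once in a query string.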

Solution: Using requests and urllib Modules

Code:

# Import required modules
import requests  # For HTTP requests
from urllib.parse import urlparse, parse_qs  # For URL parsing

# Function to analyze a given URL
def analyze_url(url):
    # Parse URL to extract components
    parsed_url = urlparse(url)
    protocol = parsed_url.scheme.upper()  # Extract and convert protocol to uppercase
    domain = parsed_url.netloc  # Extract domain
    path = parsed_url.path  # Extract path
    query = parse_qs(parsed_url.query)  # Parse query parameters into a dictionary

    # Make a request to the URL to get status and HTML content
    try:
        response = requests.get(url, timeout=10)  # Time out instead of hanging on an unresponsive server
        status_code = response.status_code  # Extract HTTP status code
        html_content = response.text  # Get HTML content of the page

        # Extract page title and meta description if available
        page_title = extract_meta(html_content, "<title>", "</title>")
        meta_description = extract_meta(html_content, 'name="description" content="', '"')
        
        # Display the results
        print("Analysis results:")
        print(f"- Domain: {domain}")
        print(f"- Protocol: {protocol}")
        print(f"- Path: {path}")
        print(f"- Query parameters: {query if query else 'None'}")
        print(f"- HTTP status: {status_code}")
        print(f"- Page title: {page_title}")
        print(f"- Meta description: {meta_description if meta_description else 'None'}")

    except requests.RequestException as e:
        print(f"Error analyzing URL: {e}")

# Helper function to extract the text between start_tag and end_tag in the HTML
def extract_meta(html, start_tag, end_tag):
    start_index = html.find(start_tag)
    if start_index == -1:
        return None  # Start marker not found
    start_index += len(start_tag)
    end_index = html.find(end_tag, start_index)
    if end_index == -1:
        return None  # End marker not found
    return html[start_index:end_index].strip()

# Example usage
analyze_url("https://www.w3resource.com/")
# Uncomment to analyze a second page:
# analyze_url("https://www.w3resource.com/privacy/")

Output:

Analysis results:
- Domain: www.w3resource.com
- Protocol: HTTPS
- Path: /
- Query parameters: None
- HTTP status: 200
- Page title: Web development tutorials | w3resource
- Meta description: Web development tutorials on HTML, CSS, JS, PHP, SQL, MySQL, PostgreSQL, MongoDB, JSON and more.

Analysis results:
- Domain: www.w3resource.com
- Protocol: HTTPS
- Path: /privacy/
- Query parameters: None
- HTTP status: 404
- Page title: 404 Not Found
- Meta description: None
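
The output above shows only the numeric status code, whereas the project examples show a code with its reason phrase ("200 OK", "404 Not Found"). One way to produce that form is the standard library's http.HTTPStatus enum, sketched below. The helper name describe_status is an assumption for illustration.

```python
from http import HTTPStatus

def describe_status(code):
    """Return e.g. '404 Not Found' for a numeric status code (hypothetical helper)."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Non-standard codes have no registered phrase; fall back to the number
        return str(code)

print(describe_status(200))
print(describe_status(404))
```

Inside analyze_url, this could replace the bare {status_code} in the status line. The requests response object also carries the server's own reason text in response.reason, which may differ from the standard phrase.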

Explanation:

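  • URL Parsing: urlparse splits the URL into protocol (scheme), domain (netloc), and path; parse_qs turns the query string into a dictionary.
  • HTTP Request: requests.get sends a GET request; the response supplies the HTTP status code and the page's HTML.
  • Metadata Extraction: extract_meta locates the page title and meta description by searching the HTML for the text that surrounds them.
  • Error Handling: Network failures (for example, connection errors) raise requests.RequestException, which is caught and reported instead of crashing the program. A 404 response is not an exception; it is reported through the status code.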
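
The string search used for metadata extraction depends on the exact attribute order and quoting in the page, so it can miss a description written as content="..." name="description". As a possible alternative (not the method used in the solution above), here is a minimal sketch built on the standard library's html.parser. The names MetaExtractor and extract_title_and_description are assumptions for illustration.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collects the first <title> text and the description <meta> content."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.description = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title" and self.title is None:
            # Only the first <title> is recorded
            self._in_title = True
            self.title = ""
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def extract_title_and_description(html):
    """Return (title, description); either may be None if absent."""
    parser = MetaExtractor()
    parser.feed(html)
    title = parser.title.strip() if parser.title else None
    return title, parser.description

sample = ('<html><head><title> About Us - Example </title>'
          '<meta content="Learn more." name="description"></head></html>')
print(extract_title_and_description(sample))
```

The sample HTML deliberately puts content before name to show that the parser-based approach does not rely on attribute order.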

