Java Thread Programming - Simultaneous Website Crawling
Write a Java program to implement a concurrent web crawler that crawls multiple websites simultaneously using threads.
jsoup: Java HTML Parser
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Download jsoup from the official project site (jsoup.org) and add the JAR to your classpath before compiling the solution.
Sample Solution:
Java Code:
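import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Web_Crawler {
  private static final int MAX_DEPTH = 2; // Maximum depth for crawling
  private static final int MAX_THREADS = 4; // Maximum number of threads
  // Thread-safe set, because several pool threads record visited URLs at once
  private final Set<String> visitedUrls = ConcurrentHashMap.newKeySet();

  public void crawl(String url, int depth) {
    // Stop at the depth limit; add() returns false if the URL was already visited
    if (depth > MAX_DEPTH || !visitedUrls.add(url)) {
      return;
    }
    System.out.println("Crawling: " + url);
    try {
      Document document = Jsoup.connect(url).get();
      processPage(document);
      // Follow every hyperlink on the page, one level deeper
      Elements links = document.select("a[href]");
      for (Element link : links) {
        String nextUrl = link.absUrl("href");
        // absUrl() returns an empty string when the link cannot be resolved
        if (!nextUrl.isEmpty()) {
          crawl(nextUrl, depth + 1);
        }
      }
    } catch (IOException e) {
      System.err.println("Failed to fetch " + url + ": " + e.getMessage());
    }
  }

  public void processPage(Document document) {
    // Process the web page content as needed
    System.out.println("Processing: " + document.title());
  }

  public void startCrawling(String[] seedUrls) {
    ExecutorService executor = Executors.newFixedThreadPool(MAX_THREADS);
    for (String url : seedUrls) {
      executor.execute(() -> crawl(url, 0));
    }
    // Stop accepting new tasks so that awaitTermination() can return
    executor.shutdown();
    try {
      executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // Preserve the interrupt status
    }
    System.out.println("Crawling completed.");
  }

  public static void main(String[] args) {
    // Add URLs here
    String[] seedUrls = {
      // Placeholder seed URLs; replace with the sites to crawl
      "https://example.com",
      "https://example.org"
    };
    Web_Crawler webCrawler = new Web_Crawler();
    webCrawler.startCrawling(seedUrls);
  }
}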
Sample Output:
Crawling:
Crawling:
Processing: Wikipedia
Crawling:
Processing: Example Domain
Crawling:
Processing: Wikipedia, the free encyclopedia
Crawling:
Processing: Wikipedia, the free encyclopedia
Crawling:
Processing: Wikipedia, the free encyclopedia
Crawling:
Processing: Wikipedia:Contents - Wikipedia
Crawling:
Processing: Portal:Current events - Wikipedia
Crawling:
Processing: IANA-managed Reserved Domains
Crawling:
Processing: Papilio birchallii - Wikipedia
Crawling:
Processing: Wikipedia:About - Wikipedia
Crawling:
Processing: Internet Assigned Numbers Authority
Crawling:
Processing: Wikipedia:Contact us - Wikipedia
Crawling:
Processing: Domain Name Services
Crawling:
Processing: Make your donation now - Wikimedia Foundation
Crawling:
Processing: Help:Contents - Wikipedia
Crawling:
Processing: Help:Introduction - Wikipedia

In the above exercise,
- The Web_Crawler class crawls web pages. It has two constants:
  - MAX_DEPTH: The maximum depth to which the crawler follows links from a seed page.
  - MAX_THREADS: The maximum number of threads used for crawling.
- The class maintains a Set<String> called visitedUrls to keep track of the URLs visited during crawling.
- The crawl(String url, int depth) method crawls a given URL up to a specified depth. If the current depth exceeds MAX_DEPTH or the URL has already been visited, the method returns. Otherwise, it adds the URL to the visitedUrls set, prints a message indicating that the URL is being crawled, and retrieves the web page using the jsoup library. It then passes the page to processPage(), selects every a[href] link on the page, and calls itself recursively on each link's absolute URL with depth + 1.
- The processPage(Document document) method represents web page processing. In this example, it simply prints the document title. You can customize this method to perform specific operations on web page content.
- The startCrawling(String[] seedUrls) method initiates the crawling process. It creates a fixed-size thread pool using ExecutorService and Executors.newFixedThreadPool() with a maximum number of threads specified by MAX_THREADS. It then submits crawl tasks for each seed URL in the seedUrls array to the thread pool for concurrent execution.
- After submitting all the tasks, the method shuts down the executor, waits for all the tasks to complete using executor.awaitTermination(), and prints a completion message.
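
The thread-pool lifecycle described above can be tried without any network access. The following is a minimal sketch rather than the exercise's solution: it replaces the jsoup fetch with a hypothetical in-memory link map (LINKS), and the class name CrawlLifecycleSketch, the page names, and the 30-second wait are all assumptions made for illustration. It keeps the visited set in ConcurrentHashMap.newKeySet(), since a plain HashSet is not safe to modify from several pool threads at once.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class CrawlLifecycleSketch {
    static final int MAX_DEPTH = 2;
    static final int MAX_THREADS = 4;

    // Hypothetical in-memory "web": page -> outgoing links (stands in for jsoup fetches)
    static final Map<String, List<String>> LINKS = Map.of(
        "siteA", List.of("siteA/1", "shared"),
        "siteA/1", List.of("siteA/2"),
        "siteA/2", List.of("siteA/3"),
        "siteB", List.of("shared", "siteB/1"));

    static void crawl(String page, int depth, Set<String> visited) {
        // add() is an atomic check-and-insert, so only one thread proceeds per page
        if (depth > MAX_DEPTH || !visited.add(page)) {
            return;
        }
        for (String next : LINKS.getOrDefault(page, List.of())) {
            crawl(next, depth + 1, visited);
        }
    }

    static Set<String> crawlAll(List<String> seeds) {
        Set<String> visited = ConcurrentHashMap.newKeySet();
        ExecutorService executor = Executors.newFixedThreadPool(MAX_THREADS);
        for (String seed : seeds) {
            executor.execute(() -> crawl(seed, 0, visited));
        }
        executor.shutdown(); // no new tasks; already submitted tasks keep running
        try {
            // Bounded wait (an assumption of this sketch) instead of waiting forever
            if (!executor.awaitTermination(30, TimeUnit.SECONDS)) {
                executor.shutdownNow();
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
        return visited;
    }
}
```

Calling CrawlLifecycleSketch.crawlAll(List.of("siteA", "siteB")) visits six pages. The page siteA/3 is skipped because it sits at depth 3, and shared is recorded only once even though both seeds link to it.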

For more Practice: Solve these Related Problems:
- Write a Java program to implement a concurrent web crawler using threads that fetch URLs from a shared queue and process them simultaneously.
- Write a Java program to create a multi-threaded web crawler that uses synchronized blocks to prevent duplicate URL processing.
- Write a Java program to implement a web crawler using ExecutorService and Callable tasks to fetch website content with a timeout mechanism.
- Write a Java program to build a concurrent web crawler that aggregates data from multiple websites and stores results in a thread-safe collection.
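
As a possible starting point for the Callable-with-timeout problem above, here is a hedged sketch that uses only the standard library. The fakeFetch helper is hypothetical: it sleeps for a given delay instead of contacting a real site, and the class name TimeoutFetchSketch, the pool size, and the "TIMED OUT" marker are likewise assumptions made for illustration. In a real crawler the Callable would wrap the actual page fetch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeoutFetchSketch {
    // Hypothetical stand-in for a network fetch: sleeps, then returns a fake title
    static Callable<String> fakeFetch(String url, long delayMillis) {
        return () -> {
            Thread.sleep(delayMillis);
            return "Title of " + url;
        };
    }

    // Submits one task per URL and waits at most timeoutMillis on each Future in turn
    static Map<String, String> fetchAll(Map<String, Long> delaysByUrl, long timeoutMillis) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        Map<String, Future<String>> futures = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : delaysByUrl.entrySet()) {
            futures.put(e.getKey(), executor.submit(fakeFetch(e.getKey(), e.getValue())));
        }
        Map<String, String> results = new LinkedHashMap<>();
        for (Map.Entry<String, Future<String>> e : futures.entrySet()) {
            try {
                results.put(e.getKey(), e.getValue().get(timeoutMillis, TimeUnit.MILLISECONDS));
            } catch (TimeoutException ex) {
                e.getValue().cancel(true); // interrupt the task that is still running
                results.put(e.getKey(), "TIMED OUT");
            } catch (ExecutionException ex) {
                results.put(e.getKey(), "FAILED: " + ex.getCause());
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                break;
            }
        }
        executor.shutdownNow(); // cancel anything still pending
        return results;
    }
}
```

Because the waits happen one Future at a time, a task examined later has effectively had longer to finish than the nominal timeout. If every task should share one overall deadline, ExecutorService.invokeAll with a timeout is an alternative.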