Find Broken Links: Web Crawler in Python Using Requests and BeautifulSoup
Broken links are a quality and SEO killer. Whether you’re managing a personal blog, a portfolio, or an enterprise site, identifying and fixing dead links is crucial. In this tutorial, we’ll build a Python-based multithreaded web crawler using requests and BeautifulSoup to automatically scan a website and log broken internal and external URLs.
1. Setting Up Your Environment
Before diving into code, make sure you have the necessary libraries installed:
pip install requests beautifulsoup4
We’ll also use the built-in threading and queue modules for multithreading support. Here’s the stack we’ll use:
- requests – fetching URLs
- BeautifulSoup – extracting links from HTML
- threading / queue – concurrent crawling
- urllib.parse – handling URLs
2. Basic Crawler: Extracting All Links from a Page
Let’s start by fetching a page and extracting all the href links.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
def extract_links(url):
    try:
        response = requests.get(url, timeout=5)
        soup = BeautifulSoup(response.text, 'html.parser')
        links = set()
        for a_tag in soup.find_all('a', href=True):
            href = a_tag['href']
            full_url = urljoin(url, href)
            links.add(full_url)
        return links
    except requests.RequestException:
        return set()
This function fetches the page, collects every <a href="..."> link, and resolves relative paths to absolute URLs with urljoin.
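As a quick sanity check, you can call the function on a single page and print what it finds (https://example.com is only a placeholder here):

links = extract_links('https://example.com')
for link in sorted(links):
    # Each entry is already an absolute URL thanks to urljoin.
    print(link)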
3. Detecting Broken Links
To verify a URL, we can issue a HEAD or GET request and check the HTTP status code. A 4xx or 5xx response usually indicates a broken link.
def is_broken_link(url):
    try:
        response = requests.head(url, allow_redirects=True, timeout=5)
        return response.status_code >= 400
    except requests.RequestException:
        return True
requests.head is lighter than requests.get and is usually enough to determine availability, and allow_redirects=True ensures that URLs sitting behind redirects are still treated as reachable. Some servers reject HEAD requests outright, so a GET fallback can help, as sketched below.
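A minimal, hedged variant of is_broken_link with that fallback; it assumes a 405 (Method Not Allowed) status is the signal that HEAD is unsupported:

def is_broken_link(url):
    try:
        response = requests.head(url, allow_redirects=True, timeout=5)
        if response.status_code == 405:
            # HEAD not supported: retry with GET, streaming so the body isn't downloaded.
            response = requests.get(url, allow_redirects=True, timeout=5, stream=True)
        return response.status_code >= 400
    except requests.RequestException:
        return True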
4. Adding Multi-threading for Efficiency
Crawling large websites sequentially is slow. With threading, we can parallelize our spider. Here’s a multithreaded crawler skeleton:
import threading
import queue
visited = set()
broken_links = []
to_visit = queue.Queue()
lock = threading.Lock()
def crawl():
    # Worker loop: get() blocks until a URL is available. Because the threads
    # are daemons and the main thread waits on to_visit.join(), the workers
    # don't need their own exit condition.
    while True:
        url = to_visit.get()
        try:
            with lock:
                if url in visited:
                    continue
                visited.add(url)
            for link in extract_links(url):
                parsed = urlparse(link)
                if parsed.netloc == base_domain:
                    to_visit.put(link)  # only crawl further within the starting domain
                if is_broken_link(link):
                    with lock:
                        broken_links.append(link)
        finally:
            to_visit.task_done()
Each worker blocks on the queue, skips URLs that have already been visited, extracts the links on the page, re-enqueues internal links for further crawling, and records broken links in a shared list protected by a lock. Because the workers are daemon threads and the main thread waits on to_visit.join(), they can simply block on get() without an explicit exit condition.
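One limitation of this skeleton is that the same external URL is re-checked every time it appears on another page. A small, hedged refinement is to keep a separate checked set alongside visited; the names checked and check_once below are additions for illustration, not part of the code above:

checked = set()

def check_once(link):
    # Only issue a network request the first time a given URL is seen.
    with lock:
        if link in checked:
            return
        checked.add(link)
    if is_broken_link(link):
        with lock:
            broken_links.append(link)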
5. Assembling the Program & Entry Point
Let’s put it all together in a main function:
def start_crawling(start_url, num_threads=5):
    global base_domain  # shared with the crawl() workers
    base_domain = urlparse(start_url).netloc
    to_visit.put(start_url)

    threads = []
    for _ in range(num_threads):
        t = threading.Thread(target=crawl)
        t.daemon = True  # daemon threads are cleaned up when the main thread exits
        t.start()
        threads.append(t)

    to_visit.join()  # block until every queued URL has been processed

    print("Broken links found:")
    for link in broken_links:
        print(link)

if __name__ == '__main__':
    start_url = input("Enter the starting URL (e.g., https://example.com): ")
    start_crawling(start_url)
We initialize the domain, enqueue the start URL, fire up threads, and wait for crawling to complete. This outputs all detected broken links at the end.
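Assuming the whole script is saved as a single file (broken_link_checker.py is just an example name), a run looks like this; the actual list of broken links depends entirely on the site you scan:

$ python broken_link_checker.py
Enter the starting URL (e.g., https://example.com): https://example.com
Broken links found:
...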
6. Performance & Usage Tips
- Respect robots.txt: In production, read robots.txt to avoid crawling restricted paths (see the sketch after this list).
- Avoid external spam: You may want to limit crawling to internal links only by matching netloc.
- Use retry/backoff: Implement retry logic for transient network errors.
- Export logs: Save broken links to a file or spreadsheet for auditing.
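As a hedged illustration of the first and last tips, here is a small sketch using the standard library's urllib.robotparser and csv modules; the user agent string and output filename are arbitrary choices, not requirements:

import csv
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def allowed_by_robots(start_url, url, user_agent='BrokenLinkBot'):
    # In a real crawler, fetch and parse robots.txt once per run; shown inline for brevity.
    parser = RobotFileParser()
    parser.set_url(urljoin(start_url, '/robots.txt'))
    parser.read()
    return parser.can_fetch(user_agent, url)

def export_broken_links(broken_links, path='broken_links.csv'):
    # One broken URL per row so the report opens cleanly in a spreadsheet.
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['broken_url'])
        for link in broken_links:
            writer.writerow([link])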
This solution works well for small to mid-sized sites. For enterprise-scale or dynamic websites, consider using tools like Selenium or headless browsers to capture JavaScript-generated links.
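If you do need JavaScript-rendered links, a rough sketch with Selenium's headless Chrome driver might look like the following; extract_links_js is a name chosen for this example, and it assumes Selenium 4 plus a local Chrome install:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

def extract_links_js(url):
    # Headless Chrome renders the page, so anchors inserted by JavaScript are visible too.
    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return {a.get_attribute('href')
                for a in driver.find_elements(By.TAG_NAME, 'a')
                if a.get_attribute('href')}
    finally:
        driver.quit()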
Conclusion
A simple multithreaded web crawler in Python can be a powerful tool for automating website maintenance. With just requests, BeautifulSoup, and Python’s concurrency features, we can quickly identify broken links and improve the health of any site. This project is also a great starting point for more advanced tools in crawling, SEO automation, or web analytics.