Find Broken Links: Web Crawler in Python Using Requests and BeautifulSoup
Broken links are a quality and SEO killer. Whether you’re managing a personal blog, a portfolio, or an enterprise site, identifying and fixing dead links is crucial. In this tutorial, we’ll build a Python-based multithreaded web crawler using requests and BeautifulSoup to automatically scan a website and log broken internal and external URLs.
1. Setting Up Your Environment
Before diving into code, make sure you have the necessary libraries installed:
pip install requests beautifulsoup4
We’ll also use the built-in threading and queue modules for multithreading support. Here’s the stack we’ll use:
- requests – fetching URLs
- BeautifulSoup – extracting links from HTML
- threading / queue – concurrent crawling
- urllib.parse – handling URLs
2. Basic Crawler: Extracting All Links from a Page
Let’s start by fetching a page and extracting all the href links.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
def extract_links(url):
    try:
        response = requests.get(url, timeout=5)
        soup = BeautifulSoup(response.text, 'html.parser')
        links = set()
        for a_tag in soup.find_all('a', href=True):
            href = a_tag['href']
            full_url = urljoin(url, href)
            links.add(full_url)
        return links
    except requests.RequestException:
        return set()
This function fetches the page, collects every <a href="..."> link, and resolves relative paths to absolute URLs with urljoin.
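As a quick sanity check, you can call the function on a single page and print what it finds (https://example.com is only a placeholder here):

links = extract_links('https://example.com')
for link in sorted(links):
    # Each entry is already an absolute URL thanks to urljoin.
    print(link)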
3. Detecting Broken Links
To verify a URL, we can issue a HEAD or GET request and check the HTTP status code. A 4xx or 5xx response usually indicates a broken link.
def is_broken_link(url):
    try:
        response = requests.head(url, allow_redirects=True, timeout=5)
        return response.status_code >= 400
    except requests.RequestException:
        return True
requests.head is lighter than requests.get and is usually enough to determine availability, and allow_redirects=True ensures that URLs sitting behind redirects are still treated as reachable. Some servers reject HEAD requests outright, so a GET fallback can help, as sketched below.
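A minimal, hedged variant of is_broken_link with that fallback; it assumes a 405 (Method Not Allowed) status is the signal that HEAD is unsupported:

def is_broken_link(url):
    try:
        response = requests.head(url, allow_redirects=True, timeout=5)
        if response.status_code == 405:
            # HEAD not supported: retry with GET, streaming so the body isn't downloaded.
            response = requests.get(url, allow_redirects=True, timeout=5, stream=True)
        return response.status_code >= 400
    except requests.RequestException:
        return True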
4. Adding Multi-threading for Efficiency
Crawling large websites sequentially is slow. With threading, we can parallelize our spider. Here’s a multithreaded crawler skeleton:
import threading
import queue
visited = set()
broken_links = []
to_visit = queue.Queue()
lock = threading.Lock()
def crawl():
    # Worker loop: get() blocks until a URL is available. Because the threads
    # are daemons and the main thread waits on to_visit.join(), the workers
    # don't need their own exit condition.
    while True:
        url = to_visit.get()
        try:
            with lock:
                if url in visited:
                    continue
                visited.add(url)
            for link in extract_links(url):
                parsed = urlparse(link)
                if parsed.netloc == base_domain:
                    to_visit.put(link)  # only crawl further within the starting domain
                if is_broken_link(link):
                    with lock:
                        broken_links.append(link)
        finally:
            to_visit.task_done()
Each worker blocks on the queue, skips URLs that have already been visited, extracts the links on the page, re-enqueues internal links for further crawling, and records broken links in a shared list protected by a lock. Because the workers are daemon threads and the main thread waits on to_visit.join(), they can simply block on get() without an explicit exit condition.
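One limitation of this skeleton is that the same external URL is re-checked every time it appears on another page. A small, hedged refinement is to keep a separate checked set alongside visited; the names checked and check_once below are additions for illustration, not part of the code above:

checked = set()

def check_once(link):
    # Only issue a network request the first time a given URL is seen.
    with lock:
        if link in checked:
            return
        checked.add(link)
    if is_broken_link(link):
        with lock:
            broken_links.append(link)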
5. Assembling the Program & Entry Point
Let’s put it all together in a main function:
def start_crawling(start_url, num_threads=5):
    global base_domain  # shared with the crawl() workers
    base_domain = urlparse(start_url).netloc
    to_visit.put(start_url)

    threads = []
    for _ in range(num_threads):
        t = threading.Thread(target=crawl)
        t.daemon = True  # daemon threads are cleaned up when the main thread exits
        t.start()
        threads.append(t)

    to_visit.join()  # block until every queued URL has been processed

    print("Broken links found:")
    for link in broken_links:
        print(link)

if __name__ == '__main__':
    start_url = input("Enter the starting URL (e.g., https://example.com): ")
    start_crawling(start_url)
We initialize the domain, enqueue the start URL, fire up threads, and wait for crawling to complete. This outputs all detected broken links at the end.
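Assuming the whole script is saved as a single file (broken_link_checker.py is just an example name), a run looks like this; the actual list of broken links depends entirely on the site you scan:

$ python broken_link_checker.py
Enter the starting URL (e.g., https://example.com): https://example.com
Broken links found:
...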
6. Performance & Usage Tips
- Respect robots.txt: In production, read robots.txt to avoid crawling restricted paths (see the sketch after this list).
- Avoid external spam: You may want to limit crawling to internal links only by matching netloc.
- Use retry/backoff: Implement retry logic for transient network errors.
- Export logs: Save broken links to a file or spreadsheet for auditing.
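As a hedged illustration of the first and last tips, here is a small sketch using the standard library's urllib.robotparser and csv modules; the user agent string and output filename are arbitrary choices, not requirements:

import csv
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def allowed_by_robots(start_url, url, user_agent='BrokenLinkBot'):
    # In a real crawler, fetch and parse robots.txt once per run; shown inline for brevity.
    parser = RobotFileParser()
    parser.set_url(urljoin(start_url, '/robots.txt'))
    parser.read()
    return parser.can_fetch(user_agent, url)

def export_broken_links(broken_links, path='broken_links.csv'):
    # One broken URL per row so the report opens cleanly in a spreadsheet.
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['broken_url'])
        for link in broken_links:
            writer.writerow([link])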
This solution works well for small to mid-sized sites. For enterprise-scale or dynamic websites, consider using tools like Selenium or headless browsers to capture JavaScript-generated links.
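If you do need JavaScript-rendered links, a rough sketch with Selenium's headless Chrome driver might look like the following; extract_links_js is a name chosen for this example, and it assumes Selenium 4 plus a local Chrome install:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

def extract_links_js(url):
    # Headless Chrome renders the page, so anchors inserted by JavaScript are visible too.
    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return {a.get_attribute('href')
                for a in driver.find_elements(By.TAG_NAME, 'a')
                if a.get_attribute('href')}
    finally:
        driver.quit()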
Conclusion
A simple multithreaded web crawler in Python can be a powerful tool for automating website maintenance. With just requests, BeautifulSoup, and Python’s concurrency features, we can quickly identify broken links and improve the health of any site. This project is also a great starting point for more advanced tools in crawling, SEO automation, or web analytics.