Crawl Websites in Bash: Lightweight Link Extractor with curl & grep
If you’ve ever wanted to explore webpages programmatically without relying on heavyweight tools or libraries, Bash can be a surprisingly capable ally. In this tutorial, we’ll create a minimalist web crawler that extracts all hyperlinks from any URL using native Unix tools: curl and grep.
1. Introduction to Bash Web Crawling
Web scraping and crawling are typically associated with powerful languages like Python or JavaScript. However, Bash is an excellent option for quick exploration, especially if you’re working in a terminal-focused environment like a remote server, Docker container, or minimal Linux distro. Using just a few CLI tools, we can create a fully functional link extractor.
Our goal is simple: fetch the HTML of a webpage, find all <a> tags with href attributes, and extract the URLs using nothing but Bash.
2. Fetching Page Content with curl
The first step is to retrieve the raw HTML content of a target webpage. We’ll use curl, a tool designed for transferring data with URLs.
curl -s https://example.com
The -s flag silences progress and error messages for cleaner output. To save this for further processing, use a command substitution or pipe it into another tool.
Example:
html=$(curl -s https://example.com)
Now $html contains the entire HTML of the target page.
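If the target site redirects (for example, from HTTP to HTTPS) or returns an error page, the silent fetch above will happily hand you the wrong content. A slightly more defensive variant, using a few extra curl flags, might look like this:
# -S shows errors even in silent mode, -L follows redirects,
# and --fail makes curl exit non-zero on HTTP errors such as 404
html=$(curl -sSL --fail https://example.com) || echo "Fetch failed" >&2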
3. Extracting Hyperlinks Using grep
To extract hyperlinks, we need to find all anchor (<a>) tags with href attributes. Here’s a simple but effective one-liner that does this:
echo "$html" | grep -oP '(?i)href=["\']\K[^"\']+(?=["\'])'
Let’s break it down:
- grep -oP: Enables Perl-compatible regular expressions, with output limited to the matched portions only.
- (?i): Makes the regex case-insensitive.
- href=["']\K[^"']+(?=["']): Extracts the value within href="..." or href='...'; \K discards the href= prefix from the match, and the lookahead stops before the closing quote.
Note that the pattern is wrapped in double quotes on the shell side so that both quote characters can appear inside the character classes.
This line works well for most simple pages, although HTML is notoriously inconsistent, so improvements are always possible using tools like xmllint or pup.
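As one example of a more robust alternative, if pup happens to be installed, a CSS selector can pull the same attribute values without a hand-rolled regex. This is only an illustration; the rest of the tutorial sticks with grep:
# Requires pup (https://github.com/ericchiang/pup)
curl -s https://example.com | pup 'a attr{href}'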
4. Turning It into a Reusable Bash Script
Let’s convert our extraction logic into a script that accepts a URL as an argument and prints out all found links.
#!/bin/bash
# Check for URL arg
if [ -z "$1" ]; then
echo "Usage: $0 <URL>"
exit 1
fi
# Fetch HTML
html=$(curl -s "$1")
# Extract and print links
echo "$html" | grep -oP '(?i)href=["\']\K[^"\']+(?=["\'])'
Save this as extract_links.sh, and run:
chmod +x extract_links.sh
./extract_links.sh https://example.com
The script will print a list of hyperlinks found on the page. You can wrap it into more advanced workflows or pipe the output into other scripts.
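Because each link lands on its own line, the output composes naturally with standard Unix filters. For example, to deduplicate the links or simply count them:
./extract_links.sh https://example.com | sort -u
./extract_links.sh https://example.com | wc -l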
5. Real-World Use Cases and Tips
This tool is useful for:
- Checking internal broken links on a personal website.
- Monitoring newly added articles or download links.
- Automating downloads by crawling pages and feeding URLs into wget or aria2 (sketched below).
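As a sketch of that last use case, the extracted links can be piped straight into wget; the .pdf filter here is only an example of what you might choose to download:
# Download every PDF linked from a page (illustrative filter)
./extract_links.sh https://example.com | grep -i '\.pdf$' | xargs -r -n 1 wget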
Performance and Optimization Tips:
- Use curl --compressed to handle compressed content.
- Filter only internal or external links using grep '^/' or grep '^http' (see the example after this list).
- Consider using tools like pup (https://github.com/ericchiang/pup) or hxselect for more reliable HTML parsing.
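To make the filtering tip concrete, here is one way to separate root-relative (typically internal) links from absolute ones:
# Root-relative links such as /about or /blog/post-1
./extract_links.sh https://example.com | grep '^/'
# Absolute links starting with http:// or https://
./extract_links.sh https://example.com | grep '^http'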
6. Expanding the Crawler
The current implementation only fetches and parses a single page. To build a simple recursive crawler, consider this strategy:
- Extract all links from a base URL.
- Filter for internal links (to stay within the same domain).
- Store visited URLs in a file or associative array.
- Repeat extraction for unvisited links.
Here’s a minimal Bash sketch to illustrate the recursion:
declare -A visited   # tracks URLs we have already crawled

crawl() {
  local url=$1
  [ -n "${visited[$url]}" ] && return    # skip pages we have seen
  visited[$url]=1
  echo "Visiting $url"
  local base links
  base=$(grep -oP '^https?://[^/]+' <<<"$url")    # scheme + host, e.g. https://example.com
  links=$(curl -s "$url" | grep -oP "(?i)href=[\"']\K[^\"']+(?=[\"'])")
  for link in $links; do
    case $link in
      "$base"*) crawl "$link" ;;      # absolute link on the same host
      /*) crawl "$base$link" ;;       # root-relative link such as /about
    esac
  done
}
Use caution when crawling to avoid hammering websites. Always respect robots.txt and rate-limit your requests.
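A minimal way to stay polite is to pause between requests. The loop below assumes a urls.txt file with one URL per line, and the one-second delay is an arbitrary starting point:
# Crawl a list of URLs with a fixed delay between requests
while read -r url; do
  ./extract_links.sh "$url"
  sleep 1
done < urls.txt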
Conclusion
Using Bash combined with tools like curl and grep, you can achieve powerful web crawling capabilities in lightweight environments without needing full libraries or frameworks. This method is great for personal automations, quick inspections, or shell scripting workflows. With a bit more logic and filtering, you can turn this into a full-fledged CLI micro-crawler.
Happy crawling from your terminal!