Intro to Web Scraping: Bash + cURL + grep = Fast Data Extraction

When it comes to quickly extracting data from websites, few tools offer the combined power and speed of the classic Unix trio: bash, cURL, and grep. In this guide, you’ll learn how to create a fast and effective web scraping script using nothing more than tools already available on most Linux and macOS terminals.

We’ll walk through scraping headlines from a news website in less than 20 lines of shell code. It’s fast, efficient, and a perfect introduction to lightweight automation using shell scripting.

1. Understanding the Stack: How Bash + cURL + grep Work Together

  • Bash is the scripting shell that lets you automate tasks.
  • cURL is a command-line tool for fetching data over HTTP.
  • grep is a utility for filtering text using regular expressions.

Together, they form a pipeline:

curl <url> | grep '<pattern>'

This pattern allows us to fetch HTML from a web page and pull out just the parts we care about—in this case, headlines.
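As a quick illustration, pulling the <title> element out of a page follows the same shape (a minimal example; it assumes the title sits on a single line, which is true for most pages):

curl -s https://example.com | grep -o '<title>.*</title>'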

2. Fetching Web Page HTML with cURL

We start by using cURL to download the raw HTML of a news site. For example, to grab the home page of BBC News:

curl -s https://www.bbc.com/news

The -s (silent) flag suppresses the progress meter and error messages so the output stays clean. This fetches the entire HTML of the page, which we will now inspect to find the pattern for headlines.

You can preview the structure using:

curl -s https://www.bbc.com/news | head -n 100

Look for repeated tags or classes. At the time of writing, BBC wraps many headlines in tags like:

<h3 class="gs-c-promo-heading__title">Headline Text</h3>
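If you're not sure which class to target, one rough way to spot candidates is to count how often each class attribute appears on the page (a generic heuristic, not specific to BBC):

curl -s https://www.bbc.com/news \
  | grep -o 'class="[^"]*"' \
  | sort | uniq -c | sort -rn | head

Classes that show up dozens of times usually belong to the repeated components (cards, promos, headlines) you want to match.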

3. Filtering Headlines Using grep

Now let’s pipe the HTML into grep to extract only the lines containing that tag:

curl -s https://www.bbc.com/news | grep -oP '<h3 class="gs-c-promo-heading__title.*?>.*?</h3>'

Here’s what’s happening:

  • -o: Only outputs the matched text.
  • -P: Enables Perl-compatible regex, allowing for lazy quantifiers like .*?.
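Note that -P is a GNU grep feature; the BSD grep that ships with macOS does not support it. On a Mac, install GNU grep (via Homebrew it's available as ggrep) or fall back to a perl one-liner, which here captures the headline text directly (a sketch using the same pattern):

curl -s https://www.bbc.com/news \
  | perl -0777 -ne 'print "$1\n" while /<h3 class="gs-c-promo-heading__title[^>]*>(.*?)<\/h3>/gs'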

This returns a list of raw HTML headline elements. For more readable results, let’s strip out the tags.

4. Cleaning the Output: Removing HTML Tags

To extract just the text content, we can pipe the output through sed:

curl -s https://www.bbc.com/news \
  | grep -oP '<h3 class="gs-c-promo-heading__title.*?>.*?</h3>' \
  | sed -E 's/<[^>]+>//g'

sed -E 's/<[^>]+>//g' removes all HTML tags: the extended regular expression <[^>]+> matches each tag and replaces it with nothing.

The final output is a clean list of news headlines scraped directly from the site.
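Headlines sometimes contain HTML entities such as &amp; or &#39;. If you want those decoded too, one more sed stage at the end of the pipeline handles the most common cases (a minimal sketch; extend the list as needed):

sed -e 's/&amp;/\&/g' -e 's/&quot;/"/g' -e "s/&#39;/'/g"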

5. Full Script: Bash-Powered Headline Scraper

You can now turn this into a script:

#!/bin/bash

# Scrape top headlines from BBC News
url="https://www.bbc.com/news"

curl -s "$url" \
  | grep -oP '<h3 class="gs-c-promo-heading__title.*?>.*?</h3>' \
  | sed -E 's/<[^>]+>//g' \
  | head -n 10

This script prints the first 10 headlines it finds. Save it as get_headlines.sh, make it executable with chmod +x get_headlines.sh, and run it.
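If you want to point the script at other pages, one small tweak is to read the URL from the first argument, falling back to the BBC front page (a sketch; other pages will only work if their headlines use the same markup):

url="${1:-https://www.bbc.com/news}"

Then ./get_headlines.sh https://www.bbc.com/sport would scrape a different section.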

6. Tips and Considerations

  • Website structure changes: HTML elements and classes can change, breaking your script. Always re-check patterns if results seem off.
  • Rate limiting: Don’t hit websites too frequently. Add sleep commands or use cron to schedule scrapes responsibly (see the sketch after this list).
  • Terms of use: Always check a website’s robots.txt and terms of service before scraping its content.
  • Alternatives: For more complex scraping, consider tools like BeautifulSoup in Python, but for fast one-off jobs, Bash is tough to beat.
  • Performance: This approach is extremely fast for simple parsing tasks and avoids dependency bloat common in heavier languages.
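For scheduled runs, a crontab entry is usually all you need. The paths below are hypothetical; point them at wherever you saved the script:

# Run once an hour and append the results to a log
0 * * * * /home/user/get_headlines.sh >> /home/user/headlines.log 2>&1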

Conclusion

Bash + cURL + grep is a quick and effective way to scrape structured data from web pages when performance and minimalism matter. It’s perfect for CLI dashboards, logs, automation scripts, or lightweight data analysis. Try adapting this method to other sites or tags—your command-line scraper toolkit has just leveled up!

 
