Intro to Web Scraping: Bash + cURL + grep = Fast Data Extraction
When it comes to quickly extracting data from websites, few tools offer the combined power and speed of the classic Unix trio: bash, cURL, and grep. In this guide, you’ll learn how to create a fast and effective web scraping script using nothing more than tools already available in most Linux and macOS terminals.
We’ll walk through scraping headlines from a news website in less than 20 lines of shell code. It’s fast, efficient, and a perfect introduction to lightweight automation using shell scripting.
1. Understanding the Stack: How Bash + cURL + grep Work Together
Bash is the scripting shell that lets you automate tasks. cURL is a command-line tool for fetching data over HTTP. grep is a utility for filtering text using regular expressions.
Together, they form a pipeline:
curl <url> | grep '<pattern>'
This pattern allows us to fetch HTML from a web page and pull out just the parts we care about—in this case, headlines.
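As a minimal illustration of the idea (using example.com as a stand-in for any page), the same pipeline can pull out a page title:
curl -s https://example.com | grep -o '<title>.*</title>'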
2. Fetching Web Page HTML with cURL
We start by using cURL to download the raw HTML of a news site. For example, to grab the home page of BBC News:
curl -s https://www.bbc.com/news
The -s flag silences the progress meter so the output is clean. This fetches the entire HTML of the page, which we will now inspect to find the pattern for headlines.
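If a site redirects or serves different content to unknown clients, two optional cURL flags can help: -L follows redirects and -A sets a User-Agent header. The user-agent string below is just an illustrative value, not something the site requires:
curl -sL -A "Mozilla/5.0 (compatible; headline-scraper)" https://www.bbc.com/news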
You can preview the structure using:
curl -s https://www.bbc.com/news | head -n 100
Look for repeated tags or classes. At the time of writing, BBC wraps many headlines in tags like:
<h3 class="gs-c-promo-heading__title">Headline Text</h3>
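If you’re not sure which class to target, one rough way to spot candidates is to count how often each class attribute appears; heavily repeated classes usually mark list items such as headlines. This is only a heuristic:
curl -s https://www.bbc.com/news | grep -oE 'class="[^"]+"' | sort | uniq -c | sort -rn | head -n 15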
3. Filtering Headlines Using grep
Now let’s pipe the HTML into grep to extract only the lines containing that tag:
curl -s https://www.bbc.com/news | grep -oP '<h3 class="gs-c-promo-heading__title.*?>.*?</h3>'
Here’s what’s happening:
- -o: only outputs the matched text.
- -P: enables Perl-compatible regex, allowing for lazy quantifiers like .*?.
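Note that -P requires GNU grep; the BSD grep bundled with macOS doesn’t support it (installing GNU grep via Homebrew is one option). Where -P isn’t available, a perl one-liner is a rough line-by-line equivalent:
curl -s https://www.bbc.com/news | perl -ne 'print "$1\n" while /(<h3 class="gs-c-promo-heading__title.*?>.*?<\/h3>)/g'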
This returns a list of raw HTML headline elements. For more readable results, let’s strip out the tags.
4. Cleaning the Output: Removing HTML Tags
To extract just the text content, we can pipe the output through a sed command:
curl -s https://www.bbc.com/news \
| grep -oP '<h3 class="gs-c-promo-heading__title.*?>.*?</h3>' \
| sed -E 's/<[^>]+>//g'
The sed -E 's/<[^>]+>//g' expression strips out every HTML tag: -E enables extended regular expressions, and <[^>]+> matches any tag so it can be replaced with nothing.
The final output is a clean list of news headlines scraped directly from the site.
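Some headlines may still contain HTML entities such as &amp; or &quot;. An extra sed pass can translate the most common ones; this handles only a few entities, not full HTML decoding:
curl -s https://www.bbc.com/news \
| grep -oP '<h3 class="gs-c-promo-heading__title.*?>.*?</h3>' \
| sed -E 's/<[^>]+>//g' \
| sed -e 's/&amp;/\&/g' -e "s/&#39;/'/g" -e 's/&quot;/"/g'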
5. Full Script: Bash-Powered Headline Scraper
You can now turn this into a script:
#!/bin/bash
# Scrape top headlines from BBC News
url="https://www.bbc.com/news"
curl -s "$url" \
| grep -oP '<h3 class="gs-c-promo-heading__title.*?>.*?</h3>' \
| sed -E 's/<[^>]+>//g' \
| head -n 10
This script will display the top 10 headlines. Save it as get_headlines.sh, give it execute permissions with chmod +x get_headlines.sh, and run it.
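In a terminal, that looks like:
chmod +x get_headlines.sh
./get_headlines.sh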
6. Tips and Considerations
- Website structure changes: HTML elements and classes can change, breaking your script. Always re-check patterns if results seem off.
- Rate limiting: Don’t hit websites too frequently. Add sleep commands or use cron to schedule scrapes responsibly (see the crontab sketch after this list).
- Publication restrictions: Always check a website’s robots.txt and terms of service before scraping their content.
- Alternatives: For more complex scraping, consider tools like BeautifulSoup in Python, but for fast one-off jobs, Bash is tough to beat.
- Performance: This approach is extremely fast for simple parsing tasks and avoids the dependency bloat common in heavier languages.
Conclusion
Bash + cURL + grep is a quick and effective way to scrape structured data from web pages when performance and minimalism matter. It’s perfect for CLI dashboards, logs, automation scripts, or lightweight data analysis. Try adapting this method to other sites or tags—your command-line scraper toolkit has just leveled up!