Scrape Any Website with BeautifulSoup and Python Requests

Web scraping is a powerful tool for developers who want to automate data collection from web pages. Whether you’re building a price tracker, gathering research material, or assembling your own dataset, the requests and BeautifulSoup libraries give Python everything it needs. In this tutorial, you’ll learn how to scrape any static website, parse its HTML, extract the data you care about, and export it to clean JSON.

1. Setting Up Your Environment

Before we dive into code, make sure you have the necessary libraries installed. You can use pip to install them:

pip install requests beautifulsoup4

We’ll also use Python’s built-in json library to export our data. Here’s the full setup:

import requests
from bs4 import BeautifulSoup
import json

With this setup, you’re ready to start scraping.

2. Fetching Web Page Content with Requests

The requests library lets you send HTTP requests using Python. Here’s how to fetch the HTML content of a sample website:

URL = 'https://example.com'
response = requests.get(URL)

if response.status_code == 200:
    html = response.text
    print("Page fetched successfully!")
else:
    print(f"Failed to retrieve page: {response.status_code}")

This code fetches the page and checks for a successful response. Always check the status code—many websites rate-limit or block scrapers.
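If you prefer exceptions over manual status checks, requests can raise them for you. Here is a minimal sketch, reusing the URL and imports from above, that also adds a timeout so the script never hangs on an unresponsive server:

try:
    # a timeout keeps the request from hanging indefinitely
    response = requests.get(URL, timeout=10)
    # raise_for_status() turns 4xx/5xx responses into an HTTPError
    response.raise_for_status()
    html = response.text
except requests.exceptions.RequestException as exc:
    print(f"Request failed: {exc}")
    html = None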

3. Parsing HTML with BeautifulSoup

Once you have the HTML, it’s time to parse it using BeautifulSoup. This library converts raw HTML into a tree of searchable elements, making it easy to extract only the parts you care about.

soup = BeautifulSoup(html, 'html.parser')

# Example: Extract all paragraph text
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.text)

Need to extract more structured data like products or headlines? Use the find() and find_all() methods with tag names, class names, or IDs.

# Get the text of every div with class 'product-title'
titles = soup.find_all('div', class_='product-title')

product_data = [t.text.strip() for t in titles]

print(product_data)

Tip: Use browser developer tools (Inspect Element) to find exact tag and class names.
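If you prefer CSS selectors, BeautifulSoup also supports them through select(). The snippet below is equivalent to the find_all() example above, still assuming the hypothetical 'product-title' class:

# CSS selector: 'div.product-title' matches <div class="product-title">
titles = soup.select('div.product-title')
product_data = [t.get_text(strip=True) for t in titles]
print(product_data)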

4. Exporting the Data to JSON

To make the extracted data reusable, serialize it to JSON using Python’s json module. JSON is perfect for APIs, data pipelines, or even local storage.

# Wrap the scraping steps in a reusable function
def scrape_titles(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail fast on HTTP errors
    soup = BeautifulSoup(response.text, 'html.parser')
    titles = soup.find_all('div', class_='product-title')
    return [t.text.strip() for t in titles]

output_data = {
    "source": URL,
    "titles": scrape_titles(URL)
}

# Write to a file (UTF-8 so non-ASCII characters survive intact)
with open('titles.json', 'w', encoding='utf-8') as f:
    json.dump(output_data, f, indent=4, ensure_ascii=False)

print("Data saved to titles.json")

This saves your data in a structured way that’s easy to work with in other tools or programming languages.
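Reading the file back is just as straightforward, which is what makes JSON convenient for handing data to other scripts. A quick sketch, assuming titles.json was written as above:

with open('titles.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

print(data["source"])
print(f"{len(data['titles'])} titles loaded")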

5. Real-World Use Cases and Tips

Here are some practical scenarios where web scraping saves time and automates workflows:

  • Market Research: Gather product data from competitors’ pages
  • Job Aggregators: Collect job listings from public websites
  • Content Curation: Extract headlines or article summaries for newsletters

Here are a few performance and reliability tips:

  • Use headers to mimic browser behavior: {'User-Agent': 'Mozilla/5.0 ...'}
  • Use time.sleep() between requests to avoid being blocked
  • Use try/except blocks to handle network errors gracefully (sketched after the example below)

Example with headers and delay:

import time
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(URL, headers=headers)

# polite delay
time.sleep(2)
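The try/except tip deserves a snippet of its own. Building on the example above, here is a minimal sketch that combines all three ideas; the URL list is purely illustrative:

urls = ['https://example.com/page-1', 'https://example.com/page-2']  # hypothetical URLs

for url in urls:
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        print(f"Fetched {url} ({len(response.text)} bytes)")
    except requests.exceptions.RequestException as exc:
        # covers connection errors, timeouts, and HTTP error statuses
        print(f"Skipping {url}: {exc}")
    time.sleep(2)  # polite delay between requests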

For pages that load their content with JavaScript, requests only sees the initial HTML, so the data you need may simply not be there. In that case, consider a browser automation tool such as Selenium, or a library like requests-html.
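A rough sketch of the Selenium route, assuming Selenium 4.6+ (which downloads a matching driver automatically) and a local Chrome installation:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()           # launches a real browser
driver.get('https://example.com')     # JavaScript executes during the load
html = driver.page_source             # HTML after rendering
driver.quit()

soup = BeautifulSoup(html, 'html.parser')
print(soup.title.text)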

Conclusion

Scraping websites using Python’s requests and BeautifulSoup is a powerful and accessible way to automate data collection. By fetching raw HTML, parsing it intelligently, and exporting it in a structured format like JSON, you can build automation scripts, data enrichment pipelines, and much more.

Just remember to follow each site’s robots.txt guidelines and terms of service to ensure you’re scraping responsibly.
