“Set It and Forget It” Web Scraper in Python Using schedule + requests

Web scraping has become an essential tool for automation, data monitoring, and real-time notification systems. Whether you’re tracking job postings, monitoring product prices, or collecting data for market research, Python’s flexibility makes it easy to build a smart, periodic scraper that runs quietly in the background.

In this post, we’ll build a lightweight Python scraper that runs itself on a schedule using two libraries: requests for fetching web pages and schedule for task scheduling. You’ll learn how to build, test, and deploy your own “set it and forget it” scraper in under 100 lines of code.

1. Setting Up the Environment

Before writing any code, make sure you have the required libraries installed. You’ll need:

pip install requests
pip install schedule

We’ll also use BeautifulSoup from bs4 for HTML parsing:

pip install beautifulsoup4

Let’s import these in Python:

import requests
from bs4 import BeautifulSoup
import schedule
import time

With these three libraries (plus the built-in time module), you can fetch pages, parse HTML, and schedule your scraper to run automatically on an hourly or daily basis.

2. Writing a Simple Scraper Function

Let’s say you want to monitor job postings on a sample listing site like “https://example.com/jobs”. Here’s a simple scraping function to extract job titles:

def scrape_job_listings():
    url = "https://example.com/jobs"
    headers = {"User-Agent": "Mozilla/5.0"}

    try:
        # A timeout keeps a stalled connection from hanging the scheduler forever
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()

        soup = BeautifulSoup(response.text, 'html.parser')
        jobs = soup.find_all('h2', class_='job-title')

        print(f"[scrape_job_listings] Found {len(jobs)} jobs:")
        for job in jobs:
            print(f"- {job.text.strip()}")

    except requests.RequestException as e:
        print("[Error] Failed to fetch data:", e)

This function requests the page content, parses it, and extracts the job titles contained in <h2 class="job-title"> elements. It prints each job found and handles network errors gracefully.
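Before wiring this into a scheduler, it’s worth calling the function once by hand to confirm that the selector matches the real page; the job-title class used here is just a placeholder for whatever markup the target site actually uses:

# Quick manual test: run the scraper once before scheduling it.
if __name__ == "__main__":
    scrape_job_listings()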

3. Automating Execution with schedule

Now that we have a working scraper function, it’s time to automate it. The schedule library helps you run Python functions at defined intervals.

You can schedule the scraper to run every 10 minutes or every day at a specific time:

# Run every 10 minutes
schedule.every(10).minutes.do(scrape_job_listings)

# Or, once a day at 09:00 AM
# schedule.every().day.at("09:00").do(scrape_job_listings)

Now add a simple loop to keep the job running indefinitely in the background:

while True:
    schedule.run_pending()
    time.sleep(1)

This is the core of your always-on, cron-like scraping engine in pure Python — no external services or cron setup required.
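Putting it all together, a complete scraper.py can be as small as the sketch below. The immediate first call and the KeyboardInterrupt handler are small additions so the script starts scraping right away and shuts down cleanly on Ctrl+C:

# scraper.py -- assumes the imports and scrape_job_listings() shown above
schedule.every(10).minutes.do(scrape_job_listings)

# schedule waits a full interval before the first run, so trigger one now
scrape_job_listings()

print("Scheduler started; press Ctrl+C to stop.")
try:
    while True:
        schedule.run_pending()
        time.sleep(1)
except KeyboardInterrupt:
    print("Scheduler stopped.")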

4. Making It Robust: Logging and Alerting

To make this tool more production-ready, consider adding logging and optional alerting when conditions are met. For example, let’s alert when certain keywords show up in a job title:

import logging

logging.basicConfig(filename='scraper.log', level=logging.INFO,
                    format='%(asctime)s - %(message)s')
                    
def scrape_job_listings():
    keywords = ["Python", "Data", "Remote"]
    url = "https://example.com/jobs"
    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        jobs = soup.find_all('h2', class_='job-title')

        for job in jobs:
            title = job.text.strip()
            # Log only the titles that contain one of the watched keywords
            if any(keyword.lower() in title.lower() for keyword in keywords):
                logging.info(f"Matching job found: {title}")
                # You could also send an email or a Slack message here

    except requests.RequestException as e:
        logging.error(f"Failed to fetch data: {e}")

Good logging ensures you can track what happened if the script fails or behaves unexpectedly — a vital step for long-term, automated processes.
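To act on the “send a Slack message” comment above, you can drop in a small helper like the sketch below. It posts matches to a Slack incoming webhook; the webhook URL is a placeholder you would create in your own workspace, and the same pattern works for email or any other HTTP-based alert:

# Hypothetical Slack incoming-webhook URL -- replace with your own
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def send_alert(message):
    """Post a short text alert to Slack; failures are logged, never raised."""
    try:
        resp = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
        resp.raise_for_status()
    except requests.RequestException as e:
        logging.error(f"Failed to send alert: {e}")

# Inside the keyword check you would then add:
#     send_alert(f"Matching job found: {title}")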

5. Running the Scraper as a Background Service

To make your scraper act like a true background service, use system tools like nohup or run it as a systemd service on Linux. Here’s how to run it on the command line:

nohup python scraper.py &

This detaches the process and lets it run even if the terminal closes. For Windows users, consider using pythonw.exe or setting up a scheduled task using Task Scheduler.
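If you prefer systemd over nohup, a minimal unit file looks roughly like this; the paths, user name, and service name are placeholders for your own setup:

# /etc/systemd/system/scraper.service  (illustrative paths and user)
[Unit]
Description=Scheduled web scraper
After=network-online.target

[Service]
ExecStart=/usr/bin/python3 /home/youruser/scraper.py
Restart=on-failure
User=youruser

[Install]
WantedBy=multi-user.target

Enable it with sudo systemctl enable --now scraper.service, and check its output with journalctl -u scraper.service.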

For cloud deployment, a small VPS such as a DigitalOcean droplet or a Heroku worker dyno can host your script around the clock. Alternatively, containerize your scraper with Docker for consistency across environments.
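If you go the Docker route, a minimal image can look like the sketch below; the file names and the requirements.txt are assumptions about how the project is laid out:

# Dockerfile -- illustrative; adjust file names to your project layout
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY scraper.py .
CMD ["python", "scraper.py"]

Build and run it with docker build -t scraper . followed by docker run -d --restart unless-stopped scraper.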

Conclusion

You now have all the tools to create a robust, scheduled web scraper that fetches data periodically and runs unattended. This architecture works great for job boards, stock tickers, news headlines, and any site where new content is worth monitoring. With a few adjustments — like result caching, deduplication, or database storage — you can evolve this simple script into a powerful data automation pipeline.

And the best part? Once setup is complete, you can truly set it and forget it.

 
