Scrape Medium Articles by Tag Using Python and BeautifulSoup
Scraping content from websites can be a powerful way to automate data collection for analysis, aggregation, or content discovery. In this tutorial, we’ll build a Python script using requests and BeautifulSoup to scrape article titles and links from Medium tag pages. We’ll also include robust error handling and clean, modular code so the solution can scale or adapt to other scraping use cases.
1. Understanding Medium’s Tag URLs and Structure
Medium allows users to browse stories by tags. A tag page URL looks like this: https://medium.com/tag/python. Each tag page displays a list of recently published articles under that tag. These pages are dynamic but still allow basic scraping of article links and titles using static GET requests.
Before we write any code, examine the HTML structure of a tag page using your browser’s DevTools (Right-click → Inspect). Articles are embedded within anchor tags that link to individual article pages.
We’re looking for:
```html
<a href="/some-article-path" ...> ... </a>
```
Not all <a> tags will be valid articles, so filtering by URL pattern or class is important, and the right filter depends on the current layout.
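If you want to confirm which link patterns the page actually serves before committing to a filter, a quick throwaway snippet (a sketch; the tag and the hrefs you will see are just examples) can print every anchor’s href for inspection:

```python
import requests
from bs4 import BeautifulSoup

# Exploration only: dump every anchor href on a tag page so you can
# spot the pattern that article links follow (e.g. paths containing '/p/').
resp = requests.get("https://medium.com/tag/python",
                    headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")
for a in soup.find_all("a", href=True):
    print(a["href"])
```

With the pattern confirmed, let’s begin coding.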
2. Setting Up the Project (Dependencies and Imports)
Start by creating a virtual environment (optional but recommended):
```bash
python -m venv venv
source venv/bin/activate   # or venv\Scripts\activate on Windows
```
Install the required packages:
```bash
pip install requests beautifulsoup4
```
Now import what we need:
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time
```
We’ll use requests for HTTP requests, BeautifulSoup to parse HTML, and urljoin for resolving relative URLs. time can help rate-limit requests.
3. Building the Scraper Logic
The function below fetches a tag page, parses it, and returns a list of article titles and URLs.
```python
def scrape_medium_tag(tag: str, max_articles: int = 10):
    """Fetch a Medium tag page and return up to max_articles titles and URLs."""
    base_url = f"https://medium.com/tag/{tag}"
    headers = {"User-Agent": "Mozilla/5.0"}

    try:
        # A timeout keeps a slow or unresponsive server from hanging the script.
        response = requests.get(base_url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Error fetching page: {e}")
        return []

    soup = BeautifulSoup(response.text, 'html.parser')
    articles = []
    seen_urls = set()

    # Look for all unique article links
    for link_tag in soup.find_all('a', href=True):
        href = link_tag['href']
        # Filter only article links with the '/p/' path format
        if '/p/' not in href:
            continue
        title = link_tag.get_text(strip=True)
        full_url = urljoin('https://medium.com', href)
        # Skip duplicates and anchors with no visible title text
        if title and full_url not in seen_urls:
            seen_urls.add(full_url)
            articles.append({"title": title, "url": full_url})
        if len(articles) >= max_articles:
            break

    return articles
```
Key explanations:
- We spoof the User-Agent header to reduce the chance of being blocked.
- We filter for Medium article links, which typically include /p/ in their URL path.
- We avoid duplicates and empty titles to ensure clean data.
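Before wiring the function into a CLI, you can sanity-check it from a Python shell (the tag is just an example; the exact results depend on what Medium currently serves):

```python
articles = scrape_medium_tag("python", max_articles=3)
for a in articles:
    print(a["title"], "->", a["url"])
```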
4. Putting It All Together With a CLI Interface
Let’s wrap the script in a basic CLI interface to make it more interactive.
```python
def main():
    tag = input("Enter Medium tag (e.g. python, design): ").strip()
    articles = scrape_medium_tag(tag)

    if not articles:
        print(f"No articles found for tag: {tag}")
        return

    print(f"\nTop {len(articles)} articles for #{tag}:\n")
    for i, article in enumerate(articles, 1):
        print(f"{i}. {article['title']}\n   {article['url']}\n")


if __name__ == '__main__':
    main()
```
This lets users enter any Medium tag and retrieve up to 10 articles by default. The script can be extended to save results to JSON or CSV, or to post-process the data for automation tasks.
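As a concrete example, here is a minimal sketch of saving results to JSON (the helper name and file path are illustrative choices, not part of the script above):

```python
import json

def save_articles_json(articles, path="articles.json"):
    # Persist the list of {"title": ..., "url": ...} dicts as pretty-printed JSON.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(articles, f, ensure_ascii=False, indent=2)

# Usage: save_articles_json(scrape_medium_tag("python"))
```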
5. Best Practices: Throttling, Error Resilience, and Future Proofing
As Medium's layout may change, it’s important to build scrapers defensively:
- Use try/except blocks around network and parsing operations.
- Add rate limiting if doing multiple page fetches, e.g. time.sleep(1) between requests (see the sketch after this list).
- Log errors clearly so broken structure is easy to debug later.
- Back up raw HTML pages for reproducibility and re-parsing.
- Abstract the parsing logic into small functions so you can adapt easily if Medium changes its DOM structure.
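Here is a minimal sketch of what throttling and logging might look like when scraping several tags in one run (the one-second default delay and the logger name are illustrative choices, not requirements):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("medium_scraper")

def scrape_many_tags(tags, delay_seconds=1.0):
    # Fetch each tag page in turn, pausing between requests to stay polite.
    results = {}
    for tag in tags:
        articles = scrape_medium_tag(tag)
        if not articles:
            # An empty result often means a network error or a changed layout.
            logger.warning("No articles parsed for tag %r", tag)
        results[tag] = articles
        time.sleep(delay_seconds)  # simple rate limiting between fetches
    return results
```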
In production, consider switching to headless browsers (e.g., Selenium or Playwright) if JavaScript rendering becomes necessary, but for basic extraction, BeautifulSoup remains lightweight and effective.
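If you go that route, a minimal Playwright sketch might look like this (assumes pip install playwright followed by playwright install chromium; the helper name is ours):

```python
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    # Launch headless Chromium, load the page, and return the rendered HTML.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        html = page.content()
        browser.close()
    return html

# The rendered HTML can then be parsed with BeautifulSoup exactly as before:
# soup = BeautifulSoup(fetch_rendered_html("https://medium.com/tag/python"),
#                      "html.parser")
```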
Conclusion
Congratulations! You've built a functional Medium tag scraper with Python. While we've focused on titles and URLs today, feel free to extend the script to extract publication dates and intros, or even fetch full article text. Scraping responsibly and respectfully, by limiting request frequency and obeying Medium's terms of service, is critical.
This kind of script is great for automating content curation, powering search interfaces, or feeding NLP pipelines with real user-written text. Happy scraping!