Tech Tutorial - February 23 2026 173006

Welcome to today’s deep‑dive on asynchronous web scraping with Python. In a world where data moves at lightning speed, traditional blocking scrapers can become bottlenecks, especially when you need to crawl hundreds of pages per minute. By the end of this tutorial you’ll understand why async I/O matters, how to set up a robust scraper, and how to scale it for real‑world projects like price monitoring or news aggregation. Grab a cup of coffee, fire up your favorite IDE, and let’s start turning latency into opportunity.

Why Asynchronous Scraping Matters

When you request a web page, the majority of the time is spent waiting for the server to respond, not processing the HTML. In a synchronous loop, each request blocks the next, turning a 10‑second wait into a 10‑second delay per URL. Asynchronous programming lets you fire off many requests concurrently, keeping the CPU busy while the network does the heavy lifting. The result? You can fetch dozens or even hundreds of pages in the time it used to take to fetch a single one.
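To make the difference concrete, here is a small, self-contained sketch that simulates ten "requests" with asyncio.sleep. The network wait is imaginary, but the timing contrast is real: the sequential version pays for every wait in full, while the concurrent version overlaps them.

```python
import asyncio
import time

async def fake_request(i):
    # Simulate a network round-trip that spends its time waiting, not computing.
    await asyncio.sleep(0.1)
    return i

async def sequential(n):
    # Each await blocks the next request from starting.
    return [await fake_request(i) for i in range(n)]

async def concurrent(n):
    # All requests are in flight at once; total time is roughly one wait.
    return await asyncio.gather(*(fake_request(i) for i in range(n)))

start = time.perf_counter()
asyncio.run(sequential(10))
seq_elapsed = time.perf_counter() - start   # roughly 10 * 0.1 = 1 second

start = time.perf_counter()
asyncio.run(concurrent(10))
conc_elapsed = time.perf_counter() - start  # roughly 0.1 seconds overall

print(f"sequential: {seq_elapsed:.2f}s, concurrent: {conc_elapsed:.2f}s")
```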

Beyond raw speed, async scrapers are more resource‑efficient. They consume fewer threads, reducing memory overhead and avoiding the dreaded “too many open files” error that plagues thread‑based crawlers. Moreover, modern APIs like aiohttp integrate seamlessly with asyncio, giving you fine‑grained control over timeouts, retries, and connection pooling—all without the complexity of managing a thread pool.

Core Concepts to Grasp

Before diving into code, familiarize yourself with three key ideas: the event loop, coroutines, and tasks. The event loop is the engine that schedules and runs coroutines—functions defined with async def that can pause execution at await points. Tasks are wrappers that tell the loop to execute a coroutine concurrently. Understanding how these pieces fit together will make debugging far easier and will empower you to design more flexible pipelines.
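A minimal illustration of all three pieces: async def defines a coroutine, asyncio.create_task schedules it on the event loop, and asyncio.run drives the loop to completion. The names here are invented for the example:

```python
import asyncio

async def greet(name, delay):
    # A coroutine: execution pauses at each await, letting other tasks run.
    await asyncio.sleep(delay)
    return f"hello, {name}"

async def main():
    # Tasks tell the event loop to run both coroutines concurrently.
    t1 = asyncio.create_task(greet("alice", 0.2))
    t2 = asyncio.create_task(greet("bob", 0.1))
    # Awaiting a task suspends main() until that task finishes.
    return [await t1, await t2]

results = asyncio.run(main())
print(results)
```

Even though "bob" finishes first, awaiting the tasks in order keeps the results in the order you launched them.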

Another essential concept is back‑pressure handling. Websites often enforce rate limits or CAPTCHAs when they detect aggressive scraping. By leveraging asyncio.Semaphore you can throttle concurrent connections, ensuring you stay polite while still reaping the benefits of concurrency.

Setting Up the Environment

First, create an isolated virtual environment to keep dependencies tidy. Open a terminal and run:

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
pip install --upgrade pip

Next, install the core async libraries we’ll use throughout the tutorial:

pip install aiohttp beautifulsoup4 lxml

Optional but highly recommended is uvloop, a drop-in replacement for the default event loop that can meaningfully cut event-loop overhead on Linux and macOS (it does not support Windows):

pip install uvloop

Verifying the Installation

Once the packages are installed, verify the installation by launching a quick REPL session:

>>> import aiohttp, asyncio, bs4
>>> print(aiohttp.__version__, bs4.__version__)
3.9.5 4.12.2

If you see version numbers without errors, you’re ready to start coding. Remember to keep your requirements.txt up to date; this habit saves countless headaches when moving the project to CI/CD pipelines or cloud environments.

Building the First Async Scraper

Let’s build a minimal scraper that fetches a list of URLs and extracts the page title. The code below demonstrates a clean separation between the networking layer and the parsing logic, which makes testing and future extensions straightforward.

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch(session, url):
    # aiohttp expects a ClientTimeout object; passing a bare number is deprecated.
    timeout = aiohttp.ClientTimeout(total=10)
    async with session.get(url, timeout=timeout) as response:
        response.raise_for_status()
        return await response.text()

async def parse(html):
    soup = BeautifulSoup(html, "lxml")
    # soup.title.string can be None (e.g. nested tags inside <title>),
    # so guard both before calling .strip().
    if soup.title and soup.title.string:
        return soup.title.string.strip()
    return "No title"

async def scrape(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.create_task(fetch(session, url)) for url in urls]
        pages = await asyncio.gather(*tasks, return_exceptions=True)

        results = []
        for content in pages:
            if isinstance(content, Exception):
                results.append(str(content))
            else:
                results.append(await parse(content))
        return results

if __name__ == "__main__":
    url_list = [
        "https://www.python.org",
        "https://realpython.com",
        "https://news.ycombinator.com"
    ]
    titles = asyncio.run(scrape(url_list))
    for url, title in zip(url_list, titles):
        print(f"{url} → {title}")

This script does three things: opens a shared ClientSession, launches a coroutine per URL, and parses the HTML to pull the <title> tag. Notice the use of asyncio.gather with return_exceptions=True – it ensures that a single failing request doesn’t abort the entire batch.
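The return_exceptions=True behavior is easy to verify in isolation: failed coroutines show up as exception objects in the result list instead of propagating and cancelling the batch.

```python
import asyncio

async def ok():
    return "fine"

async def boom():
    raise ValueError("simulated failure")

async def demo():
    # Without return_exceptions=True, the ValueError would abort the gather.
    return await asyncio.gather(ok(), boom(), ok(), return_exceptions=True)

results = asyncio.run(demo())
print(results)
```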

Handling Rate Limits and Politeness

Scraping without respecting a site’s rate limits can land you on a blacklist or trigger legal warnings. A simple yet effective strategy is to wrap your fetch calls in a semaphore that caps the number of simultaneous connections.

MAX_CONCURRENT = 5
semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def safe_fetch(session, url):
    async with semaphore:
        return await fetch(session, url)

Replace the original fetch call in scrape with safe_fetch. You can also inject random delays with await asyncio.sleep(random.uniform(0.5, 1.5)) (after adding import random) to mimic human browsing patterns.

Pro tip: Combine a semaphore with exponential back‑off for retries. If a request fails with a 429 (Too Many Requests), wait 2ⁿ seconds before retrying, where n is the retry attempt count. This dramatically reduces the chance of being blocked.
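The back-off logic is independent of aiohttp, so it can be sketched as a generic wrapper around any async callable. retry_async and its parameters are names invented for this example; in the scraper you would pass a lambda wrapping fetch as the fetcher argument:

```python
import asyncio
import random

async def retry_async(fetcher, max_retries=4, base_delay=1.0):
    """Retry an async callable with exponential back-off plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return await fetcher()
        except Exception:
            if attempt == max_retries:
                raise
            # Wait base_delay * 2**attempt seconds, with a little jitter so
            # many clients don't retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            await asyncio.sleep(delay)

# Demo: a flaky "fetch" that fails twice, then succeeds.
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated 429")
    return "payload"

result = asyncio.run(retry_async(flaky, base_delay=0.01))
print(result, calls["n"])
```

In production you would catch aiohttp.ClientResponseError specifically and honor any Retry-After header before falling back to the computed delay.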

Scaling Up: Crawling Multiple Domains

Real‑world projects rarely scrape a static list of URLs. Instead, they discover new links on the fly, follow pagination, and often need to respect robots.txt rules. Let’s extend our scraper to recursively crawl a domain up to a configurable depth.

import re
from urllib.parse import urljoin, urlparse

async def extract_links(html, base_url):
    soup = BeautifulSoup(html, "lxml")
    links = set()
    for a_tag in soup.find_all("a", href=True):
        href = a_tag["href"]
        # Normalize relative URLs
        full_url = urljoin(base_url, href)
        # Filter out mailto, javascript, etc.
        if re.match(r'^https?://', full_url):
            links.add(full_url)
    return links

async def crawl(session, start_url, max_depth=2, visited=None):
    if visited is None:
        visited = set()
    if start_url in visited or max_depth < 0:
        return
    visited.add(start_url)

    try:
        html = await safe_fetch(session, start_url)
        print(f"Crawled: {start_url}")
        links = await extract_links(html, start_url)
        tasks = [
            asyncio.create_task(crawl(session, link, max_depth - 1, visited))
            for link in links
        ]
        await asyncio.gather(*tasks, return_exceptions=True)
    except Exception as e:
        print(f"Error on {start_url}: {e}")

async def main_crawl(seed):
    async with aiohttp.ClientSession() as session:
        await crawl(session, seed, max_depth=2)

if __name__ == "__main__":
    asyncio.run(main_crawl("https://news.ycombinator.com"))

The crawl function tracks visited URLs to avoid infinite loops and respects a maximum depth to prevent runaway recursion. By reusing the same ClientSession and semaphore, you keep the network footprint low while still harvesting a rich link graph.
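One thing the crawler still skips is robots.txt, even though polite crawlers should honor it. The standard library's urllib.robotparser covers the basics; the snippet below parses an inline rules string (invented for the example) rather than fetching over the network, so it stays self-contained:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, supplied inline for the example.
rules = """
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# In the crawler you would instead call parser.set_url(".../robots.txt") and
# parser.read() once per domain, then gate each fetch on can_fetch().
allowed = parser.can_fetch("*", "https://example.com/page")
blocked = parser.can_fetch("*", "https://example.com/private/data")
print(allowed, blocked)
```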

Storing Results Efficiently

Scraped data is only as valuable as the way you store and query it later. For lightweight projects, a CSV or JSON lines file suffices. However, production‑grade pipelines often push data into a document store like MongoDB or an analytics‑ready warehouse such as Snowflake.

  • JSON Lines: Easy to append, line‑delimited, works well with jq or Pandas.
  • SQLite: Zero‑configuration relational DB, perfect for modest datasets.
  • MongoDB: Schema‑flexible, supports rich queries on nested fields.
  • Kafka + ClickHouse: For high‑throughput streaming ingestion and real‑time analytics.
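For the SQLite option, the standard library's sqlite3 module is enough. Note that sqlite3 is blocking; in a real async pipeline you would wrap these calls in asyncio.to_thread or batch writes outside the hot path. The table and column names here are invented for the example:

```python
import sqlite3

# In-memory database for the demo; use a file path in practice.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        url   TEXT PRIMARY KEY,
        title TEXT
    )
""")

records = [
    ("https://www.python.org", "Welcome to Python.org"),
    ("https://news.ycombinator.com", "Hacker News"),
]
# INSERT OR REPLACE makes re-runs idempotent on the url primary key.
conn.executemany("INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
                 records)
conn.commit()

rows = conn.execute("SELECT url, title FROM pages ORDER BY url").fetchall()
print(rows)
```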

Below is a quick example that writes each page's URL and title to a JSON Lines file, keeping the operation non-blocking with aiofiles (install it first with pip install aiofiles):

import aiofiles
import json

async def write_jsonl(path, records):
    async with aiofiles.open(path, mode='a') as f:
        for rec in records:
            line = json.dumps(rec) + "\n"
            await f.write(line)

# Usage inside scrape()
results = [{"url": u, "title": t} for u, t in zip(urls, titles)]
await write_jsonl("output.jl", results)

Real‑World Use Case: Monitoring E‑Commerce Prices

Imagine you run a price‑comparison service that needs to track product costs across dozens of online retailers every hour. An async scraper can fetch all product pages concurrently, parse the price element, and push the data into a time‑series database for trend analysis.

import re
from datetime import datetime, timezone

PRICE_REGEX = re.compile(r'\$?(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)')

async def extract_price(html):
    soup = BeautifulSoup(html, "lxml")
    # Common patterns: span.price, div[data-price], etc.
    price_tag = soup.select_one('span.price, div[data-price]')
    if not price_tag:
        return None
    match = PRICE_REGEX.search(price_tag.get_text())
    return float(match.group(1).replace(',', '')) if match else None

async def monitor_prices(product_urls):
    async with aiohttp.ClientSession() as session:
        tasks = [safe_fetch(session, url) for url in product_urls]
        pages = await asyncio.gather(*tasks, return_exceptions=True)

        records = []
        for url, html in zip(product_urls, pages):
            if isinstance(html, Exception):
                continue
            price = await extract_price(html)
            if price is not None:
                records.append({
                    "url": url,
                    "price": price,
                    # datetime.utcnow() is deprecated; use an aware UTC timestamp.
                    "timestamp": datetime.now(timezone.utc).isoformat()
                })
        await write_jsonl("prices.jl", records)

if __name__ == "__main__":
    urls = [
        "https://example.com/product/123",
        "https://shop.example.org/item/456",
        # Add more product URLs here
    ]
    asyncio.run(monitor_prices(urls))
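Before pointing this at live pages, it is worth sanity-checking PRICE_REGEX offline against a few invented price strings:

```python
import re

PRICE_REGEX = re.compile(r'\$?(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)')

# Invented sample strings mapped to the value the regex should extract.
samples = {
    "$1,299.99": 1299.99,
    "Now only $45": 45.0,
    "999.00 USD": 999.0,
}
for text, expected in samples.items():
    match = PRICE_REGEX.search(text)
    value = float(match.group(1).replace(",", ""))
    assert value == expected, f"{text!r} parsed as {value}"
print("all samples matched")
```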

This snippet demonstrates how to isolate price extraction logic, handle missing elements gracefully, and store a timestamped record for each successful fetch. Coupled with a scheduler like cron or APScheduler, you can run this scraper every hour and feed the output into Grafana for visual monitoring.
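If you would rather not reach for cron or APScheduler, a plain asyncio loop can act as the scheduler. run_every and its max_runs parameter are invented names; max_runs exists only so the sketch terminates, and you would omit it (leaving an endless loop) in a real service:

```python
import asyncio

async def run_every(interval, job, max_runs=None):
    """Run an async job repeatedly, sleeping `interval` seconds between runs."""
    runs = 0
    while max_runs is None or runs < max_runs:
        await job()
        runs += 1
        if max_runs is None or runs < max_runs:
            await asyncio.sleep(interval)
    return runs

# Demo job: count invocations instead of scraping.
counter = {"n": 0}

async def job():
    counter["n"] += 1

runs = asyncio.run(run_every(0.01, job, max_runs=3))
print(runs, counter["n"])
```

In the price monitor, the job would simply be a coroutine that calls monitor_prices with your URL list and an hourly interval.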

Deploying to Production

When you move from a local script to a production service, consider containerizing the scraper with Docker. A typical Dockerfile would install the dependencies, copy the source code, and set the entrypoint to python -m my_scraper. Use Kubernetes Jobs or AWS Batch to run the scraper on a schedule, ensuring that each run gets a fresh environment and isolated resources.
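A minimal Dockerfile along those lines might look like this; my_scraper and the file layout are assumptions, so adjust the paths and entrypoint to match your project:

```dockerfile
FROM python:3.12-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer across builds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

ENTRYPOINT ["python", "-m", "my_scraper"]
```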

Pro tip: aiohttp does not support HTTP/2, and passing force_close=True to TCPConnector actually disables keep-alive connections rather than enabling multiplexing. To reduce latency, tune the connection pool instead, e.g. connector=aiohttp.TCPConnector(limit=20, limit_per_host=5); if you genuinely need HTTP/2 multiplexing, the httpx library supports it via httpx.AsyncClient(http2=True).

Common Pitfalls and Debugging

Even seasoned developers stumble over a few recurring issues when building async scrapers. Below is a quick checklist to help you diagnose problems before they become show‑stoppers.

  1. Forgot to await a coroutine – This results in a RuntimeWarning: coroutine '...' was never awaited. Always use await or wrap the coroutine in asyncio.create_task.
  2. Blocking I/O in an async function – Calls to time.sleep or heavy CPU work will block the event loop. Replace them with await asyncio.sleep or offload CPU‑bound tasks to a thread pool via run_in_executor.
  3. Unclosed ClientSession – Not using an async context manager can leak sockets. Ensure ClientSession is wrapped in async with or explicitly closed.
  4. SSL verification errors – Some sites use self‑signed certificates. While you can disable verification with ssl=False, it’s safer to provide a custom SSLContext that trusts the required CA.
  5. Rate‑limit bans – If you receive 429 or 503 repeatedly, back off aggressively and respect Retry-After headers.
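Pitfall 2 deserves a concrete illustration. asyncio.to_thread (Python 3.9+) pushes a blocking call onto a worker thread so the event loop keeps serving other tasks; parse_blocking is an invented stand-in for any CPU-heavy or blocking function:

```python
import asyncio
import time

def parse_blocking(html):
    # Stand-in for blocking work (heavy parsing, time.sleep, sync I/O).
    time.sleep(0.2)
    return len(html)

async def heartbeat(ticks):
    # Keeps ticking while the blocking call runs on a worker thread,
    # proving the event loop was never blocked.
    for _ in range(5):
        await asyncio.sleep(0.03)
        ticks.append(time.perf_counter())

async def main():
    ticks = []
    length, _ = await asyncio.gather(
        asyncio.to_thread(parse_blocking, "<html></html>"),
        heartbeat(ticks),
    )
    return length, ticks

length, ticks = asyncio.run(main())
print(length, len(ticks))
```

Had parse_blocking been called directly inside main(), the heartbeat would have stalled for the full 0.2 seconds.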

For deeper inspection, the aiohttp logger can be configured to emit debug information:

import logging
logging.basicConfig(level=logging.DEBUG)
aiohttp_logger = logging.getLogger("aiohttp.client")
aiohttp_logger.setLevel(logging.DEBUG)

These logs reveal request headers, connection reuse, and retry attempts, making it easier to pinpoint where things go awry.

Conclusion

Asynchronous web scraping unlocks a new level of performance, allowing you to gather data at scale while keeping resource consumption low. By mastering asyncio, aiohttp, and best‑practice patterns like semaphores, exponential back‑off, and non‑blocking file I/O, you’ll be equipped to build resilient pipelines for everything from price monitoring to news aggregation.
