I just needed a little script to click around a bunch of pages, synchronously. It only needed to load the URLs, not actually do much with the content. Here's what I hacked up:
import random
import requests
from pyquery import PyQuery as pq
from urllib.parse import urljoin

session = requests.Session()
urls = []

def run(url):
    # Stop once we've visited roughly 100 URLs.
    if len(urls) > 100:
        return
    urls.append(url)
    html = session.get(url).content.decode('utf-8')
    try:
        d = pq(html)
    except ValueError:
        # Possibly weird Unicode errors on OSX due to lxml.
        return
    new_urls = []
    for a in d('a[href]').items():
        uri = a.attr('href')
        # Only follow root-relative links, skipping protocol-relative '//host/...' ones.
        if uri.startswith('/') and not uri.startswith('//'):
            new_url = urljoin(url, uri)
            if new_url not in urls:
                new_urls.append(new_url)
    random.shuffle(new_urls)
    [run(x) for x in new_urls]

run('http://localhost:8000/')
If you want to do this as a signed-in user, go to the site in your browser, open the Network tab of your Web Console, and copy the value of the Cookie request header.
Change that session.get(url) to something like:
html = session.get(url, headers={'Cookie': 'sessionid=i49q3o66anhvpdaxgldeftsul78bvrpk'}).content.decode('utf-8')
Now it can spider around your site for a little while as if you're logged in.
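If you'd rather not repeat that header on every single session.get call, requests sessions also let you set headers once for all requests. A minimal sketch, using the same example sessionid value:
# Set the Cookie header once on the session; every subsequent get() sends it.
session.headers.update({'Cookie': 'sessionid=i49q3o66anhvpdaxgldeftsul78bvrpk'})
html = session.get(url).content.decode('utf-8')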
Dirty. Simple. Fast.
Comments
Nice! You might also point your readers to the new Requests-HTML library by Kenneth Reitz
http://html.python-requests.org/
- PyQuery & Beautiful Soup based
- Python 3.6 only
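For the curious, a rough sketch of the same link extraction with Requests-HTML could look something like this (untested here; requests-html is Python 3.6+ only):
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('http://localhost:8000/')
# r.html.absolute_links is a set of fully-qualified URLs found on the page;
# keep only the ones pointing back at the same site.
links = {u for u in r.html.absolute_links if u.startswith('http://localhost:8000/')}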