I just needed a little script to click around a bunch of pages, synchronously. It only needed to load the URLs, not actually do much with the content. Here's what I hacked up:
import random
import requests
from pyquery import PyQuery as pq
from urllib.parse import urljoin

session = requests.Session()
urls = []

def run(url):
    # Stop once we've visited roughly 100 URLs.
    if len(urls) > 100:
        return
    urls.append(url)
    html = session.get(url).content.decode('utf-8')
    try:
        d = pq(html)
    except ValueError:
        # Possibly weird Unicode errors on OSX due to lxml.
        return
    new_urls = []
    for a in d('a[href]').items():
        uri = a.attr('href')
        # Only follow root-relative links, skipping protocol-relative '//host/...' ones.
        if uri.startswith('/') and not uri.startswith('//'):
            new_url = urljoin(url, uri)
            if new_url not in urls:
                new_urls.append(new_url)
    random.shuffle(new_urls)
    [run(x) for x in new_urls]

run('http://localhost:8000/')
If you want to do this as a signed-in user, go to the site in your browser, open the Network tab of your Web Console, and copy the value of the Cookie request header.
Change that session.get(url) to something like:
html = session.get(url, headers={'Cookie': 'sessionid=i49q3o66anhvpdaxgldeftsul78bvrpk'}).content.decode('utf-8')
Now it can spider around your site for a little while as if you're logged in.
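If you'd rather not repeat that header on every single session.get call, requests sessions also let you set headers once for all requests. A minimal sketch, using the same example sessionid value:
# Set the Cookie header once on the session; every subsequent get() sends it.
session.headers.update({'Cookie': 'sessionid=i49q3o66anhvpdaxgldeftsul78bvrpk'})
html = session.get(url).content.decode('utf-8')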
Dirty. Simple. Fast.
Comments
Nice! You might also point your readers to the new Requests-HTML library by Kenneth Reitz
http://html.python-requests.org/
- PyQuery & Beautiful Soup based
- Python 3.6 only
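For the curious, a rough sketch of the same link extraction with Requests-HTML could look something like this (untested here; requests-html is Python 3.6+ only):
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('http://localhost:8000/')
# r.html.absolute_links is a set of fully-qualified URLs found on the page;
# keep only the ones pointing back at the same site.
links = {u for u in r.html.absolute_links if u.startswith('http://localhost:8000/')}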