I needed a little script to click around a bunch of pages synchronously. It only had to load the URLs, not actually do much with the content. Here's what I hacked up:


import random
import requests
from pyquery import PyQuery as pq
from urllib.parse import urljoin


session = requests.Session()
urls = []


def run(url):
    if len(urls) > 100:
        # Stop after ~100 pages so it doesn't run forever.
        return
    urls.append(url)
    html = session.get(url).content.decode('utf-8')
    try:
        d = pq(html)
    except ValueError:
        # Possibly weird Unicode errors on OSX due to lxml.
        return
    new_urls = []
    for a in d('a[href]').items():
        uri = a.attr('href')
        # Only follow root-relative links so we stay on the same site.
        if uri.startswith('/') and not uri.startswith('//'):
            new_url = urljoin(url, uri)
            if new_url not in urls:
                new_urls.append(new_url)
    random.shuffle(new_urls)
    for x in new_urls:
        run(x)

run('http://localhost:8000/')
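
If you want a rough sense of how many pages it hit and how long that took, you could swap that last line for something like this (a small sketch; the timing and print are my own additions):


import time

start = time.time()
run('http://localhost:8000/')
print(f'Spidered {len(urls)} URLs in {time.time() - start:.1f} seconds')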

If you want to do this as a signed-in user, go to the site in your browser, open the Network tab in the developer tools, and copy the value of the Cookie request header.
Then change that session.get(url) to something like:


html = session.get(url, headers={'Cookie': 'sessionid=i49q3o66anhvpdaxgldeftsul78bvrpk'}).content.decode('utf-8')

Now it can spider around your site for a little while as if you're logged in.
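
Alternatively, you can set the cookie once on the session object and leave the session.get(url) line alone, since a requests.Session sends its cookies with every request. A small sketch, reusing that same sessionid value from above:


# Set the cookie once; the session sends it along on every subsequent request.
session.cookies.set('sessionid', 'i49q3o66anhvpdaxgldeftsul78bvrpk')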

Dirty. Simple. Fast.

Comments

Anonymous

Nice! You might also point your readers to the new Requests-HTML library by Kenneth Reitz
http://html.python-requests.org/
- PyQuery & Beautiful Soup based
- Python 3.6 only

