I just needed a little script that clicks around a bunch of pages, synchronously. It just needed to load the URLs, not actually do much with the content. Here's what I hacked up:

import random
import requests
from pyquery import PyQuery as pq
from urllib.parse import urljoin

session = requests.Session()
urls = []

def run(url):
    if len(urls) > 100:
        return
    urls.append(url)
    try:
        html = session.get(url).content.decode('utf-8')
        d = pq(html)
    except ValueError:
        # Possibly weird Unicode errors on OSX due to lxml.
        return
    new_urls = []
    for a in d('a[href]').items():
        uri = a.attr('href')
        # Only follow same-site relative links ("/path"),
        # skipping protocol-relative ones ("//cdn.example.com/...").
        if uri.startswith('/') and not uri.startswith('//'):
            new_url = urljoin(url, uri)
            if new_url not in urls:
                new_urls.append(new_url)
    # Visit the found links in a random order.
    random.shuffle(new_urls)
    [run(x) for x in new_urls]
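The link-filtering part is the only bit with any subtlety, so here's that logic pulled out into a self-contained, stdlib-only sketch (the function name extract_internal is mine, not from the script above):

```python
from urllib.parse import urljoin

def extract_internal(base_url, hrefs):
    # Keep only same-site relative links ("/path"), skipping
    # protocol-relative ("//...") and absolute ("https://...") ones,
    # resolved against the page they were found on, deduplicated.
    found = []
    for uri in hrefs:
        if uri.startswith('/') and not uri.startswith('//'):
            full = urljoin(base_url, uri)
            if full not in found:
                found.append(full)
    return found

print(extract_internal(
    'https://example.com/page',
    ['/about', '//cdn.example.com/x.js', '/about', 'https://other.com/'],
))
# → ['https://example.com/about']
```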


If you want to do this as a signed-in user, go to the site in your browser, open the Network tab in your browser's developer tools, and copy the value of the Cookie request header from any request.
Change that session.get(url) to something like:

html = session.get(url, headers={'Cookie': 'sessionid=i49q3o66anhvpdaxgldeftsul78bvrpk'}).content.decode('utf-8')

Now it can spider around your site for a little while as if you're logged in.
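Alternatively, since we're already using a requests Session, you can set the cookie on the session once and leave every session.get(url) call untouched. A small sketch, reusing the session ID from above:

```python
import requests

session = requests.Session()
# Set the session cookie once; requests will then send it
# automatically with every request made through this session.
session.cookies.set('sessionid', 'i49q3o66anhvpdaxgldeftsul78bvrpk')

# From here on, a plain session.get(url) carries the cookie.
```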

Dirty. Simple. Fast.



Nice! You might also point your readers to the new Requests-HTML library by Kenneth Reitz
- PyQuery & Beautiful Soup based
- Python 3.6 only

