Peterbe.com

A blog and website by Peter Bengtsson

Filtered home page!
Currently only showing blog entries under the category: Django. Clear filter

Podcasttime.io - How Much Time Do Your Podcasts Take To Listen To?

13 February 2017 3 comments   ReactJS, Javascript, Django, Web development, Python

https://podcasttime.io/about


tl;dr; It's a web app where you search and find the podcasts you listen to. It then gives you a break down how much time that requires to keep up, per day, per week and per month. Podcasttime.io

Podcasttime.io on Firefox iOS
First I wrote some scripts to scrape various sources of podcasts. This is basically a RSS feed URL from which you can fetch the name and an image. And with some cron jobs you can download and parse each podcast feed and build up an index of how many episodes they have and how long each episode is. Together with each episodes "publish date" you can easily figure out an average of how much content each podcast puts out over time.

Suppose you listen to JavaScript Air, Talk Python To Me and Google Cloud Platform Podcast for example, that means you need to listen to podcasts for about 8 minutes per day to keep up.

The Back End

The technology is exciting. The backend is a Django 1.10 server. It manages a PostgreSQL database of all the podcasts, episodes, cron jobs etc. Through Django ORM signals is packages up each podcast with its metadata and stores it in an Elasticsearch database. All the communication between Django and ElasticSearch is done with Elasticsearch DSL.

Also, all the downloading and parsing of feeds is done as background tasks in Celery. This got really interesting/challenging because sooo many podcasts are poorly marked up and many a times the only way to find out how long an episode is is to use ffmpeg to probe it and that takes time.

Another biggish challenge is that fact that often things simply don't work because of networks being what they are, unreliable. So you have to re-attempt network calls without accidentally getting caught in infinite loops of accidentally putting a bad/broken RSS feed back into the background queue again and again and again.

The Front End

Actually, the first prototype of this app was written with Django as the front end plus some jQuery to tie things together. On a plane ride, and as an excuse to learn it, I re-wrote the whole thing in React with Redux. To be honest, I never really enjoyed that and it felt like everything was hard and I had to do more jumping-around-files than actual coding. In particular, Redux is nice but when you have a lot of AJAX both inside components and upon mounting it gets quite messy in my humble opinion.

So, on another plane ride (to Hawaii, so I had more time) I re-wrote it from scratch but this time using three beautiful pieces of front end technology: create-react-app, Mobx and mobx-router. Suddenly it became fun again. Mobx (or Redux or something "fluxy") is necessary if you want fancy pushState URLs AND a central (aka global) state management.

To be perfectly honest, I never actually tried combining Mobx with something like react-router or if it's even possible. But with mobx-router it's quite neat. You write a "views route map" (see example) where you can kick off AJAX before entering (and leaving) routes. Then you use that to populate a global store and now all components can be almost entirely about simply rendering the store. There is some AJAX within the mounted components (e.g. the search and autocomplete).

Plotly graph
On the home page, there's a chart that rather unscientifically plots episode durations over time as a line chart. I'm trying a library called Plotly which is actually a online app for building charts but they offer a free JavaScript library too for generating graphs. Not entirely sure how I feel about it yet but apart from looking a big crowded on mobile, it's working really well.

A Killer Feature

This is a pattern I've wanted to build but never managed to get right. The way to get data about a podcast (and its episodes) is to do an Elasticsearch search. From the homepage you basically call /find?q=Planet%20money when you search. That gives you almost all the information you need. So you store that in the global store. Then, if the user clicks on that particular podcast to go to its "perma page" you can simply load that podcast's individual route and you don't need to do something like /find?id=727 because you already have everything you need. If the user then opens that page in a new tab or reloads you now have to fetch just the one podcast, so you simply call /find?id=727. In other words, subsequent page loads load instantly! (Basically, it updates the store's podcast object upon clicking any of the podcasts iterated over from the listing. Code here)

And to top that - and this is where a good router shines - if you make a search or something, click something and click back since you have a global store of state, you can simply reuse that without needing another AJAX query.

The State of the Future

First of all, this is a fun little side project and it's probably buggy. My goal is not to make money on it but to build up a graph. Every time someone uses the site and finds the podcasts they listen to that slowly builds up connections. If you listen to "The Economist", "Planet Money" and "Freakonomics", that tie those together loosely. It's hard to programmatically know that those three podcasts are "related" but they are by "peoples' taste".

The ultimate goal of this is; now I can recommend other podcasts based on a given set. It's a little bit like LastFM used to work. Using Audioscrobbler LastFM was able to build up a graph based on what people preferred to listen to and using that network of knowledge they can recommend things you have not listened to but probably would appreciate.

At the moment, there's a simple Picks listing of "lists" (aka "picks") that people have chosen. With enough time and traffic I'll try to use Elasticsearch's X-Pack Graph capabilities to develop a search engine based on this.

At the time of writing, I've indexed 4,669 podcasts, spanning 611,025 episodes which equates to 549,722 hours of podcast content.

The Code

The front end code is available on github.com/peterbe/podcasttime2 and is relatively neat and tidy. The most interesting piece is probably the views/index.js which is the "controller" of things. That's where it decides which component to render, does the AJAX queries and manages the global store.

The back end code is a bit messier. It's done as an "app" as part of this very blog. The way the Elasticsearch indexing is configured is here and the hotch potch code for scraping and parsing RSS feeds is here.

Please try it out and show me your selection. You can drop feedback here.

Using Fanout.io in Django

13 December 2016 0 comments   Python, Web development, Django, Mozilla, Javascript


Earlier this year we started using Fanout.io in Air Mozilla to enhance the experience for users awaiting content updates. Here I hope to flesh out its details a bit to inspire others to deploy a similar solution.

What It Is

First of all, Fanout.io is basically a service that handles your WebSockets. You put in some of Fanout's JavaScript into your site that handles a persistent WebSocket connection between your site and Fanout.io. And to push messages to your user you basically send them to Fanout.io from the server and they "forward" it to the WebSocket.

The HTML page looks like this:

<html>
<body>

  <h1>Web Page</h1>

<!-- replace the FANOUT_REALM_ID with the ID you get in the Fanout.io admin page -->
<script 
  src="https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux/static/faye-browser-1.1.2-fanout1-min.js"
></script>
<script src="fanout.js"></script>
</body>
</html>

And the fanout.js script looks like this:

window.onload = function() {
  // replace the FANOUT_REALM_ID with the ID you get in the Fanout.io admin page
  var client = new Faye.Client('https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux')
  client.subscribe('/mycomments', function(data) {  
     console.log('Incoming updated data from the server:', data);
  })
};

And in server it looks something like this:

from django.conf import settings
import fanout

fanout.realm = settings.FANOUT_REALM_ID
fanout.key = settings.FANOUT_REALM_KEY


def post_comment(request):
    """A django view function that saves the posted comment"""
   text = request.POST['comment']
   saved_comment = Comment.objects.create(text=text, user=request.user)
   fanout.publish('mycomments', {'new_comment': saved_comment.id})
   return http.JsonResponse({'comment_posted': True})

Note that, in the client-side code, there's no security since there's no authentication. Any client can connect to any channel. So it's important that you don't send anything sensitive. In fact, you should think of this pattern simply as a hint that something has changed. For example, here's a slightly more fleshed out example of how you'd use the subscription.

window.onload = function() {
  // replace the FANOUT_REALM_ID with the ID you get in the Fanout.io admin page
  var client = new Faye.Client('https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux')
  client.subscribe('/mycomments', function(data) {  
    if (data.new_comment) {
      // server says a new comment has been posted in the server
      $.json('/comments', function(response) {
        $('#comments .comment').remove();
        $.each(response.comments, function(comment) {        
          $('<div class="comment">')
          .append($('<p>').text(comment.text))
          .append($('<span>').text('By: ' + comment.user.name))
          .appendTo('#comments');
        });
      });
    }
  })
};

Yes, I know jQuery isn't hip but it demonstrates the pattern well. Also, in the real world you might not want to ask the server for all comments (and re-render) but instead do an AJAX query to get all new comments since some parameter or something.

Why It's Awesome

It's awesome because you can have a simple page that updates near instantly when the server's database is updated. The alternative would be to do a setInterval loop that frequently does an AJAX query to see if there's new content to update. This is cumbersome because it requires a lot heavier AJAX queries. You might want to make it secure so you engage sessions that need to be looked up each time. Or, since you're going to request it often you have to write a very optimized server-side endpoint that is cheap to query often.

And last but not least, if you rely on an AJAX loop interval, you have to pick a frequency that your server can cope with and it's likely to be in the range of several seconds or else it might overload the server. That means that updates are quite delayed.

But maybe most important, you don't need to worry about running a WebSocket server. It's not terribly hard to do one yourself on your laptop with a bit of Node Express or Tornado but now you have yet another server to maintain and it, internally, needs to be connected to a "pub-sub framework" like Redis or a full blown message queue.

Alternatives

Fanout.io is not the only service that offers this. The decision to use Fanout.io was taken about a year ago and one of the attractive things it offers is that it's got a freemium option which is ideal for doing local testing. The honest truth is that I can't remember the other justifications used to chose Fanout.io over its competitors but here are some alternatives that popped up on a quick search:

It seems they all (including Fanout.io) has freemium plans, supports authentication, REST APIs (for sending and for querying connected clients' stats).

There are also some more advanced feature packed solutions like Meteor, Firebase and GunDB that act more like databases that are connected via WebSockets or alike. For example, you can have a database as a "conduit" for pushing data to a client. Meaning, instead of sending the data from the server directly you save it in a database which syncs to the connected clients.

Lastly, I've heard that Heroku has a really neat solution that does something similar whereby it sets up something similar as an extension.

Let's Get Realistic

The solution sketched out above is very simplistic. There are a lot more fine-grained details that you'd probably want to zoom in to if you're going to do this properly.

Throttling

In Air Mozilla, we call fanout.publish(channel, message) from a post_save ORM signal. If you have a lot of saves for some reason, you might be sending too many messages to the client. A throttling solution, per channel, simply makes sure your "callback" gets called only once per channel per small time frame. Here's the solution we employed:

window.Fanout = (function() {
  var _locks = {};
  return {
    subscribe: function subscribe(channel, callback) {
      _client.subscribe(channel, function(data) {
          if (_locks[channel]) {
              // throttled
              return;
          }
          _locks[channel] = true;
          callback(data);
          setTimeout(function() {
              _locks[channel] = false;
          }, 500);
      });        
    };
  }
})();

Subresource Integrity

Subresource integrity is an important web security technique where you know in advance a hash of the remote JavaScript you include. That means that if someone hacks the result of loading https://cdn.example.com/somelib.js the browser compares the hash of that with a hash mentioned in the <script> tag and refuses to load it if the hash doesn't match.

In the example of Fanout.io it actually looks like this:

<script 
  src="https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux/static/faye-browser-1.1.2-fanout1-min.js"
  crossOrigin="anonymous"
  integrity="sha384-/9uLm3UDnP3tBHstjgZiqLa7fopVRjYmFinSBjz+FPS/ibb2C4aowhIttvYIGGt9"
></script>

The SHA you get from the Fanout.io documentation. It requires, and implies, that you need to use an exact version of the library. You can't use it like this: <script src="https://cdn.example/somelib.latest.min.js" ....

WebSockets vs. Long-polling

Fanout.io's JavaScript client follows a pattern that makes it compatible with clients that don't support WebSockets. The first technique it uses is called long-polling. With this the server basically relys on standard HTTP techniques but the responses are long lasting instead. It means the request simply takes a very long time to respond and when it does, that's when data can be passed.

This is not a problem for modern browsers. They almost all support WebSocket but you might have an application that isn't a modern browser.

Anyway, what Fanout.io does internally is that it first creates a long-polling connection but then shortly after tries to "upgrade" to WebSockets if it's supported. However, the projects I work only need to support modern browsers and there's a trick to tell Fanout to go straight to WebSockets:

var client = new Faye.Client('https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux', {
    // What this means is that we're opting to have
    // Fanout *start* with fancy-pants WebSocket and
    // if that doesn't work it **falls back** on other
    // options, such as long-polling.
    // The default behaviour is that it starts with
    // long-polling and tries to "upgrade" itself
    // to WebSocket.
    transportMode: 'fallback'
});

Fallbacks

In the case of Air Mozilla, it already had a traditional solution whereby it does a setInterval loop that does an AJAX query frequently.

Because the networks can be flaky or because something might go wrong in the client, the way we use it is like this:

var RELOAD_INTERVAL = 5;  // seconds

if (typeof window.Fanout !== 'undefined') {
    Fanout.subscribe('/' + container.data('subscription-channel-comments'), function(data) {
        // Supposedly the comments have changed.
        // For security, let's not trust the data but just take it
        // as a hint that it's worth doing an AJAX query
        // now.
        Comments.load(container, data);
    });
    // If Fanout doesn't work for some reason even though it
    // was made available, still use the regular old
    // interval. Just not as frequently.
    RELOAD_INTERVAL = 60 * 5;
}
setInterval(function() {
    Comments.reload_loop(container);
}, RELOAD_INTERVAL * 1000);

Use Fanout Selectively/Progressively

In the case of Air Mozilla, there are lots of pages. Some don't ever need a WebSocket connection. For example, it might be a simple CRUD (Create Update Delete) page. So, for that I made the whole Fanout functionality "lazy" and it only gets set up if the page has some JavaScript that knows it needs it.

This also has the benefit that the Fanout resource loading etc. is slightly delayed until more pressing things have loaded and the DOM is ready.

You can see the whole solution here. And the way you use it here.

Have Many Channels

You can have as many channels as you like. Don't create a channel called comments when you can have a channel called comments-123 where 123 is the ID of the page you're on for example.

In the case of Air Mozilla, there's a channel for every single page. If you're sitting on a page with a commenting widget, it doesn't get WebSocket messages about newly posted comments on other pages.

Conclusion

We've now used Fanout for almost a year in our little Django + jQuery app and it's been great. The management pages in Air Mozilla use AngularJS and the integration looks like this in the event manager page:

window.Fanout.subscribe('/events', function(data) {
    $scope.$apply(lookForModifiedEvents);
});

Fanout.io's been great to us. Really responsive support and very reliable. But if I were to start a fresh new project that needs a solution like this I'd try to spend a little time to investigate the competitors to see if there are some neat features I'd enjoy.

UPDATE

Fanout reached out to help explain more what's great about Fanout.io

"One of Fanout's biggest differentiators is that we use and promote open technologies/standards. For example, our service supports the open Bayeux protocol, and you can connect to it with any compatible client library, such as Faye. Nearly all competing services have proprietary protocols. This "open" aspect of Fanout aligns pretty well with Mozilla's values, and in fact you'd have a hard time finding any alternative that works the same way."

Optimization of QuerySet.get() with or without select_related

03 November 2016 1 comment   Python, Django, PostgreSQL


If you know you're going to look up a related Django ORM object from another one, Django automatically takes care of that for you.

To illustrate, imaging a mapping that looks like this:

class Artist(models.Models):
    name = models.CharField(max_length=200)
    ...

class Song(models.Models):
    artist = models.ForeignKey(Artist)
    ...

And with that in mind, suppose you do this:

>>> Song.objects.get(id=1234567).artist.name
'Frank Zappa'

Internally, what Django does is that it looks the Song object first, then it does a look up automatically on the Artist. In PostgreSQL it looks something like this:

SELECT "main_song"."id", "main_song"."artist_id", ... FROM "main_song" WHERE "main_song"."id" = 1234567
SELECT "main_artist"."id", "main_artist"."name", ... FROM "main_artist" WHERE "main_artist"."id" = 111

Pretty clear. Right.

Now if you know you're going to need to look up that related field you can ask Django to make a join before the lookup even happens. It looks like this:

>>> Song.objects.select_related('artist').get(id=1234567).artist.name
'Frank Zappa'

And the SQL needed looks like this:

SELECT "main_song"."id", ... , "main_artist"."name", ... 
FROM "main_song" INNER JOIN "main_artist" ON ("main_song"."artist_id" = "main_artist"."id") WHERE "main_song"."id" = 1234567

The question is; which is fastest?

Well, there's only one way to find out and that is to measure with some relatistic data.

Here's the benchmarking code:

def f1(id):
    try:
        return Song.objects.get(id=id).artist.name
    except Song.DoesNotExist:
        pass

def f2(id):
    try:
        return Song.objects.select_related('artist').get(id=id).artist.name
    except Song.DoesNotExist:
        pass

def _stats(r):
    #returns the median, average and standard deviation of a sequence
    tot = sum(r)
    avg = tot/len(r)
    sdsq = sum([(i-avg)**2 for i in r])
    s = list(r)
    s.sort()
    return s[len(s)//2], avg, (sdsq/(len(r)-1 or 1))**.5

times = defaultdict(list)
functions = [f1, f2]
for id in range(100000, 103000):
    for f in functions:
        t0 = time.time()
        r = f(id)
        t1 = time.time()
        if r:
            times[f.__name__].append(t1-t0)
    # Shuffle the order so that one doesn't benefit more
    # from deep internal optimizations/caching in Postgre.
    random.shuffle(functions)

for k, values in times.items():
    print(k, [round(x * 1000, 2) for x in _stats(values)])

For the record, here are the parameters of this little benchmark:

The Result

Function Median Average Std Dev
f1 3.19ms 9.17ms 19.61ms
f2 2.28ms 6.28ms 15.30ms

The Conclusion

If you use the median, using select_related is 30% faster and if you use the average, using select_related is 46% faster.

So, if you know you're going to need to do that lookup put in .select_related(relation) before every .get(id=...) in your Django code.

Deep down in PostgreSQL, the inner join is ultimately two ID-by-index lookups. And that's what the first method is too. It's likely that the reason the inner join approach is faster is simply because there's less connection overheads.

Lastly, YOUR MILEAGE WILL VARY. Every benchmark is flawed but this quite realistic because it's not trying to be optimized in either way.

Django test optimization with no-op PIL engine

27 October 2016 6 comments   Python, Django


The Air Mozilla project is a regular Django webapp. It's reasonably big for a more or less one man project. It's ~200K lines of Python and ~100K lines of JavaScript. There are 816 "unit tests" at the time of writing. Most of them are kinda typical Django tests. Like:

def test_some_feature(self):
    thing = MyModel.objects.create(key='value')
    url = reverse('namespace:name', args=(thing.id,))
    response = self.client.get(url)
    ....

Also, the site uses sorl.thumbnail to automatically generate thumbnails from uploaded images. It's a great library.

However, when running tests, you almost never actually care about the image itself. Your eyes will never feast on them. All you care about is that there is an image, that it was resized and that nothing broke. You don't write tests that checks the new image dimensions of a generated thumbnail. If you need tests that go into that kind of detail, it best belongs somewhere else.

So, I thought, why not fake ALL operations that are happening inside sorl.thumbnail to do with resizing and cropping images.

Here's the changeset that does it. Note, that the trick is to override the default THUMBNAIL_ENGINE that sorl.thumbnail loads. It usually defaults to sorl.thumbnail.engines.pil_engine.Engine and I just wrote my own that does no-ops in almost every instance.

I admittedly threw it together quite quickly just to see if it was possible. Turns out, it was.

# Depends on setting something like:
#    THUMBNAIL_ENGINE = 'airmozilla.base.tests.testbase.FastSorlEngine'
# in your settings specifically for running tests.


from sorl.thumbnail.engines.base import EngineBase


class _Image(object):
    def __init__(self):
        self.size = (1000, 1000)
        self.mode = 'RGBA'
        self.data = '\xa0'


class FastSorlEngine(EngineBase):

    def get_image(self, source):
        return _Image()

    def get_image_size(self, image):
        return image.size

    def _colorspace(self, image, colorspace):
        return image

    def _scale(self, image, width, height):
        image.size = (width, height)
        return image

    def _crop(self, image, width, height, x_offset, y_offset):
        image.size = (width, height)
        return image

    def _get_raw_data(self, image, *args, **kwargs):
        return image.data

    def is_valid_image(self, raw_data):
        return bool(raw_data)

So, was it much faster?

It's hard to measure because the time it takes to run the whole test suite depends on other stuff going on on my laptop during the long time it takes to run the tests. So I ran them 8 times with the old code and 8 times with this new hack.

Iteration Before After
1 82.789s 73.519s
2 82.869s 67.009s
3 77.100s 60.008s
4 74.642s 58.995s
5 109.063s 80.333s
6 100.452s 81.736s
7 85.992s 61.119s
8 82.014s 73.557s
Average 86.865s 69.535s
Median 82.869s 73.519s
Std Dev 11.826s 9.0757s

So rougly 11% faster. Not a lot but it adds up when you're doing test-driven development or debugging where you run a suite or a test over and over as you're saving the files/tests you're working on.

Room for improvement

In my case, it just worked with this simple solution. Your site might do fancier things with the thumbnails. Perhaps we can combine forces on this and finalize a working solution into a standalone package.

django-html-validator - now locally, fast!

12 August 2016 1 comment   Python, Web development, Django


A couple of years ago I released a project called django-html-validator (GitHub link) and it's basically a Django library that takes the HTML generated inside Django and sends it in for HTML validation.

The first option is to send the HTML payload, over HTTPS, to https://validator.nu/. Not only is this slow but it also means sending potentially revealing HTML. Ideally you don't have any passwords in your HTML and if you're doing HTML validation you're probably testing against some test data. But... it sucked.

The other alternative was to download a vnu.jar file from the github.com/validator/validator project and executing it in a subprocess with java -jar vnu.jar /tmp/file.html. Problem with this is that it's really slow because java programs take such a long time to boot up.

But then, at the beginning of the year some contributors breathed fresh life into the project. Python 3 support and best of all; the ability to start the vnu.jar as a local server on http://localhost:8888 and HTTP post HTML over to that. Now you don't have to pay the high cost of booting up a java program and you don't have to rely on a remote HTTP call.

Now it becomes possible to have HTML validation checked on every rendered HTML response in the Django unit tests.

To try it, check out the new instructions on "Setting the vnu.jar path".

The contributor who's made this possible is Ville "scop" Skyttä, as well as others. Thanks!!

How to track Google Analytics pageviews on non-web requests (with Python)

03 May 2016 1 comment   Python, Web development, Django, Mozilla


tl;dr; Use raven's ThreadedRequestsHTTPTransport transport class to send Google Analytics pageview trackings asynchronously to Google Analytics to collect pageviews that aren't actually browser pages.

We have an API on our Django site that was not designed from the ground up. We had a bunch of internal endpoints that were used by the website. So we simply exposed those as API endpoints that anybody can query. All we did was wrap certain parts carefully as to not expose private stuff and we wrote a simple web page where you can see a list of all the endpoints and what parameters are needed. Later we added auth-by-token.

Now the problem we have is that we don't know which endpoints people use and, as equally important, which ones people don't use. If we had more stats we'd be able to confidently deprecate some (for easier maintanenace) and optimize some (to avoid resource overuse).

Our first attempt was to use statsd to collect metrics and display those with graphite. But it just didn't work out. There are just too many different "keys". Basically, each endpoint (aka URL, aka URI) is a key. And if you include the query string parameters, the number of keys just gets nuts. Statsd and graphite is better when you have about as many keys as you have fingers on one hand. For example, HTTP error codes, 200, 302, 400, 404 and 500.

Also, we already use Google Analytics to track pageviews on our website, which is basically a measure of how many people render web pages that have HTML and JavaScript. Google Analytic's UI is great and powerful. I'm sure other competing tools like Mixpanel, Piwik, Gauges, etc are great too, but Google Analytics is reliable, likely to stick around and something many people are familiar with.

So how do you simulate pageviews when you don't have JavaScript rendering? The answer; using plain HTTP POST. (HTTPS of course). And how do you prevent blocking on sending analytics without making your users have to wait? By doing it asynchronously. Either by threading or a background working message queue.

Threading or a message queue

If you have a message queue configured and confident in its running, you should probably use that. But it adds a certain element of complexity. It makes your stack more complex because now you need to maintain a consumer(s) and the central message queue thing itself. What if you don't have a message queue all set up? Use Python threading.

To do the threading, which is hard, it's always a good idea to try to stand on the shoulder of giants. Or, if you can't find a giant, find something that is mature and proven to work well over time. We found that in Raven.

Raven is the Python library, or "agent", used for Sentry, the open source error tracking software. As you can tell by the name, Raven tries to be quite agnostic of Sentry the server component. Inside it, it has a couple of good libraries for making threaded jobs whose task is to make web requests. In particuarly, the awesome ThreadedRequestsHTTPTransport. Using it basically looks like this:

import urlparse
from raven.transport.threaded_requests import ThreadedRequestsHTTPTransport

transporter = ThreadedRequestsHTTPTransport(
    urlparse.urlparse('https://ssl.google-analytics.com/collect'),
    timeout=5
)

params = {
    ...more about this later...
}

def success_cb():
    print "Yay!"

def failure_cb(exception):
    print "Boo :("

transporter.async_send(
    params,
    headers,
    success_cb,
    failure_cb
)

The call isn't very different from regular plain old requests.post.

About the parameters

This is probably the most exciting part and the place where you need some thought. It's non-trivial because you might need to put some careful thought into what you want to track.

Your friends is: This documentation page

There's also the Hit Builder tool where you can check that the values you are going to send make sense.

Some of the basic ones are easy:

"Protocol Version"

Just set to v=1

"Tracking ID"

That code thing you see in the regular chunk of JavaScript you put in the head, e.g tid=UA-1234-Z

"Data Source"

Optional word you call this type of traffic. We went with ds=api because we use it to measure the web API.

The user ones are a bit more tricky. Basically because you don't want to accidentally leak potentially sensitive information. We decided to keep this highly anonymized.

"Client ID"

A random UUID (version 4) number that identifies the user or the app. Not to be confused with "User ID" which is basically a string that identifies the user's session storage ID or something. Since in our case we don't have a user (unless they use an API token) we leave this to a new random UUID each time. E.g. cid=uuid.uuid4().hex This field is not optional.

"User ID"

Some string that identifies the user but doesn't reveal anything about the user. For example, we use the PostgreSQL primary key ID of the user as a string. It just means we can know if the same user make several API requests but we can never know who that user is. Google Analytics uses it to "lump" requests together. This field is optional.

Next we need to pass information about the hit and the "content". This is important. Especially the "Hit type" because this is where you make your manually server-side tracking act as if the user had clicked around on the website with a browser.

"Hit type"

Set this to t=pageview and it'll show up Google Analytics as if the user had just navigated to the URL in her browser. It's kinda weird to do this because clearly the user hasn't. Most likely she's used curl or something from the command line. So it's not really a pageview but, on our end, we have "views" in the webserver that produce information to the user. Some of it is HTML and some of it is JSON, in terms of output format, but either way they're sending us a URL and we respond with data.

"Document location URL"

The full absolute URL of that was used. E.g. https://www.example.com/page?foo=bar. So in our Django app we set this to dl=request.build_absolute_uri(). If you have a site where you might have multiple domains in use but want to collect them all under just 1 specific domain you need to set dh=example.com.

"Document Host Name" and "Document Path"

I actually don't know what the point of this is if you've already set the "Document location URL".

"Document Title"

In Google Analytics you can view your Content Drilldown by title instead of by URL path. In our case we set this to a string we know from the internal Python class that is used to make the API endpoint. dt='API (%s)'%api_model.__class__.__name__.

There are many more things you can set, such as the clients IP, the user agent, timings, exceptions. We chose to NOT include the user's IP. If people using the JavaScript version of Google Analytics can set their browser to NOT include the IP, we should respect that. Also, it's rarely interesting to see where the requests for a web API because it's often servers' curl or requests that makes the query, not the human.

Sample implementation

Going back to the code example mentioned above, let's demonstrate a fuller example:

import urlparse
from raven.transport.threaded_requests import ThreadedRequestsHTTPTransport

transporter = ThreadedRequestsHTTPTransport(
    urlparse.urlparse('https://ssl.google-analytics.com/collect'),
    timeout=5
)

# Remember, this is a Django, but you get the idea

domain = settings.GOOGLE_ANALYTICS_DOMAIN
if not domain or domain == 'auto':
    domain = RequestSite(request).domain

params = {
    'v': 1,
    'tid': settings.GOOGLE_ANALYTICS_ID,
    'dh': domain,
    't': 'pageview,
    'ds': 'api',
    'cid': uuid.uuid4().hext,
    'dp': request.path,
    'dl': request.build_request_uri(),
    'dt': 'API ({})'.format(model_class.__class__.__name__),
    'ua': request.META.get('HTTP_USER_AGENT'),
}

def success_cb():
    logger.info('Successfully informed Google Analytics (%s)', params)

def failure_cb(exception):
    logger.exception(exception)

transporter.async_send(
    params,
    headers,
    success_cb,
    failure_cb
)

How to unit test this

The class we're using, ThreadedRequestsHTTPTransport has, as you might have seen, a method called async_send. There's also one, with the exact same signature, called sync_send which does the same thing but in a blocking fashion. So you could make your code look someting silly like this:

def send_tracking(page_title, request, async=True):
    # ...same as example above but wrapped in a function...
    function = async and transporter.async_send or transporter.sync_send
    function(
        params,
        headers,
        success_cb,
        failure_cb
    )

And then in your tests you pass in async=False instead.
But don't do that. The code shouldn't be sub-serviant to the tests (unless it's for the sake of splitting up monster-long functions).
Instead, I recommend you mock the inner workings of that ThreadedRequestsHTTPTransport class so you can make the whole operation synchronous. For example...

import mock
from django.test import TestCase
from django.test.client import RequestFactory

from where.you.have import pageview_tracking


class TestTracking(TestCase):

    @mock.patch('raven.transport.threaded_requests.AsyncWorker')
    @mock.patch('requests.post')
    def test_pageview_tracking(self, rpost, aw):

        def mocked_queue(function, data, headers, success_cb, failure_cb):
            function(data, headers, success_cb, failure_cb)

        aw().queue.side_effect = mocked_queue

        request = RequestFactory().get('/some/page')
        with self.settings(GOOGLE_ANALYTICS_ID='XYZ-123'):
            pageview_tracking('Test page', request)

            # Now we can assert that 'requests.post' was called.
            # Left as an exercise to the reader :)
            print rpost.mock_calls       

This is synchronous now and works great. It's not finished. You might want to write a side effect for the requests.post so you can have better control of that post. That'll also give you a chance to potentially NOT return a 200 OK and make sure that your failure_cb callback function gets called.

How to manually test this

One thing I was very curious about when I started was to see how it worked if you really ran this for reals but without polluting your real Google Analytics account. For that I built a second little web server on the side, whose address I used instead of https://ssl.google-analytics.com/collect. So, change your code so that https://ssl.google-analytics.com/collect is not hardcoded but a variable you can change locally. Change it to http://localhost:5000/ and start this little Flask server:

import time
import random
from flask import Flask, abort, request

app = Flask(__name__)
app.debug = True

@app.route("/", methods=['GET', 'POST'])
def hello():
    print "- " * 40
    print request.method, request.path
    print "ARGS:", request.args
    print "FORM:", request.form
    print "DATA:", repr(request.data)
    if request.args.get('sleep'):
        sec = int(request.args['sleep'])
        print "** Sleeping for", sec, "seconds"
        time.sleep(sec)
        print "** Done sleeping."
    if random.randint(1, 5) == 1:
        abort(500)
    elif random.randint(1, 5) == 1:
        # really get it stuck now
        time.sleep(20)
    return "OK"

if __name__ == "__main__":
    app.run()

Now you get an insight into what gets posted and you can pretend that it's slow to respond. Also, you can get an insight into how your app behaves when this collection destination throws a 5xx error.

How to really test it

Google Analytics is tricky to test in that they collect all the stuff they collect then they take their time to process it and it then shows up the next day as stats. But, there's a hack! You can go into your Google Analytics account and click "Real-Time" -> "Overview" and you should see hits coming in as you're testing this. Obviously you don't want to do this on your real production account, but perhaps you have a stage/dev instance you can use. Or, just be patient :)

How to no-mincss links with django-pipeline

03 February 2016 2 comments   Python, Web development, Django


This might be the kind of problem only I have, but I thought I'd share in case others are in a similar pickle.

Warming Up

First of all, the way my personal site works is that every rendered page gets cached as rendered HTML. Midway, storing the rendered page in the cache, an optimization transformation happens. It basically takes HTML like this:

<html>
<link rel="stylesheet" href="vendor.css">
<link rel="stylesheet" href="stuff.css">
<body>...</body>
</html>

into this:

<html>
<style>
/* optimized contents of vendor.css and stuff.css minified */
</style>
<body>...</body>
</html>

Just right-click and "View Page Source" and you'll see.

When it does this it also filters out CSS selectors in those .css files that aren't actually used in the rendered HTML. This makes the inlined CSS much smaller. Especially since so much of the CSS comes from a CSS framework.

However, there are certain .css files that have references to selectors that aren't in the generated HTML but are needed later when some JavaScript changes the DOM based on AJAX or user actions. For example, the CSS used by the Autocompeter widget. The program that does this CSS optimization transformation is called mincss and it has a feature where you can tell it to NOT bother with certain CSS selectors (using a CSS comment) or certain <link> tags entirely. It looks like this:

<link rel="stylesheet" href="ajaxstuff.css" data-mincss="no">

Where Does django-pipeline Come In?

So, setting that data-mincss="no" isn't easy when you use django-pipeline because you don't write <link ... in your Django templates, you write {% stylesheet 'name-of-bundle %}. So, how do you get it in?

Well, first let's define the bundle. In my case it looks like this:

PIPELINE_CSS = {
  ...
  # Bundle of CSS that strictly isn't needed at pure HTML render-time
  'base_dynamic': {
        'source_filenames': (
            'css/transition.css',
            'autocompeter/autocompeter.min.css',
        ),
        'extra_context': {
            'no_mincss': True,
        },
        'output_filename': 'css/base-dynamic.min.css',
    },
    ...
}

But that isn't enough. Next, I need to override how django-pipeline turn that block into a <link ...> tag. To do that, you need to create a directory and file called pipeline/css.html (or pipeline/css.jinja if you use Jinja rendering by default).

So take the default one from inside the pipeline package and copy it into your project into one of your apps's templates directory. For example, in my case, peterbecom/apps/base/templates/pipeline/css.jinja. Then, in that template add at the very end somehting like this:

{% if no_mincss %} data-mincss="no"{% endif %} />

The Point?

The point is that if you're in a similar situation where you want django-pipeline to output the <link> or <script> tag differently than it's capable of, by default, then this is a good example of that.

Headsupper.io

05 December 2015 0 comments   Python, Web development, Django, Javascript, ReactJS


tl;dr

Headsupper.io is a free GitHub webhook service that emails people when commits have the configurable keyword "headsup" in it.

Introduction

Headsupper.io is great for when you have a GitHub project with multiple people working on it and when you make a commit you want to notify other people by email.

Basically, you set up a GitHub Webhook, on pushes, to push to https://headsupper.io and then it'll parse the incoming push and its commits and look for certain things in the commit message. By default, it'll look for the word "headsup". For example, a git commit message might look like this:

fixes #123 - more juice in the Saab headsup! will require updating

Or you can use the multi-line approach where the first line is short and sweat and after the break a bit more elaborate:

bug 1234567 - tea kettle upgrade 2.1

Headsup: Next time you git pull from master, remember to run 
peep install on the requirements.txt file since this commit 
introduces a bunch of crazt dependency changes.

Git commits that come through that don't have any match on this word will simply be ignored by Headsupper.

How you use it

Maybe paradoxically, you need to authenticate with your GitHub account but that's in read-only mode and does NOT set up the Webhook for you. The reason you have to authenticate to prepare a configuration on headsupper.io is to tie the configuration to a real user.

Once you've authenticated you get the option to create your first configuration, then you have to enter at least these three piece of information:

  1. The GitHub "full name". This is the org name, slash, repo name. E.g. peterbe/django-peterbecom or mozilla/socorro.
  2. Pick a secret. Remember what you typed, because you'll need to type in this same secret when you set up the Webhook on your GitHub project's Webhooks page. (This is used to checksum and verify the source of the Webhook push)
  3. Who to send to. A list of email addresses separated with a newline or a semi-colon.

Once you've set that up, you'll need to go to your GitHub project's Setting page and enter a new Webhook and the URL you need to type in is https://headsupper.io and for the "Secret" type in that secret you used earlier. That's it!

Rules and options

The word that triggers is configurable by you. The default is headsupper. And by default, it's case insensitive. You can change that so it's case sensitive. Also, the word has to be word delimited on the left (e.g. a space or a newline character) and on the right it needs to be a space, a : or a !. So this won't match: theheadsup: or headsupper.

Other optional things you can configure are:

That last option, Only send when a new tag is created, is interesting. I added that option because at work, we make production server releases by pushing a git tag. When a tag is pushed, all those commits are sent to the continuous deployment service which makes a server upgrade. This means you get a chance to enter a heads up message to be emailed to the people who care about new deployments going out.

How it was built

It's a mix between Django and ReactJS. The whole client-side app it built statically with Webpack in ES6. It's served as static files through Nginx. But Nginx is making an exception on all URLs that start with /api or /accounts. The /api/* it used for loading and setting JSON. The /accounts/* is used for the GitHub OAuth endpoints.

What's interesting about this the architecture is that it's using HTTP cookies. Not API tokens. Cookies are quite good in that they're established and the browser does all the automated work of keeping it secure and making each request potentially authenticated.

Here's the relevant React code and here's the relevant Django code that processes the Webhook.

The whole project is available on: https://github.com/peterbe/headsupper.

Also, I made a demo at the November Mozilla Beer and Tell.

Django forms and making datetime inputs localized

04 December 2015 2 comments   Python, Django


tl;dr

To change from one timezone aware datetime to another, turn it into a naive datetime and then use pytz's localize() method to convert it back to the timezone you want it to be.

Introduction

Suppose you have a Django form where you allow people to enter a date, e.g. 2015-06-04 13:00. You have to save it timezone aware, because you have settings.USE_TZ on and it's just many times to store things in timezone aware dates.

By default, if you have settings.USE_TZ and no timezone information is in the string that the django.form.fields.DateTimeField parses, it will use settings.TIME_ZONE and that timezone might be different from what it really should be. For example, in my case, I have an app where you can upload a CSV file full of information about events. These events belong to a venue which I have in the database. Every venue has a timezone, e.g. Europe/Berlin or US/Pacific. So if someone uploads a CSV file for the Berlin location 2015-06-04 13:00 means 13:00 o'clock in Berlin. I don't care where the server is hosted and what its settings.TIME_ZONE is. I need to make that input timezone aware specifically for Berlin/Europe.

Examples

Suppose you have settings.TIME_ZONE == 'US/Pacific' and you let the django.form.fields.DateTimeField do its magic you get something you don't want:

>>> from django.conf import settings
>>> settings.TIME_ZONE
'US/Pacific'
>>> assert settings.USE_TZ
>>> from django.forms.fields import DateTimeField
>>> DateTimeField().clean('2015-06-04 13:00')
datetime.datetime(2015, 6, 4, 13, 0, tzinfo=<DstTzInfo 'US/Pacific' PDT-1 day, 17:00:00 DST>)

See! That's wrong. Sort of. Not Django's fault. What I need to do is to convert that datetime object into one that is timezone aware on the Europe/Berlin timezone.

In old versions of pytz, specifically <=2014.2 you could do this:

>>> import pytz
>>> pytz.VERSION
'2014.2'
>>> from django.forms.fields import DateTimeField
>>> date = DateTimeField().clean('2015-06-04 13:00')
>>> date
datetime.datetime(2015, 6, 4, 13, 0, tzinfo=<DstTzInfo 'US/Pacific' PDT-1 day, 17:00:00 DST>)
>>> date.replace(tzinfo=tz)
datetime.datetime(2015, 6, 4, 13, 0, tzinfo=<DstTzInfo 'Europe/Berlin' CET+1:00:00 STD>)

But in modern versions of pytz you can't do that because if you don't use the pytz.timezone instance to localize it will use the default version which might be one of those crazy "Local Mean Time" which they used a 100 years ago. E.g.

>>> import pytz
>>> pytz.VERSION
'2015.7'
>>> from django.forms.fields import DateTimeField
>>> date = DateTimeField().clean('2015-06-04 13:00')
>>> tz = pytz.timezone('Europe/Berlin')
>>> date.replace(tzinfo=tz)
datetime.datetime(2015, 6, 4, 13, 0, tzinfo=<DstTzInfo 'Europe/Berlin' LMT+0:53:00 STD>)

See, it's that crazy LMT+0:53:00 that's oft talked of on Stackoverflow!

Here's the trick

The trick is to use pytz.timezone(MY TIME ZONE NAME).localize(MY NAIVE DATETIME OBJECT). When you use the .localize() method pytz can use the date to make sure it uses the right conversion for that named timezone.

And in the case of our overly smart django.form.fields.DateTimeField it means we need to convert it back into a naive datetime object and then localize it.

>>> import pytz
>>> pytz.VERSION
'2015.7'
>>> from django.forms.fields import DateTimeField
>>> date = DateTimeField().clean('2015-06-04 13:00')
>>> date = date.replace(tzinfo=None)
>>> date
datetime.datetime(2015, 6, 4, 13, 0)
>>> tz = pytz.timezone('Europe/Berlin')
>>> tz.localize(date)
datetime.datetime(2015, 6, 4, 13, 0, tzinfo=<DstTzInfo 'Europe/Berlin' CEST+2:00:00 DST>)

That was much harder than it needed to be. Timezones are hard. Especially when you have the human element of people typing in things and just, rightfully, expect the system to figure it out and get it right.

I hope this helps the next schmuck who has/had to set aside an hour to figure this out.

django-pipeline + django-jinja

04 October 2015 2 comments   Django


Do you have django-jinja in your Django 1.8 project to help you with your Jinja2 integration, and you use django-pipeline for your static assets?
If so, you need to tie them together by passing pipeline.templatetags.ext.PipelineExtension "to your Jinja2 environment". But how? Here's how:

# in your settings.py


from django_jinja.builtins import DEFAULT_EXTENSIONS

TEMPLATES = [
    {
        'BACKEND': 'django_jinja.backend.Jinja2',
        'APP_DIRS': True,
        'OPTIONS': {
            'match_extension': '.jinja',
            'context_processors': [
                ...
            ],
            'extensions': DEFAULT_EXTENSIONS + [
                'pipeline.templatetags.ext.PipelineExtension',
            ],
        }
    },
    ...

Now, in your template you simply use the {% stylesheet '...' %} or {% javascript '...' %} tags in your .jinja templates without the {% load pipeline %} stuff.

It took me a little while to figure that out so I hope it helps someone else googling around for a solution alike.