(For context, I released last week and now I'm thinking about improvements)

I posted a question on Twitter about which highlighting formatting people prefer and got some interesting feedback. More about that later.

The piece of feedback that really got my attention came from my friend Honza Král.
He wondered whether the whole word should be highlighted instead of just the beginning of the word.

I've actually been thinking about that too but never got around to trying it out. Until now.





What do you think?

I have the code in a branch and I'm still mulling it over. There's sort of a convention to just highlight based on what you've typed so far. I don't want to be too weird, because when something doesn't feel familiar people don't like what they see, even if the new thing actually is better.

In airmozilla the tests almost all derive from one base class whose tearDown deletes the automatically generated settings.MEDIA_ROOT directory and everything in it.

Then there's some code that makes sure a certain thing from the fixtures has a picture uploaded to it.

That means it has to do that shutil.rmtree(directory) and that shutil.copy(src, dst) on almost every single test. Some tests might not need or depend on it but it's convenient to put it here.

Anyway, I thought this is all a bit excessive and I could probably optimize that by defining a custom test runner that is first responsible for creating a clean settings.MEDIA_ROOT with the necessary file in it and secondly, when the test suite ends, it deletes the directory.

But before I write that, let's measure how many gazillion milliseconds this is chewing up.

Basically, the tearDown was called 361 times and the _upload_media 281 times. In total, this adds up to a whopping 0.21 seconds! (Out of the 69.133 seconds it takes to run the whole thing.)

I think I'll cancel that optimization idea. Doing some light shutil operations is dirt cheap.
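If you want to repeat that kind of measurement outside a real test suite, here is a minimal sketch. It only mimics the pattern described above (copy one picture in, then delete the directory) and times it. The function name, the fake 50 KB picture and the temporary directory are all made up for illustration; the real tests operate on settings.MEDIA_ROOT.

```python
import shutil
import tempfile
import time
from pathlib import Path


def time_media_root_churn(repeats=361):
    """Time `repeats` rounds of copy-a-picture-in followed by rmtree."""
    workspace = Path(tempfile.mkdtemp())
    source = workspace / "picture.jpg"  # hypothetical fixture picture
    source.write_bytes(b"\xff\xd8" + b"0" * 50_000)  # roughly 50 KB of fake data
    media_root = workspace / "media_root"  # stand-in for settings.MEDIA_ROOT

    elapsed = 0.0
    for _ in range(repeats):
        media_root.mkdir()
        start = time.perf_counter()
        shutil.copy(source, media_root / source.name)  # the "upload" step
        shutil.rmtree(media_root)  # the tearDown step
        elapsed += time.perf_counter() - start

    shutil.rmtree(workspace)  # clean up the scratch directory
    return elapsed


if __name__ == "__main__":
    print("total seconds spent in shutil: %.3f" % time_media_root_churn())
```

The absolute number will depend on your disk, but it gives a feel for whether the custom test runner is worth writing at all.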

From the It-Depends-on-What-You're-Building department.

As a web developer you have a job:

  1. Display a certain amount of database data on the screen
  2. Do it as fast as possible

The first point is easily taken care of these days with the likes of Django or Rails, which make it über easy to write queries that you then use in templates to generate the HTML and voila, you have a web page.

The second point is taken care of with a myriad of techniques. It's almost a paradox. The fastest way to render something on the screen is to generate everything on the server and send it whole. It means the browser can very quickly (and boosted by GPU) render something on the screen. But if you have a lot of data that needs to be displayed it's often better to send just a little bit of HTML and then let some Javascript kick in and take care of fetching the rest of the information using AJAX.

Here I have prepared three different versions of ways to display a bunch of information on the screen:

Visual comparison on WebPagetest
What you should note and take away from this little experimental playground:

  1. All server-side work is done in Django but it's served straight out of memcache so it should be fast server-side.

  2. The content is NOT important. It's just a list of blog posts and their categories and keywords.

  3. To make it somewhat realistic, each version needs to 1) display a JPG and 2) have a Javascript onclick event that throws a confirm() dialog box.

  4. The AngularJS version loads significantly slower but it's not because AngularJS is slow, but because it's able to do so much more later. Loading a Javascript framework is like an investment. Big cost upfront and small cost later when you need more magic to happen without having a complete server refresh.

  5. Views 1, 2 and 3 are all imperfect versions but they illustrate the three major groups of solutions to the problem stated at the top of this blog post. The other views are attempts at optimization.

  6. Clearly the "visually fastest" version is optimization version 5, which is a fork of version 2. It renders everything above the fold on the server-side and then takes care of the content below the fold with AJAX.
    See this visual comparison

  7. Optimization version 4 was a silly optimization. It depends on the fact that JSON is more "compact" than HTML. When you Gzip the content, the difference in size doesn't matter anymore. However, it's an interesting technique because it means you can do all business logic rendering stuff in one language without having to depend on AJAX.

  8. Open the various versions in your browser and try to "feel" how the pages load. Ask your inner guttural heart which version you prefer; do you prefer a completely blank screen and a browser loading spinner or do you prefer to see some skeleton structure first whilst waiting for the bulk content to come in?

  9. See this as a basis of thoughts and demonstration. Remember the very first sentence in this blog post.

tl;dr Don't run ffmpeg over HTTP(S) and use ffmpegthumbnailer

UPDATE tl;dr Download the file then run ffmpeg with -ss HH:MM:SS first. Don't bother with ffmpegthumbnailer

At work I work on something called Air Mozilla. It's a site for hosting live video broadcasts and then archiving those so they can be retrieved later.

Unlike sites like YouTube we can't take a screencap from the video because many videos are future (aka. "upcoming") videos so instead we use a little placeholder thumbnail (for example, the Rust logo).

However, once it has been recorded we want to switch from the logo to an actual screen capture from the video itself. We set up a cronjob that uses ffmpeg to extract these as JPGs and then the users can go in and select whichever picture they like the best.

This is all work in progress by the way (as of December 2014).

One problem we have is that the command for extracting JPGs is really slow. So slow that we can't wrap the subprocess in a Django database connection, because the connection is often killed before the command finishes.

The command to extract them looks something like this:

ffmpeg -i -r 0.0143 /tmp/screencaps-%02d.jpg

Where the number r is based on the duration and how many pictures we want out. E.g. 0.0143 = 15 / 1049 where 15 is how many JPGs we want and 1049 is the duration in seconds (17 minutes and 29 seconds).
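To make that arithmetic concrete, here is a tiny sketch of how the -r value can be worked out. The function name is made up; the numbers are the ones from the example above.

```python
def frames_per_second(duration_seconds, wanted_pictures):
    """Rate to pass to ffmpeg's -r so that roughly `wanted_pictures` come out."""
    return wanted_pictures / duration_seconds


duration = 17 * 60 + 29  # 17 minutes and 29 seconds
rate = frames_per_second(duration, 15)
print(duration, "%.4f" % rate)  # prints: 1049 0.0143
```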

The script I used first was:

My first experiment was to try to extract one picture at a time, hoping that way, internally, ffmpeg might be able to optimize something.

The second script I used was:

The third alternative was to try ffmpegthumbnailer which is an intricate wrapper on ffmpeg and it has the benefit that you can produce slightly higher picture quality too.

The third script I used was:

Bar chart comparing the 3 different scripts
And running these three depends very much on the state of my DSL at the time.

For a video clip that is 17 minutes long and a 138Mb mp4 file:
  first script    2m0.847s
  second script   11m46.734s
  third script    0m29.780s

Clearly it's not efficient to do one screenshot at a time.
Because with ffmpegthumbnailer you can tell it not to reduce the picture quality, the total weight of the produced JPGs from the first script was 784Kb and the total weight from the third script was 1.5Mb.

Just to try again, I ran a similar experiment with a 35-minute, 890Mb mp4 file. And this time I didn't bother with the second script. The results were:
  first script   18m21.330s
  third script   2m48.656s

So that means that using ffmpegthumbnailer is about 5 times faster than ffmpeg. Huge difference!

And now, a curveball!

The reason for doing ffmpeg -i https://... was so that we don't have to first download the whole beast and run the command on a local file. However, in light of how much longer this takes, and my disdain for having to install and depend on a new tool (ffmpegthumbnailer) across all servers, why not download the whole file and run the ffmpeg command locally?

So I downloaded the file, which was slow because of my, currently, terrible home DSL. Then I ran and timed them again, but against a local file instead:   0m20.426s   0m0.635s

Did you see that!? That's an insane difference. Clearly doing this command over HTTP(S) is a bad idea. It'll be worth downloading it first.


On Stackoverflow, LordNeckBeard gave a great tip: put the -ss option before the input file. Now it's much faster. At this point I'm no longer interested in having to bother with ffmpegthumbnailer.
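To illustrate the tip, here is a rough sketch of how the commands could be built, one per picture, with -ss placed before -i so ffmpeg seeks in the input instead of decoding its way there. This is not the actual script from this post. The helper names, the evenly spaced timestamps and the use of -vframes 1 and -y are my assumptions; only the -ss HH:MM:SS before the input file comes from the tip itself.

```python
import subprocess


def hhmmss(seconds):
    """Format a number of seconds as HH:MM:SS."""
    seconds = int(seconds)
    return "%02d:%02d:%02d" % (seconds // 3600, seconds % 3600 // 60, seconds % 60)


def screencap_commands(video_path, duration_seconds, wanted_pictures, out_pattern):
    """Build one ffmpeg command per picture, each with -ss BEFORE -i."""
    commands = []
    step = duration_seconds / (wanted_pictures + 1)  # evenly spaced (assumption)
    for index in range(1, wanted_pictures + 1):
        commands.append([
            "ffmpeg", "-y",
            "-ss", hhmmss(step * index),  # seek first, then open the input
            "-i", video_path,
            "-vframes", "1",  # grab a single frame
            out_pattern % index,
        ])
    return commands


def run_all(commands):
    """Run every command; raises CalledProcessError if ffmpeg fails."""
    for command in commands:
        subprocess.check_call(command)
```

For the 17 minute clip, screencap_commands("/tmp/video.mp4", 1049, 15, "/tmp/screencaps-%02d.jpg") gives 15 commands, which you would then pass to run_all() on a machine that has ffmpeg installed.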

Let's fork the ffmpeg script into two versions. The first is the same as before but uses a downloaded file instead of a remote HTTPS URL. The second is the same as the first, except we put the -ss HH:MM:SS before the input file.

Now, let's run them again on the 138Mb file:

# the 138Mb mp4.mp4 file   2m10.898s   0m0.672s

About 195 times faster

And I re-ran this against a bigger file that is 1.4Gb:

# the 1.4Gb mp4-1.44Gb.mp4 file   10m1.143s   0m1.428s

420 times faster

Here’s an example of unescaped & characters in an A HREF attribute. It’s working fine.

I know it might break XML and possibly XHTML but who uses that still?

Red. So what?
And I know an unescaped & in a href shows as red in the View Source color highlighting.

What can go wrong? Why is it important? Perhaps it used to be in 2009 but that's no longer the case.

This all started because I was reviewing some code that uses Python's urllib.urlencode(...) and inserts the result into a Django template with href="{{ result_of_that_urlencode }}", which would mean you get un-escaped & characters. I then tried to find out how and why that is bad but couldn't find any examples of it.
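For reference, this is roughly what the two variants look like. A small sketch using Python 3's urllib.parse.urlencode (the Python 2 spelling is urllib.urlencode) and html.escape; the /search path and the query parameters are made up.

```python
from html import escape
from urllib.parse import urlencode

# Made-up query parameters, just to get an & into the string
query = urlencode([("q", "test search"), ("page", 2)])

unescaped_link = '<a href="/search?%s">results</a>' % query
escaped_link = '<a href="/search?%s">results</a>' % escape(query)

print(unescaped_link)  # <a href="/search?q=test+search&page=2">results</a>
print(escaped_link)    # <a href="/search?q=test+search&amp;page=2">results</a>
```

Browsers follow both links to the same URL. The second one is just the form a strict validator prefers.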

In action
A couple of weeks ago we accidentally broke our production server (for a particular report) because of broken HTML. It was an unclosed tag which rendered everything after that tag as just plain white. Our comprehensive test suite failed to notice it because it didn't look at details like that. And when it was tested manually we simply missed the conditional situation that caused it. Neither is a good excuse. So it got me thinking: how can we incorporate HTML (html5 in particular) validation into our test suite?

So I wrote a little gist and used it a bit on a couple of projects and was quite pleased with the results. But I thought this might be something worthwhile to keep around for future projects or for other people who can't just copy-n-paste a gist.

With that in mind I put together a little package with a README and a and now you can use it too.

There are however some caveats. Especially if you intend to run it as part of your test suite.

Caveat number 1

You can't flood Well, you can I guess. It would be really evil of you and kittens will die. If you have a test suite that does things like response = self.client.get(reverse('myapp:myview')) and there are many tests you might be causing an obscene amount of HTTP traffic to them. Which brings us on to...

Caveat number 2

The site is written in Java and it's open source. You can basically download their validator and point django-html-validator to it locally. Basically the way it works is java -jar vnu.jar myfile.html. However, it's slow. Like really slow. It takes about 2 seconds to run just one modest HTML file. So, you need to be patient.
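If you just want to see how slow the local validator is on your machine, here is a minimal sketch that wraps the java -jar vnu.jar myfile.html command from above in a subprocess call and times it. This is not how django-html-validator does it internally; the function names are made up, and how vnu.jar reports problems (exit code, stdout or stderr) is not something I'm asserting here, so the sketch simply hands back everything it gets.

```python
import subprocess
import time


def vnu_command(html_path, jar_path="vnu.jar"):
    """The command line described above: java -jar vnu.jar myfile.html"""
    return ["java", "-jar", jar_path, html_path]


def validate(html_path, jar_path="vnu.jar"):
    """Run the validator and return (exit code, combined output, seconds taken)."""
    start = time.perf_counter()
    result = subprocess.run(
        vnu_command(html_path, jar_path),
        capture_output=True,
        text=True,
    )
    elapsed = time.perf_counter() - start
    return result.returncode, result.stdout + result.stderr, elapsed
```

Calling validate("myfile.html") needs Java and a downloaded vnu.jar in the current directory.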

I just rolled out a change here on my personal blog which I hope will make my few visitors happy.

Basically; when you hover over a link (local link) long enough it prefetches it (with AJAX) so that if you do click it's hopefully already cached in your browser.

If you hover over a link and almost instantly hover out it cancels the prefetching. The assumption here is that if you deliberately put your mouse cursor over a link and proceed to click on it you want to go there. Because your hand is relatively slow I'm using the opportunity to prefetch it even before you have clicked. Some hands are quicker than others so it's not going to help for the really quick clickers.

What I also had to do was set a Cache-Control header of 1 hour on every page so that the browser can learn to cache it.
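How you set that header depends on your stack. As a framework-neutral sketch (this is not the code running this blog), here is a tiny WSGI middleware that stamps a one hour Cache-Control header on every response. The names cache_for_an_hour and demo_app are made up for the example.

```python
def cache_for_an_hour(app, max_age=3600):
    """Wrap a WSGI app so every response carries Cache-Control: max-age=3600."""
    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Drop any existing Cache-Control header, then add ours
            headers = [(k, v) for k, v in headers if k.lower() != "cache-control"]
            headers.append(("Cache-Control", "max-age=%d" % max_age))
            return start_response(status, headers, exc_info)
        return app(environ, patched_start_response)
    return wrapped


def demo_app(environ, start_response):
    """A stand-in application that returns a tiny HTML page."""
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<p>hello</p>"]


application = cache_for_an_hour(demo_app)
```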

The effect is that when you do finally click the link, by the time your browser loads it and changes the rendered output, it'll hopefully be able to render it from its cache and thus it becomes visually ready faster.

Let's try to demonstrate this with this horrible animated gif:
(or download the file)

1. Hover over a link (in this case the "Now I have a Gmail account" from 2004)
2. Notice how the Network panel preloads it
3. Click it after a slight human delay
4. Notice that when the clicked page is loaded, it's served from the browser cache
5. Profit!

So the code that does this is quite simple:

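$(function() {
  var prefetched = [];
  var prefetch_timer = null;
  $('div.navbar, div.content').on('mouseover', 'a', function(e) {
    // assumption: the value is the hovered link's href attribute
    var value = $(this).attr('href');
    if (value && value.indexOf('/') === 0) {
      // only local links, and only ones not already prefetched
      if (prefetched.indexOf(value) === -1) {
        if (prefetch_timer) {
          clearTimeout(prefetch_timer);
        }
        prefetch_timer = setTimeout(function() {
          $.get(value, function() {
            // necessary for $.ajax to start the request :(
          });
          prefetched.push(value);
        }, 200);
      }
    }
  }).on('mouseout', 'a', function(e) {
    // hovering out quickly cancels the pending prefetch
    if (prefetch_timer) {
      clearTimeout(prefetch_timer);
    }
  });
});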

Also, available on GitHub.

I'm excited about this change because of a couple of reasons:

  1. On mobile, where you might be on a non-wifi data connection you don't want this. There you don't have the mouse event onmouseover triggering. So people on such devices don't "suffer" from this optimization.
  2. It only downloads the HTML, which is quite light compared to static assets such as pictures, but it warms up the server-side cache if need be.
  3. It's much more targeted than a general prefetch meta header.
  4. Most likely content will appear rendered to your eyes faster.

MozTrap is what's called a "test case management system". Basically, software QA people need a structure and pattern to their testing. What to test, what versions to test on and what hardware/operating system etc. are all part of a "test suite". That's what MozTrap manages.

So this project was built by Mozilla's automation and tools team. It is currently not an actively developed project. Not because it's not needed or used but because it basically maps all the features we need. A large part of the code base was originally written by a personal friend of mine who I respect wholeheartedly; Carl Meyer of Django/pip/virtualenv/etc fame. I'm grateful for the awesome documentation he left behind amongst many other things.

Together with the team we sat down and listed all the biggest pain points as of today. Basically, the number one thing is speed. Pages load too slowly. Normally when web developers worry about web performance it's a matter of shaving milliseconds off a page where a client's perception equals lost or gained profits. Here it's not a problem of milliseconds but a problem of seconds. After some quick poking around on the production site and looking at some code the conclusion is simple: The site is so darn slow because the HTML sent from the server is way too MASSIVE. And baked into that is a mixture of the poor web server having to produce a massive HTML blob and that blob being sent over the wire.

One test run I made said it took 14 seconds to render a certain page.

Why is it so slow?

So how did this happen and why is it not Carl's fault? :) The reason it happened was because of the underestimated number of options added to the advanced filtering drop-downs. On a local dev version you never notice these things because you set up some options, for example tags, and the drop-down never gets larger than 10-20 options. For example, the "Creator" drop-down today has 1,664 different choices.

If you take all those choices and turn them into HTML like this: <option value="1">Adam</option>\n<option value="2">Bram</option>... etc. you get 66Kb of just HTML. However, MozTrap doesn't work like that. Instead it uses pretty drop-downs that don't look like regular HTML drop-downs. See for yourself; go to the site and click the "Advanced Filtering" button.
So, that means that the HTML for each option instead looks like this:

<li class="filter-item">
  <input name="filter-creator" data-name="creator" value="1" id="id-filter-creator-1" class="check" type="checkbox">
  <span class="onoff">
    <label for="id-filter-creator-1" class="onoffswitch">Adam</label>
    <span class="pinswitch"></span>
    <span class="content" title="creator: Adam">Adam</span>
  </span>
</li>

Now you get 620Kb of just HTML just for the "Creators" field. Granted, that is the biggest field of all the drop-downs but lots of them are massive.

So, this makes the page weigh a total of about 1.1Mb just for the HTML. Not only is it a lot of work for the (Django) server to generate this but it's also a heck of a lot of data to send across the Internet on every page request.

So what was the solution?

An ideal solution would have been a significant re-write whereby much of the values of the page gets rewritten as later AJAX calls. I.e. load a skeleton that loads superfast, and then load some AJAX in the background. That AJAX could potentially be cached in the browser with localStorage or something so that you get something to show very quickly whilst you wait for the AJAX request to complete. But this would have been too big a change and the way the filtering works on these pages, you actually need all the options in the drop-downs on immediate load because of the way "pinned filters" work.

So the solution was to replace all the repeated HTML chunks with 1 JSON string and then a piece of Javascript template rendering. So, in the Django template code instead of:

{% for field in filters %}
  {% include "lists/_filter_group.html" with advanced=1 prefix="filter" pinable=1 %}
{% endfor %}

We now replace this with:

var FILTERS = {% filterset_to_json filters with advanced=1 prefix="filter" pinable=1 %}
<script id="filter_group" type="text/html">
<section class="filter-group {{ field.cls }}" data-name="{{ field.key }}">
  <h5 class="category-title">
    {{ _field_name_lower }}
    {{# field.switchable }}

That is basically some Mustache code that I use to generate the HTML DOM nodes and insert them into the page after load.
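To get a feel for why this packs so much better, here is a rough sketch that compares the two representations for a made-up list of 1,664 options. The per-option HTML is modelled on the filter-item snippet above. The names ("User 1", "User 2", ...) and the JSON shape are invented for the example and are not what filterset_to_json actually emits.

```python
import json

# Per-option markup, modelled on the filter-item snippet shown earlier
OPTION_HTML = (
    '<li class="filter-item">\n'
    '  <input name="filter-creator" data-name="creator" value="{id}" '
    'id="id-filter-creator-{id}" class="check" type="checkbox">\n'
    '  <span class="onoff">\n'
    '    <label for="id-filter-creator-{id}" class="onoffswitch">{name}</label>\n'
    '    <span class="pinswitch"></span>\n'
    '    <span class="content" title="creator: {name}">{name}</span>\n'
    '  </span>\n'
    '</li>\n'
)


def as_html(options):
    """Repeat the full markup for every option (names are not escaped here)."""
    return "".join(OPTION_HTML.format(id=o["id"], name=o["name"]) for o in options)


def as_json(options):
    """Send only the data; the markup is produced once, client-side."""
    return json.dumps(options, separators=(",", ":"))


# Hypothetical data: 1,664 made-up creators
options = [{"id": i, "name": "User %d" % i} for i in range(1, 1665)]
html_size = len(as_html(options))
json_size = len(as_json(options))
print("HTML: %d bytes, JSON: %d bytes" % (html_size, json_size))
```

The exact numbers depend on the names, but the point is that the repeated markup dominates, and with a client-side template you only pay for it once.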

In conclusion

So basically nothing changes. Nothing of the Django view had to change. Visually there's no difference and the same actual user data is sent from the server to the client but just packed in a more optimal way.

There are multiple pages where these massive "Advanced Filtering" options exist but on one page I measured the whole page went from weighing 1.1Mb down to 132Kb.

On Friday I did a Show HN and got featured on the front page for HTML Tree.

Google Analytics
Amazingly, out of the 3,858 visitors (according to Google Analytics today) 2,034 URLs were submitted and tested on the app. Clearly a lot of people just clicked the example submission but out of those, 1,634 were unique. Granted, some people submitted more than one URL but I think a large majority of people came up with a URL of their own to try. Isn't that amazing! What a turnout for a Friday afternoon hack (with some Sunday night hacking to make it into a decent looking website).

The lesson to learn here is that the Hacker News crowd is excellent for getting engagement. Yes, there is a lot of blather and there are almost repetitive submissions but by and large it's a very engaging community. Suck on that, those who make fun of HN!

I have now closed issue #2 on github-pr-triage. So, now you can have a dashboard of every GitHub project whose pull requests you care about.

The old format of using just 1 repo works too (e.g. /owner/project) and should hopefully not break anybody's bookmarks. The new format for having multiple repos across (possibly) multiple owners is like this:


See screenshot:

A couple of different projects

To set yours up, here's a running instance available on