Peterbe.com

A blog and website by Peter Bengtsson

This morning I came across this site on Hacker News. It's a cute site with some basic tips on how to make your sites faster.

It's very much a for-beginners document as all the tips are quite basic. For example it doesn't even mention the use of CDNs.

One tip in particular stood out to me: "it can be useful to minify your HTML with automated tools."
And it links to the htmlcompressor project. Ignore this advice.

What matters 10 times more is Gzip compression. This is usually very easy to set up with Nginx or Apache. It's not something you do in your web framework, and if you don't have a web framework, you don't need to manually Gzip HTML files on the filesystem.

For example, the home page here on my blog is, at the time of writing, 66,770 bytes. Hefty, sure, but with all excess whitespace removed it reduces down to 59,356 bytes. But that really doesn't matter when you Gzip.

Gzipped from original version: 18,470 bytes
Gzipped from whitespace trimmed version: 18,086 bytes

The gain is 2% which is definitely not worth the hassle of adding a whitespace compressor.
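
If you want to repeat the comparison on your own pages, here is a minimal sketch in Python. It is not the tool I used for the numbers above; the whitespace stripping is deliberately crude (it would mangle pre tags) and the function names are made up for illustration.

```python
import gzip
import re


def collapse_whitespace(html: str) -> str:
    """Crudely strip whitespace between tags and squeeze runs of whitespace.

    Assumption: the page has no <pre> or <textarea> content worth preserving.
    """
    html = re.sub(r">\s+<", "><", html)
    return re.sub(r"\s+", " ", html).strip()


def size_report(html: str) -> dict:
    """Return raw and gzipped byte counts, before and after trimming."""
    trimmed = collapse_whitespace(html)
    raw, small = html.encode("utf-8"), trimmed.encode("utf-8")
    return {
        "original": len(raw),
        "trimmed": len(small),
        "original_gzipped": len(gzip.compress(raw)),
        "trimmed_gzipped": len(gzip.compress(small)),
    }


if __name__ == "__main__":
    # Made-up sample page; substitute the HTML of a real page.
    sample = (
        "<html>\n  <body>\n"
        + "    <p>Hello   world</p>\n" * 200
        + "  </body>\n</html>\n"
    )
    for name, size in size_report(sample).items():
        print(f"{name}: {size:,} bytes")
```

The interesting comparison is between the two gzipped numbers, not between the two raw ones.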

If you use django-fancy-cache you can either run with stats or without. With stats, you can get a number of how many times a cache key "hits" and how many times it "misses". Keeping stats incurs a small performance slowdown. But how much?

I created a simple page that either keeps stats or ignores it. I ran the benchmark over Nginx and Gunicorn with 4 workers. The cache server is a memcached running on the same host (my OSX 10.7 laptop).

With stats:

Average: 768.6 requests/second
Median: 773.5 requests/second
Standard deviation: 14.0

Without stats:

Average: 808.4 requests/second
Median: 816.4 requests/second
Standard deviation: 30.0

That means, roughly, that running with stats is about 5 to 6% slower.

The stats are completely useless to your users. The stats tool is purely for your own curiosity and something you can switch on and off easily.

Note: This benchmark assumes that the memcached server is running on the same host as the Nginx and the Gunicorn server. If there were more network in between, all the .incr() commands would obviously cause more of a slowdown.

This personal blog site of mine uses django-fancy-cache and mincss.

What that means is that I can cache the whole output of every blog post for weeks and when I do that I can first preprocess the HTML and convert every external CSS into one inline STYLE block which will only reference selectors that are actually used.

To see it in action, right-click and select "View Page Source". You'll see something like this:

/*
Stats about using github.com/peterbe/mincss
-------------------------------------------
Requests:         1 (now: 0)
Before:           81Kb
After:            11Kb
After (minified): 11Kb
Saving:           70Kb
*/
section{display:block}html{font-size:100%;-webkit-text-size-adjust:100%;-ms-tex...

The reason the saving is so huge, in my case, is that I'm using the Twitter Bootstrap CSS framework which is awesome but, like any framework, it will inevitably contain a bunch of stuff that I don't use. Some stuff I don't use on any page at all. Some stuff is used only on some pages and some other stuff is used only on some other pages.

What I gain by this is faster page loads. What the browser does is that it gets a URL, downloads all the HTML, parses the HTML to look for referenced CSS (using the link tag) and downloads that too. Once all of that is downloaded, it starts to render the page. Approximately after that it starts to download all referenced Javascript and starts evaluating and executing that.

By not having to download the CSS the browser has one less thing to do. Only one request? Well, that request might be on a CDN (not a great idea actually) so even though it's just 1 request it will involve another DNS look-up.

Here's what the loading of the homepage looks like in Firefox from a US east coast IP.

Granted, a downloaded CSS file can be cached by the browser and used for other pages under the same domain. But, on my blog the bounce rate is about 90%. That doesn't necessarily mean that visitors leave as soon as they arrive, but it does mean that they generally just read one page and then leave. The 10% of visitors who visit more than one page will have to download the same chunk of CSS more than once. But mind you, it's not always the same chunk of CSS because it's different for different pages. And the amount of CSS that is now inline only adds about 2-3Kb to the HTML load when sent gzipped.

Getting to this point wasn't easy because I first had to develop mincss and django-fancy-cache and integrate it all. However, what this means is that you can have it done on your site too! All the code is Open Source and it's all Python and Django which are very popular tools.

A Django cache_page on steroids

Django ships with an awesome view decorator called cache_page. But it's a bit basic too.

What it does is that it stores the whole view response in memcache and the key to it is the URL it was called with including any query string. All you have to do is specify the length of the cache timeout and it just works.
Now, it's got some shortcomings which django-fancy-cache upgrades. These "steroids" are:

  1. Ability to override the key prefix with a callable.
  2. Ability to remember every URL that was cached so you can do invalidation by a URL pattern.
  3. Ability to modify the response before it's stored in the cache.
  4. Ability to ignore certain query string parameters that don't actually affect the view but do yield a different cache key.
  5. Ability to serve from cache but always do one last modification to the response.
  6. Incrementing counter of every hit and miss to satisfy your statistical curiosity needs.
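
To make point 4 concrete, here's a rough sketch of the idea in plain Python. This is not django-fancy-cache's own code or API; the function name and the default list of ignored parameters are made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalized_cache_key(url, ignore=("utm_source", "utm_medium")):
    """Drop ignored query parameters so equivalent URLs share one cache key.

    `ignore` is a hypothetical list of parameters that never affect the view.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignore
    ]
    # Rebuild the URL without the ignored parameters and without a fragment
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    print(normalized_cache_key("http://example.com/post?page=2&utm_source=hn"))
```

Without something like this, every tracking parameter somebody tacks onto a link creates its own copy of the page in the cache.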

The documentation is here:
https://django-fancy-cache.readthedocs.org/

You can see it in a real world implementation by seeing how it's used on my blog here. You basically use it like this:

from django.shortcuts import render
from fancy_cache import cache_page

@cache_page(60 * 60)  # cache the whole response for one hour
def myview(request):
    ...
    return render(request, 'template.html', stuff)

What I'm doing with it here on my blog is that I make full use of caching on each blog post but as soon as a new comment is posted, I wipe the cache by basically creating a new key prefix. That means that pages are never stale, yet the views never have to generate the same content more than once.
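
The "new key prefix" trick can be sketched with a toy in-memory cache. This is only an illustration of the principle, not the code running on my blog and not django-fancy-cache's API; the class and its method names are hypothetical.

```python
class PrefixedCache:
    """Toy in-memory cache whose keys include a per-post version number."""

    def __init__(self):
        self._store = {}
        self._versions = {}

    def _key(self, post_id, url):
        # The version number is part of the key prefix
        return f"post:{post_id}:v{self._versions.get(post_id, 0)}:{url}"

    def get(self, post_id, url):
        return self._store.get(self._key(post_id, url))

    def set(self, post_id, url, html):
        self._store[self._key(post_id, url)] = html

    def invalidate(self, post_id):
        # Bumping the version changes the prefix, so old entries are never
        # looked up again. A real cache would eventually expire them.
        self._versions[post_id] = self._versions.get(post_id, 0) + 1


if __name__ == "__main__":
    cache = PrefixedCache()
    cache.set(1, "/post/1", "<html>old</html>")
    cache.invalidate(1)  # e.g. a new comment was posted
    print(cache.get(1, "/post/1"))
```

The nice thing about this design is that you never have to find and delete the old entries; you just stop asking for them.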

I'm also using django-fancy-cache to do some optimizations on the output before it's stored in cache.

Remember mincss from last month? Well, despite its rather crazy version number, it has only really had one major release. And it's never really been optimized.

So I took some metrics and was able to find out where all the time is spent. It's basically in this:

for body in bodies:
    for each in CSSSelector(selector)(body):
        return True

That in itself, on its own, is very fast. Just a couple of milliseconds. But the problem was that it happens so god damn often!

So, in version 0.8 it now, by default, first makes a list (actually, a set) of every ID and every CLASS name in every node of every HTML document. Then, using this, it gingerly tries to avoid having to use CSSSelector(selector) if the selector is quite simple. For example, if the selector is #container form td:last-child and there is no node with id container, then why bother.
It equally applies the same logic to classes.
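
Here's a rough sketch of that shortcut using only the standard library. It is not the actual mincss code (which works on lxml trees); the class name, the function name and the regular expressions are my own simplifications, and anything that looks complicated is passed on to the full check.

```python
import re
from html.parser import HTMLParser


class IdClassCollector(HTMLParser):
    """Collect every id and every class name seen in the fed HTML."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if name == "id":
                self.ids.add(value)
            elif name == "class":
                self.classes.update(value.split())


ID_RE = re.compile(r"#([\w-]+)")
CLASS_RE = re.compile(r"\.([\w-]+)")


def might_match(selector, ids, classes):
    """Return False only when the selector certainly cannot match."""
    if "[" in selector or ":not(" in selector or "," in selector:
        return True  # too complex for this shortcut; do the full check
    if any(id_ not in ids for id_ in ID_RE.findall(selector)):
        return False
    if any(cls not in classes for cls in CLASS_RE.findall(selector)):
        return False
    return True


if __name__ == "__main__":
    collector = IdClassCollector()
    collector.feed('<div id="main" class="a b"><p class="c">x</p></div>')
    print(might_match("#container form td:last-child", collector.ids, collector.classes))
```

Only when might_match returns True do you need to pay for the expensive CSSSelector(selector) call.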

And now, what you've all been waiting for; the results:

On a big document (20Kb) like my home page...

  1. BEFORE: 4.7 seconds

  2. AFTER: 0.85 seconds

(I ran it a bunch of times and averaged the times which had very little deviation)

So in the first round of optimization it suddenly becomes 500% faster. Pretty cool!

I've made it possible to switch this off just because I haven't yet tested it on as many sites. All the unit tests pass of course.

Remember mincss from a couple of days ago? Now it supports downloading the HTML, to analyze, using PhantomJS. That's pretty exciting because PhantomJS actually supports Javascript. It's a headless (a web browser without a graphical user interface) Webkit engine. What mincss does is that it invokes a simple script like this:

var page = require('webpage').create();
page.open(phantom.args[0], function () {
  console.log(page.content);
  phantom.exit();
});

which will allow any window.onload events to fire which might create more DOM nodes. So, like in this example it'll spit out HTML that contains a <p class="bar"> tag which you otherwise wouldn't get with Python's urllib.urlopen().

The feature was just added (version 0.6.0) and I wouldn't be surprised if there are dragons there because I haven't tried it on a lot of sites. And at the time of writing, I was not able to compile it on my Ubuntu 64bit server so I haven't put it into production yet.

Anyway, with this you can hopefully sprinkle fewer of those /* no mincss */ comments into your CSS.

First of all, to find out what mincss is, read this blog post which explains what the heck this new Python tool is.

My personal website is an ideal candidate for using mincss because it uses an un-customized Bootstrap CSS which weighs over 80Kb (minified) and on every page hit, the rendered HTML is served directly from memcache so dynamic slowness is not a problem. With that, what I can do is run mincss just before the rendered (from Django) output HTML is stored in memcache. Also, what I can do is take ALL inline style blocks and all link tags and combine them into one big inline style block. That means that I can reduce any additional HTTP connections needed down to zero! Remember, "Minimize HTTP Requests" is the number one web performance optimization rule.

To get a preview of that, compare http://www.peterbe.com/about with http://www.peterbe.com/about3. Visually no difference. But view the source :)

Before:
Document size: Before

After:
Document size: After

Voila! One HTTP request less and 74Kb less!

Now, as if that wasn't good enough, let's now take into account that the browser won't start rendering the page until the HTML and ALL CSS is "downloaded" and parsed. Without further ado, let's look at how much faster this is now:

Before:
Waterfall view: Before
report

After:
Waterfall view: After
report

How cool is that! The "Start Render" event is fired after 0.4 seconds instead of 2 seconds!

Note how the "Content Download" isn't really changing. That's because no matter what the CSS is, there's still a tonne of images yet to download.

That example page is interesting too because it contains a piece of Javascript that is fired on the window.onload that creates little permalink links into the document and the CSS it needs is protected thanks to the /* no mincss */ trick as you can see here.

The code that actually implements mincss here is still very rough and is going to need some more polishing up until I publish it further.

Anyway, I'm really pleased with the results. I'm going to tune the implementation a bit further and eventually apply this to all pages here on my blog. Yes, I understand that the CSS, if implemented as a link, can be reused thanks to the browser's cache but visitors of my site rarely check out more than one page. In fact, the number of "pages per visit" on my blog is 1.17 according to Google Analytics. Even if this number was bigger I still think it would be a significant web performance boost.

UPDATE

Steve Souders points out a flaw in the test. See his full comment below. Basically, what appears to happen in the first report is that IE8 downloads the file c98c3dfc8525.css twice even though it returns as a 200 the first time. No wonder that delays the "Start Render" time.

So, I re-ran the test with Firefox instead (still from the US East coast):

Before:
WebpageTest before (Firefox)
report

After:
WebpageTest after (Firefox)
report

That still shows a performance boost from 1.4 seconds down to 0.6 seconds when run using Firefox.

Perhaps it's a bug in Webpagetest or perhaps it's simply how IE8 works. In a sense it "simulates" the advantages of reducing the dependency on extra HTTP requests.

A project I started before Christmas (i.e. about a month ago) is now production ready.

mincss (code on github) is a tool that when given a URL (or multiple URLs) downloads that page and all its CSS and compares each and every selector in the CSS and finds out which ones aren't used. The outcome is a copy of the original CSS but with the selectors not found in the document(s) removed. It goes something like this:

>>> from mincss.processor import Processor
>>> p = Processor()
>>> p.process_url('http://www.peterbe.com')
>>> p.process()
>>> p.inlines
[]
>>> p.links
[<mincss.processor.LinkResult object at 0x10a3bbe50>, <mincss.processor.LinkResult object at 0x10a4d4e90>]
>>> one = p.links[0]
>>> one.href
'//d1ac1bzf3lrf3c.cloudfront.net/static/CACHE/css/c98c3dfc8525.css'
>>> len(one.before)
83108
>>> len(one.after)
10062
>>> one.after[:70]
u'header {display:block}html{font-size:100%;-webkit-text-size-adjust:100'

To whet your appetite: running it on any one of the pages here on my blog, the CSS goes from 82Kb down to 7Kb. Before you say anything; yes, I know it's because I'm using a massive (uncustomized) Twitter Bootstrap file that contains all sorts of useful CSS of which I'm not using more than 10%. And yes, those 10% on one page might be different from the 10% on another page and between them it's something like 15%. Add a third page and it's 20% etc. But, because I'm just doing one page at a time, I can be certain it will be enough.

One way of using mincss is to run it on the command line and look at the output, then audit it to get an idea of which selectors aren't used. A safer way is to just do one page at a time.

The way it works is that it parses the CSS payload (from inline blocks or link tags) with a relatively advanced regular expression and then loops over each selector one at a time and runs it with cssselect (which uses lxml) to see if the selector is used anywhere. If the selector isn't used the selector is removed.

I know I'm not explaining it well so I put together a little example implementation which you can download and run locally just to see how it works.

Now, regarding Javascript and DOM manipulations and stuff; there's not a lot you can do about that. If you know exactly what your Javascript does, for example, creating a div with class loggedin-footer you can prepare your CSS to tell mincss to leave it alone by adding /* no mincss */ somewhere in the block. Again, look at the example implementation for how this can work.
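
To give a feel for the loop and the /* no mincss */ escape hatch, here's a heavily simplified sketch. It is not mincss's real implementation: the regular expression only handles flat rules (no @media nesting), and is_used is a hypothetical callable standing in for the cssselect lookup.

```python
import re

# Matches "selectors { body }" for flat rules only (an assumption of this sketch)
RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")


def prune_css(css, is_used):
    """Drop every selector for which is_used(selector) is false."""
    out = []
    for match in RULE_RE.finditer(css):
        selectors, body = match.group(1).strip(), match.group(2)
        if "/* no mincss */" in body:
            # Protected block: keep it untouched
            out.append(f"{selectors}{{{body}}}")
            continue
        kept = [s.strip() for s in selectors.split(",") if is_used(s.strip())]
        if kept:
            out.append(f"{', '.join(kept)}{{{body}}}")
    return "\n".join(out)


if __name__ == "__main__":
    css = (
        "p{color:red}\n"
        ".unused{color:blue}\n"
        ".loggedin-footer{/* no mincss */color:green}\n"
        "h1, .gone{margin:0}"
    )
    # Pretend only "p" and "h1" were found in the HTML
    print(prune_css(css, lambda selector: selector in {"p", "h1"}))
```

Note how .loggedin-footer survives even though the fake is_used never reports it as found.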

An alternative to using urllib.urlopen() is to use a headless browser like PhantomJS, which will render the page with some Javascript, but you'll never cover all bases. For example, your page might have something like this:

$(function() {
  $.getJSON('/is-logged-in', function(res) {
    if (res.logged_in) {
      $('<div class="loggedin-footer">').appendTo($('#footer'));
    }
  });
});

But let's not focus on what it can not do.

I think this can be a great tool for all of us who either just download a bloated CSS framework or have legacy CSS that hasn't been updated as HTML is added and removed.

The code is Open Source (of course) and patiently awaiting your pull requests. There's almost full test coverage and there's still work to be done to improve the code such as finding more bugs and optimizing.

Using the proxy with '?MINCSS_STATS=1'
Also, there's a rough proxy server you can start that attempts to run it on any URL. You start it like this:

pip install Flask
cd mincss/proxy
python app.py

and then you just visit something like http://localhost:5000/www.peterbe.com/about and you can see it in action. That script needs some love since it's using lxml to render the processed output which does weird things to some DOM elements.

I hope it's of use to you.

UPDATE

Published a blog post about using mincss in action

UPDATE 2

mincss now supports downloading using PhantomJS which means that Javascript rendering will work. See this announcement

UPDATE 3

Version 0.8 is 500% faster now for large documents. Make sure you upgrade!

Here's a business idea that I've not seen implemented and which I likely won't have time to attempt:

An app for statistically figuring out which car you should buy.

Like Hot or Not it shows you one car at a time (at random) with a variable (also at random). The variable will be turned into a question. The question will be something like: "What about the price of this?" and it's a picture of a Toyota Prius 2013 with its price. Three buttons to choose: "Too expensive", "About right", "Too cheap".

Next, it's a different car and a different variable. For example, a Volvo XC90 with the question "What about the looks of this?" and, again, three buttons: "Too ugly", "About right", "Too sexy".

And so on... You can keep going, answering more questions, or you can stop and check out your result. Obviously, the more you answer the better the suggestion. You might want to help the user with this so they don't answer too few.

Then when you present the result you can, on that page, show a bunch of affiliate links to various local dealerships where you can buy the ideal car for you. Additionally, if the app becomes successful I'm sure you can easily sell advertisement to car companies who would love to show their ads depending on certain variables. E.g. Honda Fits for those who answer that they want high MPG and small cars.

The algorithm shouldn't be too hard to figure out. I'm sure you can get a lot of mileage just by doing a weighted average on the totals. If you sit down and think about it some more I'm sure you can fit some better established algorithm or something from the neural networks if you lay out your results as a matrix.

That's about it. I don't know where to get the pictures and specs for each car but I'm sure one can scrape from various sites and/or seed some of it manually.

It's the kind of app where you can start small (assuming you have at least 100 cars and 3-6 facts about each car). Also, it doesn't depend on having a bunch of traffic already so you don't need to worry so much about the chicken & egg predicament.

Do you think it could fly?

If the number 1 rule for making faster websites is to "Minimize HTTP Requests", then, let's try it.

On this site, almost all pages are served entirely from memcache. Django renders the template with the database content and the generated HTML is cached. So I thought I'd insert a little post-processing script that converts all <img src="...something..."> into <img src="data:..."> which basically means the HTML gets as fat as the sum of all referenced images combined.

It's either 10Kb HTML followed by (roughly) 10 x 30Kb images or it's 300Kb HTML and 0 images. The result is here: http://www.peterbe.com/about2 (open and view source)

You can read more about the Data URI scheme here if you're not familiar with how it works.
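
For a rough idea of what such a post-processing step looks like, here's a sketch. It is not the hack running on this site; the regular expression is simplistic and load is a hypothetical callable that you'd have to supply to turn an image URL into its bytes and MIME type.

```python
import base64
import re

# Simplistic: assumes double-quoted src attributes
IMG_SRC_RE = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")')


def to_data_uri(data: bytes, mime: str) -> str:
    """Encode raw image bytes as a base64 data URI."""
    return f"data:{mime};base64,{base64.b64encode(data).decode('ascii')}"


def inline_images(html, load):
    """Replace each <img src> with a data URI.

    `load` is a hypothetical callable mapping a URL to (bytes, mime type).
    """
    def replace(match):
        url = match.group(2)
        if url.startswith("data:"):
            return match.group(0)  # already inlined
        data, mime = load(url)
        return match.group(1) + to_data_uri(data, mime) + match.group(3)

    return IMG_SRC_RE.sub(replace, html)


if __name__ == "__main__":
    fake_load = lambda url: (b"abc", "image/png")  # stand-in for reading the file
    print(inline_images('<p><img alt="x" src="/a.png"></p>', fake_load))
```

Base64 turns every 3 bytes into 4 characters, which is why the HTML balloons before gzip gets to work on it.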

The code is a hack but that's after all what a personal web site is all about :)

So, how much slower is it to serve? Well, actual server-side render time is obviously slower but it's a process you only have to do a small fraction of the total time since the HTML can be nicely cached.

Running..
ab -n 1000 -c 10 http://www.peterbe.com/about

BEFORE:

Document Path:          /about
Document Length:        12512 bytes

Concurrency Level:      10
Time taken for tests:   0.314 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      12779000 bytes
HTML transferred:       12512000 bytes
Requests per second:    3181.36 [#/sec] (mean)
Time per request:       3.143 [ms] (mean)
Time per request:       0.314 [ms] (mean, across all concurrent requests)
Transfer rate:          39701.75 [Kbytes/sec] received

AFTER:

Document Path:          /about2
Document Length:        306965 bytes

Concurrency Level:      10
Time taken for tests:   1.089 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      307117000 bytes
HTML transferred:       306965000 bytes
Requests per second:    918.60 [#/sec] (mean)
Time per request:       10.886 [ms] (mean)
Time per request:       1.089 [ms] (mean, across all concurrent requests)
Transfer rate:          275505.06 [Kbytes/sec] received

So, it's basically 292Mb transferred instead of 12Mb in the test and the requests per second is a third of what it used to be. But it's not too bad. And with web site optimization, what matters is the individual user's impression, not how much or how little the server can serve multiple users.

Next, how does the waterfall of this look?

BEFORE:

WebPagetest WebpageTest before

Pingdom Tools Pingdom Tools before

AFTER:

WebPagetest WebpageTest after

Pingdom Tools Pingdom Tools after

Note! All images when served individually (the "before" version) are all served from a fast CDN. The HTML is served from London, United Kingdom and the Webpagetest was run from Virginia, USA.

What can we conclude from this:

  • It worked! There are fewer requests. 18 requests become 6 requests.
  • The "Start Render" event fires significantly earlier.
  • The "Document Complete" event happens slightly earlier
  • The total file size goes from 286Kb to 283Kb!
  • Before: First load takes 2 seconds, repeated view takes 0.4 seconds
  • After: First load takes 2 seconds, repeated view takes 2 seconds :(
  • Pingdom Tools sums the kilobytes which gives a rounding error compared to WebPagetest

Some more thoughts and conclusions:

If you're wondering how the total file size is the same as before (sum of html + images) it's because all images are base64-encoded into one large document which gzip presumably does better on. If there were fewer images I'd suspect the second version would be slightly bigger in total.
Apparently the base64 version + gzip is supposed to be 2-5% bigger than the original JPG/PNG individually.

Don't do this at home kids if you don't have a good server-side cache and a good web server that serves the HTML gzipped.

Although the code I put in place to make this possible is, right now, pretty ugly it is after all pretty convenient to the developer because it's like a plugin you just add to the rendering. You don't even notice this going on in the template or in the view code. However...

More work is needed, namely for the IE <= 7 guys. Basically, Internet Explorer 7 and older don't support data URIs at all so you need a shim for them that looks something like this:

<!--[if lte IE 7]>
<script>
// Point every image back at its original URL, kept in data-orig-src
$('img').each(function() {
  $(this).attr('src', $(this).data('orig-src'));
});
</script>
<![endif]-->


It would need some love and work but the principle is there and it's sound.

Or, just ignore them. After all, only 3% of my visitors are on IE8 and only 0.5% are on IE7. At least they can read the text. This brutal exclusion isn't always an option. But the shim is.

I think I'm going to keep it. The code needs to be packaged up and made neat before I stick to it. There are a lot more interesting things one can do with this. For example, you could in a post processor optimize the CSS used by inspecting the DOM to see which selectors can be dumped.

UPDATE

Some really valuable comments below have pointed out that using data URIs causes memory bloat in Gecko which means that it might be particularly harmful for people with multiple tabs or using mobile devices.

Hmm... back to the drawing board a bit I guess.