Peterbe.com

A blog and website by Peter Bengtsson

tl;dr I use bgg to shortcut a lot of tedious git commands.

Once a certain pattern appears where you find yourself doing the same thing over and over, the first thing that should spring to mind is: let's automate that!

So a couple of years ago I started writing simple Python scripts that would wrap various git operations so I could do things like G merge or G rebase. That has helped me tremendously, and when I first showed these scripts to some people I was amazed how unimpressed they were. I guess that's because they have their own scripts, or a geeky reluctance to adopt someone else's shortcuts unless you've personally been a part of going from tedious to shortcut.

So, a crucial part of my work here at Mozilla is to look at a Bugzilla bug, start a topic branch based on it and, when it's done, push that into a Pull Request on GitHub.

The first command is G start. It takes a single optional argument. If an argument is provided it has to be a Bugzilla bug number. If you supply a Bugzilla ID it will fetch the title of that bug (assuming you're online) and store that so that it can be used to mention it in the git commit message. For example:

(airmozilla):~/dev/MOZILLA/AIRMOZILLA/airmozilla (master)$ G start 1174316
You're currently on branch master
Summary ["Start duration fetching when stopping a live event"]:
Switched to a new branch 'bug-1174316-start-duration-fetching-when-stopping-a-live-event'

The git branch name becomes a "slugified" version of the bug summary. But note, it merely sets the default. I could override it if I want to.
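The tool's own slugifying code isn't shown here, so the following is only a minimal sketch of how a default branch name could be derived from a bug number and summary. The function name branch_name and the exact slug rules are assumptions for illustration.

```python
import re


def branch_name(bug_id, summary):
    """Build a default topic-branch name from a bug number and its summary.

    Hypothetical helper: the real tool may slugify differently.
    """
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    return "bug-{}-{}".format(bug_id, slug)


print(branch_name(1174316, "Start duration fetching when stopping a live event"))
```

With the summary from the example above, this prints the same branch name that git reported.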

Then you do some work on it and when you're done you type the next command: G commit. It basically runs git commit -a -m "..." using the bug number and the bug summary, optionally asks if you want to prefix the commit message with fixes, and then pushes it to your fork. The example speaks for itself:

(airmozilla):~/dev/MOZILLA/AIRMOZILLA/airmozilla (bug-1174316-start-duration-fetching-when-stopping-a-live-event *)$ G commit
MSG:
    bug 1174316 - Start duration fetching when stopping a live event

OK? [Y/n]
Add the 'fixes ' prefix? [N/y] y
NOW, feel free to run:

git checkout master
git merge bug-1174316-start-duration-fetching-when-stopping-a-live-event
git branch -d bug-1174316-start-duration-fetching-when-stopping-a-live-event

OR

git push peterbe bug-1174316-start-duration-fetching-when-stopping-a-live-event

Run that push? [Y/n]
To git@github.com:peterbe/airmozilla.git
 * [new branch]      bug-1174316-start-duration-fetching-when-stopping-a-live-event -> bug-1174316-start-duration-fetching-when-stopping-a-live-event

You get the picture. It's interactive and mostly you just hit enter and it does stuff saving you copious milliseconds.

Other noteworthy commands:

G rebase - whilst on a branch, jumps over to the master branch, updates from the origin, then goes back to the branch you were on preparing you for an interactive git rebase.

G merge - goes over to the master branch, merges the branch you were on and if it works out, deletes the branch.

G getback - you're in a branch you know was merged (using GitHub's green merge button), it switches to the master branch, updates master and deletes the local topic branch (that was merged) and deletes the remote topic branch on your fork.

G cleanup [search] - you're on some branch other than the one you search for. It finds that branch (if there's only 1 match) and does what G getback does.

G branches [search] - lists all your branches sorted so the most recently worked on comes last. It also indicates how long ago you worked on each one and whether it has already been merged.
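To give a feel for how little such a wrapper needs to do, here is a rough sketch of the sequence that something like G rebase might run. The function name rebase_plan and the defaults (master, origin) are assumptions, and the real script may well use different git commands.

```python
def rebase_plan(branch, master="master", remote="origin"):
    """Return the git commands a 'G rebase'-style wrapper might run, in order.

    Hypothetical sketch: it only builds the commands, it doesn't execute them.
    """
    return [
        ["git", "checkout", master],      # jump over to the master branch
        ["git", "pull", remote, master],  # update it from the origin
        ["git", "checkout", branch],      # go back to the topic branch
        ["git", "rebase", "-i", master],  # the interactive rebase you're now ready for
    ]


# "bug-123-example" is a made-up branch name.
for command in rebase_plan("bug-123-example"):
    print(" ".join(command))
```

Returning the commands as lists keeps the sketch easy to test. Each list could be handed to subprocess.check_call to actually run it.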

The reason I'm mentioning this isn't to convince you to use my tool to do your git but perhaps to inspire you to write your own scripts that wrap things you find yourself doing repetitively.

I know my own battle isn't over. I'm still finding things that I have to do additionally on an almost perfectly predictable basis. Thankfully I now have an infrastructure to add more scripting.

It's the old problem of "Do I seek permission or ask for forgiveness?". It's rarely easy to know which one to use in Python because working with exceptions in Python is so damn easy.

Generally I prefer neither. I.e. just do. Don't write defensive code if you don't have to. Only seek permission or ask for forgiveness if you expect the failure to happen and consider that normal.

Consider the following three functions:

from math import pi as PI  # assumption: the post never shows how PI is defined


def f0(x):
    return PI / x


def f1(x):
    if x != 0:
        return PI / x
    else:
        return -1


def f2(x):
    try:
        return PI / x
    except ZeroDivisionError:
        return -1

Which one do you think is the fastest? If I run this 1,000,000 times and never pass in x=0, will it make any difference?

Before you look at it, what do you think the result will be?


The answer is below.


Read on.


Scroll down for the results.


Have you made a guess yet?


What do you think it's going to be?


Scroll some more.


Almost there!


Ok, the results are as follows when running each of the above mentioned functions ~33,000,000 times on my MacBook:

f0 4.16087803245
f1 4.84187698364
f2 4.73760977387
(smaller is better)

Conclusion: the difference is minuscule. The fastest is to not do any exception handling or condition checking, but it's generally no big difference.

This test was done with Python 2.7.9. You can try the code for yourself.
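The original benchmark script isn't reproduced in this text, so here is a minimal sketch of how such a comparison could be timed with the standard library's timeit module. The benchmark helper, the argument value 3 and the definition of PI are assumptions. The absolute numbers will differ from the ones above because of the lambda overhead and different hardware.

```python
import timeit
from math import pi as PI  # assumption: the post doesn't show how PI is defined


def f0(x):
    return PI / x


def f1(x):
    if x != 0:
        return PI / x
    else:
        return -1


def f2(x):
    try:
        return PI / x
    except ZeroDivisionError:
        return -1


def benchmark(functions, number=1000000):
    """Time each function called `number` times with a non-zero argument."""
    results = {}
    for function in functions:
        # The lambda adds the same small overhead to every candidate.
        timer = timeit.Timer(lambda: function(3))
        results[function.__name__] = timer.timeit(number=number)
    return results


if __name__ == "__main__":
    for name, seconds in sorted(benchmark([f0, f1, f2]).items()):
        print(name, seconds)  # smaller is better
```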

Just one more thought

As I wrote this post I started thinking more and more about the "code style aspect" rather than the performance.

Basically, I think it boils down to the following rules:

  1. If you're working with external I/O (e.g. network or a database) use the "ask for forgiveness" approach (aka. exception wrapping). I.e. don't do if requests.head(url).status_code == 200: stuff = requests.get(url)

  2. If you want to make a really user-friendly Python API, use the "seek permission" approach (aka. if-statement first). E.g. def calculate(guests): if isinstance(guests, basestring): guests = [guests]

  3. For everything else, just do. That makes the code more Pythonic. If you have a sub-routine that sends a variable of a totally crazy-wrong type to your function, don't change the function, change the sub-routine.
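As a rough illustration of rules 1 and 2, here is a small sketch. To keep it runnable without a network, rule 1 is shown with a local file instead of an HTTP request. Both function names are made up for this example, and str is used where Python 2 would use basestring.

```python
import json


def load_settings(path):
    """Rule 1: with I/O, just try it and ask for forgiveness if it fails."""
    try:
        with open(path) as f:
            return json.load(f)
    except IOError:
        # Checking os.path.exists(path) first would still leave a race
        # between the check and the open.
        return {}


def calculate(guests):
    """Rule 2: a forgiving API that accepts one name or a list of names."""
    if isinstance(guests, str):  # basestring on Python 2
        guests = [guests]
    return len(guests)


print(load_settings("/no/such/file.json"))
print(calculate("Peter"))
print(calculate(["Peter", "Ashley"]))
```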

UPDATE

Here are the numbers for PyPy:

f0 0.369750552707
f1 0.321069081624
f2 0.411438703537
(smaller is better)

That's after averaging 15 runs of the script.

Note that the function with the extra if statement is faster.

And here are the numbers for Python 3.4.2:

f0 4.99579153742
f1 5.77459328515
f2 5.38382162367
(smaller is better)

That's averaging 10 rounds.

One almost interesting thing about these numbers is that their sums are different, which tells us a tiny story about performance for the language:

Python 2.7.9   13.74036478996
PyPy 2.4.0     1.102258337868
Python 3.4.2   16.15420644624
(smaller is better)

UPDATE 2

Here's the Node equivalent and its times:

f0 0.215509441
f1 0.228280196357
f2 0.316222934714
(smaller is better)

That means that my Node v0.10.35 is 45% faster than PyPy. But please, don't take that seriously.

I just pushed out a new release of premailer which comes with a pretty big change.

The change is in the way the base_url and any href= or src= get combined. For example, you used to be able to set Premailer(html, base_url='http://example.com/subfolder') and combined with <img src="/images/foo.png"> it would become <img src="http://example.com/subfolder/images/foo.png">.

Not any more. The joining now works exactly like urljoin() in Python's standard library. E.g.

>>> from urllib.parse import urljoin  # python 3
>>> urljoin('https://example.com', '/image.png')
'https://example.com/image.png'
>>> urljoin('https://example.com/subfolder', '/image.png')
'https://example.com/image.png'
>>> urljoin('https://example.com/subfolder/', '/image.png')
'https://example.com/image.png'
>>> urljoin('https://example.com/subfolder/', '//image.png')
'https://image.png'
>>> urljoin('https://example.com/subfolder/', '//mycdn.com/image.png')
'https://mycdn.com/image.png'
>>> urljoin('http://example.com/subfolder/', '//mycdn.com/image.png')
'http://mycdn.com/image.png'
>>> urljoin('https://example.com/subfolder', 'image.png')
'https://example.com/image.png'
>>> urljoin('https://example.com/subfolder/', 'image.png')
'https://example.com/subfolder/image.png'

So basically, if you think you tried to do something odd with your base_url, check it over carefully when you upgrade to version 2.9.0.

Thank you @ewjoachim and @graingert for your help!

The idea with template context processors in Django is to inject some default things to be available when rendering a template that is rendered with a request.

I.e. instead of...:

def view1(request):
    context = {
        'name': 'View 1', 
        'on_dev_server': request.get_host() in settings.DEV_HOSTNAMES
    }
    return render(request, 'view1.html', context)

def view2(request):
    context = {
        'name': 'View 2', 
        'other': 'things', 
        'on_dev_server': request.get_host() in settings.DEV_HOSTNAMES
    }
    return render(request, 'view2.html', context)

And in your nominal templates/base.html you might have something like this:

  ...
  <footer>
  <p>&copy; You 2015</p>
  {% if on_dev_server %}
    <p color="red">Note! We're currently on a dev server!</p>
  {% endif %}
  </footer>
  ...

Instead you do this trick; in your settings.py you write down the list of defaults plus the one you want to always have available:

TEMPLATE_CONTEXT_PROCESSORS = (
    "django.contrib.auth.context_processors.auth",
    "django.template.context_processors.static",
    "myproject.myapp.context_processors.debug_info",
)

And to accompany that you define your myproject/myapp/context_processors.py like so:

from django.conf import settings


def debug_info(request):
    return {
        'on_dev_server': request.get_host() in settings.DEV_HOSTNAMES,
    }

So far so good.

However, there's a problem with this. Two problems in fact.

The first problem is that when all the templates in your big complicated website render, it's quite possible that some pages don't need everything you set up in your context processors. That might mean a heck of a lot of extra computation for something that won't ever be displayed.

For example, I have a project where most pages have a sidebar where I show "Trending Events" which is something I compute in a context_processors.py function called def sidebar_events(request):. But the sidebar is not always shown and on the pages where it's not shown it's a waste to compute the stuff that sidebar_events computes. Also, I have management pages which use a totally different base.html template. So there's a big chance you're wasting precious CPU.

Another problem is that of code-readability (aka. how frustrating is this to debug for someone else or yourself after months of inactivity). If you're skimming through your base.html and you see this "random" variable called on_dev_server it's very very hard to tell where the heck that's defined. Hopefully grepping the whole source code is a way to go. A much better way to solve that problem would be sensible namespace naming.

And also, by being too liberal with globally scoped variables there's a chance you might clash with a different piece of functionality that uses the same variable names. That chance is smaller when you use namespaces.

So, to remedy this, let your template context processor functions return closures. The closure wraps the request automagically.

Let's rewrite our trivial example from above, the context_processors.py should now look like this:

def debug_info(request):
    def inner():
        return {
            'on_dev_server': request.get_host() in settings.DEV_HOSTNAMES,
        }
    return {'debug_info': inner}

Now executing that becomes more optional and more deliberate in the template instead. E.g., using Jinja2 syntax:

  ...
  <footer>
  <p>&copy; You 2015</p>
  {% set debug_info = debug_info() %}
  {% if debug_info['on_dev_server'] %}
    <p color="red">Note! We're currently on a dev server!</p>
  {% endif %}
  </footer>
  ...

This makes it more explicit, which is a good thing. The computation can also be skipped entirely in templates that don't need the stuff in there.
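To see that nothing is computed until the template asks for it, here is a small self-contained sketch that runs without Django. FakeRequest and the module-level DEV_HOSTNAMES are hypothetical stand-ins for Django's request object and settings.DEV_HOSTNAMES.

```python
class FakeRequest(object):
    """Hypothetical stand-in for Django's HttpRequest that counts lookups."""

    def __init__(self, host):
        self.host = host
        self.calls = 0

    def get_host(self):
        self.calls += 1
        return self.host


DEV_HOSTNAMES = ("localhost:8000",)  # stands in for settings.DEV_HOSTNAMES


def debug_info(request):
    def inner():
        # Only runs when a template actually calls debug_info().
        return {"on_dev_server": request.get_host() in DEV_HOSTNAMES}
    return {"debug_info": inner}


request = FakeRequest("localhost:8000")
context = debug_info(request)
print(request.calls)                             # 0: nothing computed yet
print(context["debug_info"]()["on_dev_server"])  # True
print(request.calls)                             # 1: computed on demand
```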

Starting today, (almost) all the thumbnails below the fold on Air Mozilla are not loaded until you scroll down to them.

The way it works is that I use a library called Lazyr.js which notices when you scroll down and, when certain pictures are about to come into view, changes the <img> tag's src.

So it basically looks like this:

<article>
  <h3>Event 1</h3>
  <img src="event1.png">
</article>

<article>
  <h3>Event 2</h3>
  <img src="event2.png">
</article>

<article>
  <h3>Event 3</h3>
  <img src="event3.png">
</article>

<article>
  <h3>Event 4</h3>
  <img src="placeholder.png" data-lazyr="event4.png">
</article>

<article>
  <h3>Event 5</h3>
  <img src="placeholder.png" data-lazyr="event5.png">
</article>

<article>
  <h3>Event 6</h3>
  <img src="placeholder.png" data-lazyr="event6.png">
</article>

That means that to load this page, the browser only needs to download:

event1.png
event2.png
event3.png
placeholder.png

Only 4 images instead of the otherwise 6 (in this example).

When you scroll down to see the rest of the list, it then also downloads:

event4.png
event5.png
event6.png

The actual numbers on Air Mozilla are that there are 10 events per page and I lazy load 6 of them.

You can see the results when comparing this WebPageTest with this one.

There is more work to do though. At the moment, the thumbnails in the sidebar (Trending and Upcoming events) are above the fold when you're browsing but below the fold when you're viewing an individual event. That's something I have yet to implement.

Something tells me there are already solutions like this out there that are written by much smarter people who have tests and package.json etc. Perhaps my Friday-brain failed at googling them up.

So, the issue I'm having is with an Angular app that uses ui-router to switch between controllers.

In almost every controller it looks something like this:

app.controller('Ctrl', function($scope, $http) {
  /* The form that needs this looks something like this:
      <input name="first_name" ng-model="stuff.first_name">
   */
  $scope.stuff = {};
  $http.get('/stuff/')
  .success(function(response) {
    $scope.stuff = response.stuff;
  })
  .error(function() {
    console.error.apply(console, arguments);
  });
})

(note; ideally you push this stuff into a service, but doing it here in the controller illustrates what matters in this point)

So far so good. But so far so slow.

Every time the controller is activated, the AJAX GET is fired and it might be slow because of network latency.
And I might switch to this controller repeatedly within one request/response session of loading the app.

So I wrote this:

app.service('localProxy',
    ['$q', '$http', '$timeout',
    function($q, $http, $timeout) {
        var service = {};
        var memory = {};

        service.get = function(url, store, once) {
            var deferred = $q.defer();
            var already = memory[url] || null;
            if (already !== null) {
                $timeout(function() {
                    if (once) {
                        deferred.resolve(already);
                    } else {
                        deferred.notify(already);
                    }
                });
            } else if (store) {
                already = sessionStorage.getItem(url);
                if (already !== null) {
                    already = JSON.parse(already);
                    $timeout(function() {
                        if (once) {
                            deferred.resolve(already);
                        } else {
                            deferred.notify(already);
                        }
                    });
                }
            }

            $http.get(url)
            .success(function(r) {
                memory[url] = r;
                deferred.resolve(r);
                if (store) {
                    sessionStorage.setItem(url, JSON.stringify(r));
                }
            })
            .error(function() {
                deferred.reject(arguments);
            });
            return deferred.promise;
        };

        service.remember = function(url, data, store) {
            memory[url] = data;
            if (store) {
                sessionStorage.setItem(url, JSON.stringify(data));
            }
        };

        return service;
    }]
)

And the way you use it is that it basically returns twice. First from the "cache", then from the network request response.

So, after you've used it at least once, when you request data from it, you first get the cached stuff (from memory or from the browser's sessionStorage) then a little bit later you get the latest and greatest response from the server. For example:

app.controller('Ctrl', function($scope, $http, localProxy) {
  $scope.stuff = {};
  localProxy.get('/stuff/')
  .then(function(response) {
    // network response
    $scope.stuff = response.stuff;
  }, function() {
    // reject/error
    console.error.apply(console, arguments);
  }, function(response) {
    // update
    $scope.stuff = response.stuff;
  });
})

Note how it sets $scope.stuff = response.stuff twice. That means that the page can load first with the cached data and shortly after the latest and greatest from the server.
You get to look at something whilst waiting for the server but you don't have to worry too much about cache invalidation.
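The same "cached first, fresh second" idea can be sketched outside Angular. Below is a toy, synchronous Python analogue. It is not a port of the service above: the class name, the fetch callable (standing in for $http.get) and the from_cache flag passed to the callback are assumptions, and sessionStorage is left out.

```python
class LocalProxy(object):
    """Toy, synchronous analogue of the Angular service above."""

    def __init__(self, fetch):
        self.fetch = fetch  # callable(url) -> data; stands in for $http.get
        self.memory = {}

    def get(self, url, callback, once=False):
        cached = self.memory.get(url)
        if cached is not None:
            callback(cached, True)    # first call: straight from the cache
        fresh = self.fetch(url)       # the "network" request always happens
        self.memory[url] = fresh
        if not (once and cached is not None):
            callback(fresh, False)    # second call: the fresh response


# Made-up responses: the server returns v1 first and v2 the next time.
versions = iter([{"stuff": "v1"}, {"stuff": "v2"}])
proxy = LocalProxy(lambda url: next(versions))
seen = []


def record(data, from_cache):
    seen.append((data["stuff"], from_cache))


proxy.get("/stuff/", record)  # cold cache: only the fresh response
proxy.get("/stuff/", record)  # warm cache: cached v1, then fresh v2
print(seen)  # [('v1', False), ('v1', True), ('v2', False)]
```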

Sure, there is a risk. If your server response is multiple seconds slow, your user might, for example, start typing something into a form (once it's loaded from cache) and when the network request finally resolves, what they typed in is overwritten or conflicting.

The solution to that problem is that you perhaps put the form in a read-only mode until the network request resolves. At least you get something to look at sooner rather than later.

The default implementation above doesn't store things in sessionStorage. It just stores it in memory as you're flipping between controllers. Alternatively, you might want to use a more persistent approach so then you instead use:

controller( // same as above
  localProxy.get('/stuff/', true)
  // same as above
)

Sometimes there's data that is very unlikely to change. Perhaps you just need the payload for a big drop-down widget or something. In that case, it's fine if it exists in the cache and you don't need a server response. Then set the third parameter to true, like this:

controller( // same as above
  localProxy.get('/stuff/', true, true)
  // same as above
)

This way, it won't fire twice. Just once.

Another interesting expansion on this is if you change the data after it comes back. A good example is if you request data to fill in a form that the user updates. After the user has changed some of it, you might want to pre-emptively cache that too. Here's an example:

app.controller('Ctrl', function($scope, $http, localProxy) {
  $scope.stuff = {};
  var url = '/stuff/';
  localProxy.get(url)
  .then(function(response) {
    // network response
    $scope.stuff = response.stuff;
  }, function() {
    // reject/error
    console.error.apply(console, arguments);
  }, function(response) {
    // update
    $scope.stuff = response.stuff;
  });

  $scope.save = function() {
      // update the cache
      localProxy.remember(url, $scope.stuff); 
      $http.post(url, $scope.stuff);
  };
})

What do you think? Is it useful? Is it "bonkers"?

I can think of one possible beautification, but I'm not entirely sure how to accomplish it.
Thing is, I like the API of $http.get: it returns a promise with functions called success, error and finally. The ideal API would look something like this:

app.controller('Ctrl', function($scope, $http) {
  $scope.stuff = {};
  // angular's $http service expanded
  $http.getLocalProxy('/stuff/')
  .success(function(cached, response) {
    /* Imagine something like:
        <p class="warning" ng-if="from_cache">Results you see come from caching</p>
     */
    $scope.from_cache = cached;
    $scope.stuff = response.stuff;
  })
  .error(function() {
    console.error.apply(console, arguments);
  });
})

That API looks and feels just like the regular $http.get function but with an additional first argument to the success promise callback.

Now that Autocompeter.com is launched I can publish some preliminary benchmarks of "real" usage. It's all on my MacBook Pro on a local network with a local Redis but it's quite telling that it's pretty fast.

What I did was I started with a completely empty Redis database then I did the following things:

First of all, I bulk load in 1035 "documents" (110Kb of data). This takes about 0.44 seconds consistently!

  1. GET on the home page (not part of the API and thus quite unimportant in terms of performance)
  2. GET on a search with a single character ("p") expecting 10 results (e.g. /v1?d=mydomain&q=p)
  3. GET on a search with a full word ("python") expecting 10 results
  4. GET on a search with a full word that isn't in the index ("xxxxxxxx") expecting 0 results
  5. GET on a search with two words ("python", "te") expecting 4 results
  6. GET on a search with two words that aren't in the index ("xxxxxxx", "yyyyyy") expecting 0 results

In each benchmark I use wrk with 10 connections, lasting 5 seconds, using 5 threads.

And for each round I try with 1 processor, 2 processors and 8 processors (my laptop's max according to runtime.NumCPU()).

I ran it a bunch of times and recorded the last results for each number of processors.
The results are as follows:

Autocompeter Benchmark v1

Notes

  • Every search incurs a write in the form of incrementing a counter.
  • Searching on more than one word causes a ZINTERSTORE.
  • The home page does a bit more now than when this benchmark was made. In particular, it looks for a secure cookie.
  • Perhaps Redis could get faster internally if you run the benchmarks repeatedly after the bulk load so its internals can "warm up".
  • I used a pool of 100 Redis connections.
  • I honestly don't know if 10 connections, 5 seconds, 5 threads is an ideal test :)
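For readers who haven't used ZINTERSTORE: it intersects sorted sets and, by default, sums the scores of the members they share. The sketch below is a toy in-memory analogue using plain dicts, not Autocompeter's actual data model. The index layout, the "searches" counter key and the sample titles are all made up for illustration.

```python
def zinterstore(*sorted_sets):
    """Intersect dicts of member -> score, summing scores (Redis's default)."""
    common = set(sorted_sets[0])
    for other in sorted_sets[1:]:
        common &= set(other)
    return {member: sum(s[member] for s in sorted_sets) for member in common}


def search(index, counters, words, limit=10):
    """Look up each word's set, intersect them, return highest score first."""
    if not words:
        return []
    # Every search incurs a write: here, a hypothetical "searches" counter.
    counters["searches"] = counters.get("searches", 0) + 1
    sets = [index.get(word.lower(), {}) for word in words]
    matches = zinterstore(*sets) if len(sets) > 1 else dict(sets[0])
    return sorted(matches, key=lambda m: (-matches[m], m))[:limit]


# Made-up index: typed term -> {document title: score}
index = {
    "python": {"Python tips": 3.0, "Python testing": 2.0},
    "te": {"Python testing": 2.0, "Ten things": 1.0},
}
counters = {}
print(search(index, counters, ["python", "te"]))
print(search(index, counters, ["python"]))
print(counters)
```

Only the two-word search goes through the intersection. A single word is a plain lookup.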

Basically, this is a benchmark of Redis more so than Go but it's quite telling that running it on multiple processors does help.

If you're curious, the benchmark code is here and I'm sure there are things one can do to speed it up even more. It's just interesting that it's so freakin' fast out of the box!

In particular I'm very pleased with the fact that it takes less than half a second to bulk load in over 1,000 documents.

(For context, I released Autocompeter.com last week and now I'm thinking about improvements)

I posted a question on Twitter about which highlighting formatting people prefer and got some interesting feedback. More about that later.

The piece of feedback that really got my attention came from my friend Honza Král.
He wondered whether the whole word should be highlighted instead of just the beginning of the word.

I've actually been thinking about that too but never got around to trying it out. Until now.

Before

After

What do you think?

I have the code in a branch and I'm still mulling it over. There's sort of a convention to just highlight based on what you've typed so far. I don't want to be too weird because when people don't feel familiar with something, they don't like what they see even if the new thing actually is better.
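The difference between the two styles is small in code. Here is a rough Python sketch of both (Autocompeter's real highlighting is JavaScript and isn't shown here). The function name, the whole_word flag and the use of <b> tags are assumptions, and HTML escaping is left out for brevity.

```python
import re


def highlight(title, typed, whole_word=False):
    """Wrap matches in <b> tags: just the typed prefix, or the whole word."""
    if not typed:
        return title
    # Match the typed text at the start of a word, plus the rest of that word.
    pattern = re.compile(r"\b(" + re.escape(typed) + r")(\w*)", re.IGNORECASE)

    def replace(match):
        prefix, rest = match.group(1), match.group(2)
        if whole_word:
            return "<b>" + prefix + rest + "</b>"
        return "<b>" + prefix + "</b>" + rest

    return pattern.sub(replace, title)


print(highlight("Python testing", "py"))                   # <b>Py</b>thon testing
print(highlight("Python testing", "py", whole_word=True))  # <b>Python</b> testing
```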

For Autocompeter I develop with gulp. It's like Grunt but better.

One thing I wanted was that when it does the src/autocompeter.js --> minify() --> dist/autocompeter.min.js step, it also puts a little preamble header into the minified file.

First I thought, since UglifyJS supports a --preamble option that that'd be the route to go. I didn't get very far.

Then I thought I had to write my own plugin. So I started reading the documentation about how to write a plugin and partially thinking "Oh I don't have time to do this" and also "Oh finally a chance to sit down and really understand how gulp plugins work". I was wrong. The documentation for writing plugins says:

"Your plugin shouldn't do things that other plugins are responsible for... ...It should not add headers, gulp-header does that"

Oh! So there is already a great plugin for this! Long story short; here's how I used it. The output is that the version number is now on the first line of autocompeter.min.js.

I'm starting to like gulp more and more. There's even a nice dedicated index of all available plugins.

One of the most constructive pieces of feedback I got when Autocompeter was on Hacker News was that when you type something with lots of predictable results the results overlay would "flicker".

E.g. you type "javascript" in a nice and steady pace and the overlay would shrink and grow and shrink and grow very rapidly.

The reason it happened was a bug in the javascript code that filtered results whilst waiting for the next AJAX request from the latest typed character. E.g. you type "ash" and the results come back with "ashley", "ashes", "Ashford". Then you add an "l" so now we start a new AJAX query for "ashl" and whilst waiting for that output from the server we can start filtering out "ashes" and "Ashford" because we can pre-emptively know that those won't be in the new result set.
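The pre-emptive filtering idea fits in a few lines. This is a simplified Python sketch rather than the actual JavaScript: the function name is made up, and it only handles results that start with the typed text, whereas a real autocomplete also matches on later words.

```python
def still_matching(results, typed):
    """Keep only the results that can still match the longer search term."""
    typed = typed.lower()
    return [r for r in results if r.lower().startswith(typed)]


# Results for "ash" are on screen and the user has just typed an "l".
print(still_matching(["ashley", "ashes", "Ashford"], "ashl"))  # ['ashley']
```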

The bug was a bad function that filtered the existing results on a second rendering whilst waiting for the next AJAX. It was easy to fix and this is included in version 1.1.8.

The reason I failed to notice this was that I had inserted some necessary optimizations for when the network latency is very slow but hadn't tested them in a realistic network latency environment. E.g. a decent DSL connection, or at least something more advanced than just connecting to localhost.