Peterbe.com

A blog and website by Peter Bengtsson

tl;dr Don't run ffmpeg over HTTP(S) and use ffmpegthumbnailer

UPDATE tl;dr Download the file then run ffmpeg with -ss HH:MM:SS first. Don't bother with ffmpegthumbnailer

At work I work on something called Air Mozilla. It's a site for hosting live video broadcasts and then archiving those so they can be retrieved later.

Unlike sites like YouTube we can't take a screencap from the video because many videos are future (aka. "upcoming") videos so instead we use a little placeholder thumbnail (for example, the Rust logo).

However, once it has been recorded we want to switch from the logo to an actual screen capture from the video itself. We set up a cronjob that uses ffmpeg to extract these as JPGs and then the users can go in and select whichever picture they like the best.

This is all work in progress by the way (as of December 2014).

One problem we have is that the command for extracting JPGs is really slow. So slow that we can't run the subprocess while holding a Django database connection open, because the connection is often killed before the command finishes.

The command to extract them looks something like this:

ffmpeg -i https://cdnexample.com/url/to/file.mp4 -r 0.0143 /tmp/screencaps-%02d.jpg

Where the number r is based on the duration and how many pictures we want out. E.g. 0.0143 = 15 / 1049 where 15 is how many JPGs we want and 1049 is the duration in seconds (17 minutes and 29 seconds).
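A rate in pictures per second is the number of pictures divided by the number of seconds. Here is a minimal Python sketch of that arithmetic. It is not the script used on Air Mozilla, and the function name frame_rate is made up for illustration.

```python
def frame_rate(num_pictures, duration_seconds):
    """Return a value for ffmpeg's -r option: the pictures per second
    needed to spread num_pictures evenly over the whole video."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return num_pictures / duration_seconds


# 17 minutes and 29 seconds expressed in seconds
duration = 17 * 60 + 29
rate = frame_rate(15, duration)
print("-r {:.4f}".format(rate))
```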

The script I used first was: ffmpeg1.sh

My first experiment was to try to extract one picture at a time, hoping that way, internally, ffmpeg might be able to optimize something.

The second script I used was: ffmpeg2.sh

The third alternative was to try ffmpegthumbnailer which is an intricate wrapper on ffmpeg and it has the benefit that you can produce slightly higher picture quality too.

The third script I used was: ffmpeg3.sh

Bar chart comparing the 3 different scripts
The timings for these three depend very much on the state of my DSL at the time.

For a video clip that is 17 minutes long, in a 138Mb mp4 file:

ffmpeg1.sh   2m0.847s
ffmpeg2.sh   11m46.734s
ffmpeg3.sh   0m29.780s

Clearly it's not efficient to do one screenshot at a time.
Because you can tell ffmpegthumbnailer not to reduce the picture quality, its output is heavier: the total weight of the JPGs produced by ffmpeg1.sh was 784Kb and the total weight from ffmpeg3.sh was 1.5Mb.

Just to try again, I ran a similar experiment with a 35 minute long, 890Mb mp4 file. And this time I didn't bother with ffmpeg2.sh. The results were:

ffmpeg1.sh   18m21.330s
ffmpeg3.sh   2m48.656s

So that means that using ffmpegthumbnailer is roughly 4 to 6.5 times faster than ffmpeg in these two runs. Huge difference!

And now, a curveball!

The reason for doing ffmpeg -i https://... was so that we don't have to first download the whole beast and run the command on a local file. However, in light of how much longer this takes, and my reluctance to install and depend on a new tool (ffmpegthumbnailer) across all servers, why not download the whole file and run the ffmpeg command locally?

So I download the file, which is slow because of my, currently, terrible home DSL. Then I run and time them again, but against a local file instead:

ffmpeg1.sh   0m20.426s
ffmpeg3.sh   0m0.635s

Did you see that!? That's an insane difference. Clearly doing this command over HTTP(S) is a bad idea. It'll be worth downloading it first.

UPDATE

On Stackoverflow, LordNeckBeard gave a great tip: put the -ss option before the input file. Now it's much faster. At this point, I'm no longer interested in having to bother with ffmpegthumbnailer.

Let's fork ffmpeg2.sh into two versions.

ffmpeg2.1.sh same as ffmpeg2.sh but a downloaded file instead of a remote HTTPS URL.

ffmpeg2.2.sh as ffmpeg2.1.sh except we put the -ss HH:MM:SS before the input file.
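The contents of ffmpeg2.2.sh aren't shown here, so the following is only a sketch of the idea in Python, under assumptions: one ffmpeg call per picture, -vframes 1 to grab a single frame, -y to overwrite existing files, and made-up function names and paths. The point is the argument order, with -ss placed before -i.

```python
def seconds_to_hhmmss(seconds):
    """Format a whole number of seconds as HH:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return "{:02d}:{:02d}:{:02d}".format(hours, minutes, secs)


def screencap_command(input_path, seconds, output_path):
    """Build an ffmpeg command that grabs one frame.

    -ss comes BEFORE -i so that ffmpeg seeks in the input file
    instead of decoding everything up to that position.
    """
    return [
        "ffmpeg", "-y",
        "-ss", seconds_to_hhmmss(seconds),
        "-i", input_path,
        "-vframes", "1",
        output_path,
    ]


def screencap_commands(input_path, duration, count, pattern):
    """One command per picture, evenly spread over the duration."""
    interval = duration / count
    return [
        screencap_command(input_path, int(i * interval), pattern % (i + 1))
        for i in range(count)
    ]


# Hypothetical local file; 1049 seconds and 15 pictures as in the example above
commands = screencap_commands("/tmp/file.mp4", 1049, 15, "/tmp/screencaps-%02d.jpg")
for command in commands:
    print(" ".join(command))
    # To actually run it: subprocess.check_call(command)
```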

Now, let's run them again on the 138Mb file:

# the 138Mb mp4.mp4 file
ffmpeg2.1.sh   2m10.898s
ffmpeg2.2.sh   0m0.672s

About 195 times faster

And I re-ran this against a bigger file that is 1.4Gb:

# the 1.4Gb mp4-1.44Gb.mp4 file
ffmpeg2.1.sh   10m1.143s
ffmpeg2.2.sh   0m1.428s

420 times faster
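To check a ratio like that, the timings can be converted to seconds and divided. A small Python sketch, with a made-up helper name parse_time, using the two timings for the 1.4Gb file:

```python
import re


def parse_time(value):
    """Convert a timing such as '10m1.143s' into seconds."""
    match = re.match(r"^(\d+)m(\d+(?:\.\d+)?)s$", value)
    if match is None:
        raise ValueError("unrecognized duration: %r" % value)
    return int(match.group(1)) * 60 + float(match.group(2))


slow = parse_time("10m1.143s")  # ffmpeg2.1.sh on the 1.4Gb file
fast = parse_time("0m1.428s")   # ffmpeg2.2.sh on the 1.4Gb file
print("%.1f times faster" % (slow / fast))
```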

You might have heard that AngularJS 1.3 has "one-time bindings" which is that you can print the value of a scope variable with {{ ::somevar }} and that this is really good for performance because it means that once rendered it doesn't add to the list of things that the angular app needs to keep worrying about. I.e. it's one less thing to watch.

But what's a good use case of this? This is a good example.

Because ng-if="true" will cause the DOM element to be re-created it will go back to the scope variable and re-evaluate it.

When I built hugepic.io one of the biggest challenges was resizing enormous images. Primarily JPEGs.

The way Hugepic works is that it chops up images into tiles, but before it can crop and chop up the tiles it needs to resize the image to a certain size. Say 1024x1024. Now this is really slow and it's so CPU intensive that if you try to parallelize it you end up causing so much "swappage" that the time it takes to resize two large images in parallel is more than it takes to do them one at a time.

The tool I found that was the best possible was ImageMagick's tool convert.

Now there's a new tool that is much faster: vipsthumbnail

There are more comprehensive benchmarks around the net, like this one for example, but here's a quick one to whet your appetite:

$ ls -lh 8/04/84c3e9.jpg
-rw-r--r--@ 1 peterbe  staff   253M Sep 16 12:00 8/04/84c3e9.jpg

$ time convert 8/04/84c3e9.jpg -resize 200 /tmp/converted-200.jpg
real    0m9.423s
user    0m8.893s
sys     0m0.521s

$ time vipsthumbnail 8/04/84c3e9.jpg -s 200x200 -o /tmp/vips.jpg
real    0m3.209s
user    0m3.051s
sys     0m0.138s

It supposedly has ports for Python but I'm quite happy to just subprocess out to the command. You can install it on OSX with brew install vips.
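Shelling out from Python could look something like this. It's a minimal sketch that reuses the same -s and -o options as the command timed above; the function names are made up and nothing here comes from Hugepic's actual code.

```python
import subprocess


def vips_command(source, destination, size=200):
    """Build the vipsthumbnail command line, mirroring the one timed above."""
    return [
        "vipsthumbnail", source,
        "-s", "{0}x{0}".format(size),
        "-o", destination,
    ]


def make_thumbnail(source, destination, size=200):
    """Run vipsthumbnail; raises CalledProcessError if it exits non-zero."""
    subprocess.check_call(vips_command(source, destination, size))


# Only prints the command; call make_thumbnail(...) to actually run it
print(" ".join(vips_command("8/04/84c3e9.jpg", "/tmp/vips.jpg")))
```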

By writing this I'm taking a risk of looking like an idiot who has failed to read the docs. So please be gentle.

AngularJS uses a promise module called $q. It originates from this beast of a project.

You use it like this for example:

angular.module('myapp')
.controller('MainCtrl', function($scope, $q) {
  $scope.name = 'Hello ';
  var wait = function() {
    var deferred = $q.defer();
    setTimeout(function() {
      // Reject 3 out of 10 times to simulate 
      // some business logic.
      if (Math.random() > 0.7) deferred.reject('hell');
      else deferred.resolve('world');
    }, 1000);
    return deferred.promise;
  };

  wait()
  .then(function(rest) {
    $scope.name += rest;
  })
  .catch(function(fallback) {
    $scope.name += fallback.toUpperCase() + '!!';
  });
});

Basically you construct a deferred object and return its promise. Then you can expect the .then and .catch to be called back if all goes well (or not).

There are other ways you can use it too but let's stick to the basics to drive home this point to come.

Then there's the $http module. It's where you do all your AJAX stuff and it's really powerful. However, it uses an abstraction of $q and because it is an abstraction it renames what it calls back. Instead of .then and .catch it's .success and .error and the arguments you get are different. Both expose a catch-all function called .finally. You can, if you want to, bypass this abstraction and do what the abstraction does yourself. So instead of:

$http.get('https://api.github.com/users/peterbe/gists')
.success(function(data) {
  $scope.gists = data;
})
.error(function(data, status) {
  console.error('Repos error', status, data);
})
.finally(function() {
  console.log("finally finished repos");
});

...you can do this yourself...:

$http.get('https://api.github.com/users/peterbe/gists')
.then(function(response) {
  $scope.gists = response.data;
})
.catch(function(response) {
  console.error('Gists error', response.status, response.data);
})
.finally(function() {
  console.log("finally finished gists");
});

It's like it's built specifically for doing HTTP stuff. The $q module doesn't know that the response body, the HTTP status code and the HTTP headers are important.

However, there's a big caveat. You might not always know you're doing AJAX stuff. You might be using a service from somewhere and you don't care how it gets its data. You just want it to deliver some data. For example, suppose you have an AJAX request cached so that only the first time it needs to do an HTTP GET but all subsequent times you can use the stuff already in memory. E.g. something like this:

angular.module('myapp')
.controller('MainCtrl', function($scope, $q, $http, $timeout) {

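  $scope.name = 'Hello ';
  // The cache lives outside getName() so it survives between calls.
  var name = null;
  var getName = function() {
    var deferred = $q.defer();
    if (name !== null) {
      // Already fetched; no need for another AJAX request.
      deferred.resolve(name);
    } else {
      $http.get('https://api.github.com/users/peterbe')
      .success(function(data) {
        name = data.name;
        deferred.resolve(name);
      }).error(deferred.reject);
    }
    return deferred.promise;
  };

  // Even though we're calling this 3 different times
  // you'll notice it only starts one AJAX request
  // (as long as the first one finishes before the next call).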
  $timeout(function() {
    getName().then(function(name) {
      $scope.name = "Hello " + name;
    });    
  }, 1000);

  $timeout(function() {
    getName().then(function(name) {
      $scope.name = "Hello " + name;
    });    
  }, 2000);

  $timeout(function() {
    getName().then(function(name) {
      $scope.name = "Hello " + name;
    });    
  }, 3000);
});

And with all the other promise frameworks lying around, like jQuery's, you will sooner or later forget if it's success() or then() or done() and your goldfish memory (like mine) will cause confusion and bugs.

So is there a way to make $http.<somemethod> return a $q like promise but with the benefit of the abstractions that the $http layer adds?

Here's one such possible solution maybe:

var app = angular.module('myapp');

app.factory('httpq', function($http, $q) {
  return {
    get: function() {
      var deferred = $q.defer();
      $http.get.apply(null, arguments)
      .success(deferred.resolve)
      .error(function(data, status) {
        // A $q promise passes along only one value,
        // so bundle the body and the status code together.
        deferred.reject({data: data, status: status});
      });
      return deferred.promise;
    }
  }
});

app.controller('MainCtrl', function($scope, httpq) {

  httpq.get('https://api.github.com/users/peterbe/gists')
  .then(function(data) {
    $scope.gists = data;
  })
  .catch(function(response) {
    console.error('Gists error', response.status, response.data);
  })
  .finally(function() {
    console.log("finally finished gists");
  });
});

That way you get the benefit of one and the same way of doing things for everything that gets you data one way or another, and you get the nice AJAXy signatures you like.

This is just a prototype and clearly it's not generic to work with any of the shortcut functions in $http like .post(), .put() etc. That can maybe be solved with a Proxy object or some other hack I haven't had time to think of yet.

So, what do you think? Am I splitting hairs or is this something attractive?

A common thing in many (AngularJS) apps is to have an ng-model input whose content is used as a filter on an ng-repeat somewhere within the page. Something like this:

<input ng-model="search">
<div ng-repeat="item in items | filter:search">...

Well, what if you want the search you make to automatically become part of the URL so that if you bookmark the search or copy the URL to someone else, the search is still there? It would be really practical. Granted, it's not always that you want this but that's something you can decide.

AngularJS 1.2 (I think) introduced the ability to set reloadOnSearch: false on a route provider and that means that you can do things like $location.hash('something') without it triggering the route provider to re-map the URL and re-start the relevant controller.

So here's a good example of (ab)using that to do a search filter which automatically updates the URL.

Check out the demo: http://www.peterbe.com/permasearch/index.html

This works in HTML5 mode too if you're wondering.

Suppose you use many more things in your filter function other than just a free text ng-model. Like this:

<input type="text" ng-model="filters.search">
<select ng-model="filters.year">
<option value="">All</option>
<option value="2014">2014</option>
<option value="2013">2013</option>
</select>

You might have some checkboxes and stuff too. All you need to do then is to encode that information in the hash. Something like this might be a good start:

$scope.filters = {};
$scope.$watchCollection('filters', function(value) {
    $location.hash($.param(value)); // a jQuery function
});

And something like this to "unparse" the params.

Here’s an example of unescaped & characters in the href attribute of an A tag.
http://jsfiddle.net/32zbogfw/ It’s working fine.

I know it might break XML and possibly XHTML but who uses that still?

Red. So what?
And I know an unescaped & in a href shows as red in the View Source color highlighting.

What can go wrong? Why is it important? Perhaps it used to be important in 2009 but that's no longer the case.

This all started because I was reviewing some code that uses Python's urllib.urlencode(...) and inserts the result into a Django template with href="{{ result_of_that_urlencode }}" which would mean you get un-escaped & characters. Then I tried to find how and why that is bad but couldn't find any examples of it.
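For reference, here is what the two variants look like side by side. This is a minimal sketch using the Python 3 standard library (urllib.parse.urlencode rather than the Python 2 urllib.urlencode mentioned above); the /search path and the parameter names are made up.

```python
from html import escape
from urllib.parse import urlencode

# The & inside the value is percent-encoded by urlencode,
# but the & that separates the two parameters is left as is.
query = urlencode({"q": "cats & dogs", "page": 2})
href = "/search?" + query

# Unescaped, as inserted straight into the attribute
print('<a href="{}">next</a>'.format(href))

# Escaped for HTML, where the separator becomes &amp;
print('<a href="{}">next</a>'.format(escape(href)))
```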

God, No! by Penn Jillette
A couple of months ago my wife went to see the Penn & Teller show in Las Vegas. Afterwards she stayed backstage to meet Penn, have a quick chat and sign a copy of his book. My wife said "My husband is going to be so jealous that I met you", to which Penn replied "Wanna make him really jealous? Grab my ass." Which she did. Haha!

I've been a long time fan of their show. I remember watching it when I was big enough to appreciate magic but had no idea what the jokes were and I thought they were just kinda dark and odd.

These guys do everything together but this book is all Penn. It's completely without a plot line other than, I guess, it goes through the 10 commandments in the bible and for each, tells a couple of stories that are somewhat related. Funny stories. Sexy stories. And very very personal stories.

Despite its title not that much of the book is about atheism. The prologue and the epilogue are though. In fact, the prologue was "mindblowingly" profound and well written. I was really impressed. There were so many interesting thoughts that I could quote the whole thing but instead I'm just going to quote this little piece:

Some will tell you "God is love" and then defy you not to believe in love. But, if X = Y, why have a fucking X? Just keep it at Y. Why call love god? Why not call love ... love? "Beauty is god." Okay. If you change what the word means, you can get me to say I believe in it. Say "God is bacon" or "God is tits" and I'll love and praise god, but you're just changing the word, not the idea.

Funny! And I'd never thought of that as a rebuttal.

I used to be an atheist and was almost militant about it, meaning I was prone to proclaim it loudly in hope of convincing people. I am no longer an atheist. Partly that's because I've come to understand two things: Preaching for the negative is a paradoxical oxymoron. Secondly, I have new-found respect and admiration for church as a community.

Which brings me to conclude with my final thought: After reading this atheism proclamation I am now even less atheist. The more arguments Penn makes the less I believe in atheism. Strange.

I guess I can say "God is leaving people to make up their own minds". Which means I can say: "Leaving people to make up their own minds is leaving people to make up their own minds".

But I did enjoy many of the stories in the book. You might too.

Before trolls get inspired let me start with this: EC2 is awesome!

But, wanna know what's also awesome?: Digital Ocean

The reason I switched was two-fold: A) money and B) curiosity.

As part of a very generous special friendship I got a "m1.large" for free. That deal had to come to an end so I had to start paying that myself. It was well over $100 per month. I have about 10 servers running on that machine hovering around 3+Gb of RAM.

So I thought this is an excuse to do some spring cleaning and then switch to this newfangled Digital Ocean which is all SSD drives, got good reviews and has a fixed cost per month. First I decommissioned some servers, and some sites that used to have multiple processes were reduced to just a single process. Now I got everything down to a steady 2+Gb.

I decided to splash out a bit and I went for the $40/month option which is 4GB, 2 core, 60GB SSD and 4TB transfer. Setting up all the servers on this new Ubuntu 14.04 was relatively easy (thank you pip freeze and rsync!).

So far, I have to say I'm wildly impressed. The interface is gorgeous. It's easy to do everything. I love that the price is fixed. That suits me more than whatever corporations might care about; I'm just little old me.

If you get inspired to try it out please use my referral code. Then you get $10 free credit: https://www.digitalocean.com/?refcode=9c9126b69f33

So recently, I moved home for this blog. It used to be on AWS EC2 and is now on Digital Ocean. I wanted to start from scratch so I started on a blank new Ubuntu 14.04 and later rsync'ed over all the data bit by bit (no pun intended).

When I moved this site I copied the /etc/uwsgi/apps-enabled/peterbecom.ini file and started it with /etc/init.d/uwsgi start peterbecom. The settings were the same as before:

# this is /etc/uwsgi/apps-enabled/peterbecom.ini
[uwsgi]
virtualenv = /var/lib/django/django-peterbecom/venv
pythonpath = /var/lib/django/django-peterbecom
user = django
master = true
processes = 3
env = DJANGO_SETTINGS_MODULE=peterbecom.settings
module = django_wsgi2:application

But I kept getting this error:

Traceback (most recent call last):
...
  File "/var/lib/django/django-peterbecom/venv/local/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py", line 182, in _cursor
    self.connection = Database.connect(**conn_params)
  File "/var/lib/django/django-peterbecom/venv/local/lib/python2.7/site-packages/psycopg2/__init__.py", line 164, in connect
    conn = _connect(dsn, connection_factory=connection_factory, async=async)
psycopg2.OperationalError: FATAL:  Peer authentication failed for user "django"

What the heck! I thought. I was able to connect perfectly fine with the same config on the old server and here on the new server I was able to do this:

django@peterbecom:~/django-peterbecom$ source venv/bin/activate
(venv)django@peterbecom:~/django-peterbecom$ ./manage.py shell
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from peterbecom.apps.plog.models import *
>>> BlogItem.objects.all().count()
1040

Clearly I've set the right password in the settings/local.py file. In fact, I haven't changed anything and I pg_dump'ed the data over from the old server as is.

I edited the file psycopg2/__init__.py and added a print "DSN=", dsn and those details were indeed correct.
I'm running the uwsgi app as user django and I'm connecting to Postgres as user django.

Anyway, what I needed to do to make it work was the following change:

# this is /etc/uwsgi/apps-enabled/peterbecom.ini
[uwsgi]
virtualenv = /var/lib/django/django-peterbecom/venv
pythonpath = /var/lib/django/django-peterbecom
user = django
# THIS IS ADDED
uid = django
master = true
processes = 3
env = DJANGO_SETTINGS_MODULE=peterbecom.settings
module = django_wsgi2:application

The difference here is the added uid = django.

I guess by moving across (I'm currently on uwsgi 1.9.17.1-debian) I get a newer version of uwsgi or something that simply can't just take the user directive but needs the uid directive too. That or something else complicated to do with the users and permissions that I don't understand.

Hopefully, by having blogged about this other people might find it and get themselves a little productivity boost.

If you do things with the Django ORM and want an audit trail of all changes you have two options:

  1. Insert some cleverness into a pre_save signal that writes down all changes some way.

  2. Use eventlog and manually log things in your views.

(you have other options too but I'm trying to make a point here)

eventlog is almost embarrassingly simple. It's basically just a model with three fields:

  • User
  • An action string
  • A JSON dump field
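To make that shape concrete outside Django, here is a rough plain-Python sketch. It is not eventlog's actual implementation: the LogEntry class and the in-memory ENTRIES list are made up for illustration, and the timestamp field is an assumption.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LogEntry:
    user: str      # who did it
    action: str    # e.g. 'mymodel.create'
    extra: dict    # anything JSON compatible
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


# Stand-in for a database table
ENTRIES = []


def log(user, action, extra=None):
    """Record an action; refuse anything that can't be dumped as JSON."""
    extra = extra or {}
    json.dumps(extra)  # raises TypeError if not JSON compatible
    entry = LogEntry(user=user, action=action, extra=extra)
    ENTRIES.append(entry)
    return entry


entry = log("peterbe", "mymodel.create", {"id": 1, "name": "Example"})
print(entry.action, json.dumps(entry.extra))
```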

You use it like this:

from eventlog.models import log

def someview(request):
    if request.method == 'POST':
        form = SomeModelForm(request.POST)
        if form.is_valid():
            new_thing = form.save()
            log(request.user, 'mymodel.create', {
                'id': new_thing.id,
                'name': new_thing.name,
                # You can put anything JSON 
                # compatible in here
            })
            return redirect('someotherview')
    else:
        form = SomeModelForm()
    return render(request, 'view.html', {'form': form})

That's all it does. You then have to do something with it. Suppose you have an admin page that only privileged users can see. You can make a simple table/dashboard with these like this:

from eventlog.models import Log  # Log the model, not log the function

def all_events(request):
    all = Log.objects.all()
    return render(request, 'all_events.html', {'all': all})

And something like this in all_events.html:

<table>
  <tr>
    <th>Who</th><th>When</th><th>What</th><th>Details</th>
  </tr>
  {% for event in all %}
  <tr>
    <td>{{ event.user.username }}</td>
    <td>{{ event.timestamp | date:"D d M Y" }}</td>
    <td>{{ event.action }}</td>
    <td>{{ event.extra }}</td>
  </tr>
  {% endfor %}
</table>

What I like about it is that it's very deliberate. By putting it into views at very specific points you're making it an audit log of actions, not of data changes.

Projects with overly complex model save signals tend to dig themselves into holes that make things slow and complicated. And it's not unrealistic that you'll then record events that aren't particularly important to review. For example, a cron job that increments a little value or something. It's more interesting to see what humans have done.

I just wanted to thank the Eldarion guys for eventlog. It's beautifully simple and works perfectly for me.