Peterbe.com

A blog and website by Peter Bengtsson


tl;dr Don't run ffmpeg over HTTP(S) and use ffmpegthumbnailer

UPDATE tl;dr Download the file then run ffmpeg with -ss HH:MM:SS first. Don't bother with ffmpegthumbnailer

At work I work on something called Air Mozilla. It's a site for hosting live video broadcasts and then archiving those so they can be retrieved later.

Unlike sites like YouTube we can't take a screencap from the video because many videos are future (aka. "upcoming") videos so instead we use a little placeholder thumbnail (for example, the Rust logo).

However, once it has been recorded we want to switch from the logo to an actual screen capture from the video itself. We set up a cronjob that uses ffmpeg to extract these as JPGs and then the users can go in and select whichever picture they like the best.

This is all work in progress by the way (as of December 2014).

One problem we have is that the command for extracting JPGs is really slow. So slow that we can't keep a Django database connection open around the subprocess, because the connection is often killed before the command finishes.

The command to extract them looks something like this:

ffmpeg -i https://cdnexample.com/url/to/file.mp4 -r 0.0143 /tmp/screencaps-%02d.jpg

Where the number r is based on the duration and how many pictures we want out. E.g. 0.0143 = 15 / 1049 where 15 is how many JPGs we want and 1049 is a duration of 17 minutes and 29 seconds, in seconds.
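To make the arithmetic concrete, here is a minimal sketch of how that -r value can be derived. The function name is made up for illustration; only the numbers come from the example above.

```python
def frame_rate_for(num_pictures, duration_seconds):
    """Value for ffmpeg's -r option so that roughly `num_pictures`
    frames are emitted over the whole duration of the video."""
    return num_pictures / duration_seconds


duration = 17 * 60 + 29  # 17 minutes and 29 seconds, in seconds
rate = frame_rate_for(15, duration)
print(duration)        # 1049
print(round(rate, 4))  # 0.0143
```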

The script I used first was: ffmpeg1.sh

My first experiment was to try to extract one picture at a time, hoping that way, internally, ffmpeg might be able to optimize something.

The second script I used was: ffmpeg2.sh

The third alternative was to try ffmpegthumbnailer which is an intricate wrapper on ffmpeg and it has the benefit that you can produce slightly higher picture quality too.

The third script I used was: ffmpeg3.sh

Bar chart comparing the 3 different scripts
The timings for these three depend very much on the state of my DSL at the time.

Here are the results for a video clip that is 17 minutes long, a 138Mb mp4 file:

ffmpeg1.sh   2m0.847s
ffmpeg2.sh   11m46.734s
ffmpeg3.sh   0m29.780s

Clearly it's not efficient to do one screenshot at a time.
Because ffmpegthumbnailer lets you tell it not to reduce the picture quality, its output is heavier: the total weight of the JPGs produced by ffmpeg1.sh was 784Kb and the total weight from ffmpeg3.sh was 1.5Mb.

Just to try again, I ran a similar experiment with a 35 minutes long and 890Mb mp4 file. And this time I didn't bother with ffmpeg2.sh. The results were:

ffmpeg1.sh   18m21.330s
ffmpeg3.sh   2m48.656s

So that means that using ffmpegthumbnailer is roughly 4 to 6.5 times faster than ffmpeg across these two runs. Huge difference!

And now, a curveball!

The reason for doing ffmpeg -i https://... was so that we don't have to first download the whole beast and run the command on a local file. However, in light of how much longer this takes, and my disdain for having to install and depend on a new tool (ffmpegthumbnailer) across all servers, why not download the whole file and run the ffmpeg command locally?

So I downloaded the file, which was slow because of my, currently, terrible home DSL. Then I ran and timed them again, but against a local file instead:

ffmpeg1.sh   0m20.426s
ffmpeg3.sh   0m0.635s

Did you see that!? That's an insane difference. Clearly doing this command over HTTP(S) is a bad idea. It'll be worth downloading it first.

UPDATE

On Stackoverflow, LordNeckBeard gave a great tip: put the -ss option before the input file. Now it's much faster. At this point, I'm no longer interested in having to bother with ffmpegthumbnailer.

Let's fork ffmpeg2.sh into two versions.

ffmpeg2.1.sh same as ffmpeg2.sh but a downloaded file instead of a remote HTTPS URL.

ffmpeg2.2.sh as ffmpeg2.1.sh except we put the -ss HH:MM:SS before the input file.
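The scripts themselves aren't reproduced here, so the following is only a sketch of the difference between the two variants, under the assumption that each picture is extracted with -vframes 1 at an evenly spaced timestamp. The function names, and the way the timestamps are spaced, are made up for illustration. With -ss after -i, ffmpeg decodes and discards everything up to the timestamp; with -ss before -i, it seeks in the input first.

```python
def seek_timestamps(num_pictures, duration_seconds):
    """Evenly spaced HH:MM:SS timestamps, skipping the very start and end."""
    step = duration_seconds / (num_pictures + 1)
    stamps = []
    for i in range(1, num_pictures + 1):
        seconds = int(i * step)
        hours, remainder = divmod(seconds, 3600)
        minutes, secs = divmod(remainder, 60)
        stamps.append("%02d:%02d:%02d" % (hours, minutes, secs))
    return stamps


def slow_command(path, timestamp, output):
    # -ss AFTER the input: decode from the start and throw frames away
    return ["ffmpeg", "-i", path, "-ss", timestamp, "-vframes", "1", output]


def fast_command(path, timestamp, output):
    # -ss BEFORE the input: seek in the input, then decode one frame
    return ["ffmpeg", "-ss", timestamp, "-i", path, "-vframes", "1", output]
```

Each returned list can be handed to subprocess.check_call once per timestamp.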

Now, let's run them again on the 138Mb file:

# the 138Mb mp4.mp4 file
ffmpeg2.1.sh   2m10.898s
ffmpeg2.2.sh   0m0.672s

About 195 times faster

And I re-ran this against a bigger file that is 1.4Gb:

# the 1.4Gb mp4-1.44Gb.mp4 file
ffmpeg2.1.sh   10m1.143s
ffmpeg2.2.sh   0m1.428s

420 times faster

When I built hugepic.io one of the biggest challenges was the resizing of enormous images. Primarily JPEGs.

The way Hugepic works is that it chops up images into tiles, but before it can crop and chop up the tiles it needs to resize the image to a certain size. Say 1024x1024. Now this is really slow and it's so CPU intensive that if you try to parallelize it you end up causing so much "swappage" that the time it takes to resize two large images in parallel is more than it takes to do them one at a time.

The tool I found that was the best possible was ImageMagick's tool convert.

Now there's a new tool that is much faster: vipsthumbnail

There are more comprehensive benchmarks around the net, like this one for example, but here's a quick one to whet your appetite:

$ ls -lh 8/04/84c3e9.jpg
-rw-r--r--@ 1 peterbe  staff   253M Sep 16 12:00 8/04/84c3e9.jpg

$ time convert 8/04/84c3e9.jpg -resize 200 /tmp/converted-200.jpg
real    0m9.423s
user    0m8.893s
sys     0m0.521s

$ time vipsthumbnail 8/04/84c3e9.jpg -s 200x200 -o /tmp/vips.jpg
real    0m3.209s
user    0m3.051s
sys     0m0.138s

It supposedly has bindings for Python but I'm quite happy to just shell out to the command with a subprocess. You can install it on OSX with brew install vips.
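Shelling out from Python could look roughly like this. It's a minimal sketch that reuses the exact flags from the timing above; the function names are hypothetical and it assumes vipsthumbnail is on the PATH.

```python
import subprocess


def vipsthumbnail_command(source, destination, size=200):
    """Build the same command line as in the benchmark above."""
    return [
        "vipsthumbnail", source,
        "-s", "%dx%d" % (size, size),
        "-o", destination,
    ]


def make_thumbnail(source, destination, size=200):
    # Raises subprocess.CalledProcessError if vipsthumbnail exits non-zero
    subprocess.check_call(vipsthumbnail_command(source, destination, size))
```

Calling make_thumbnail("8/04/84c3e9.jpg", "/tmp/vips.jpg") would then run the same command as the shell session above.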

Before trolls get inspired let me start with this: EC2 is awesome!

But, wanna know what's also awesome?: Digital Ocean

The reason I switched was two-fold: A) money and B) curiosity.

As part of a very generous special friendship I got a "m1.large" for free. That deal had to come to an end so I had to start paying that myself. It was well over $100 per month. I have about 10 servers running on that machine hovering around 3+Gb of RAM.

So I thought this is an excuse to do some spring cleaning and then switch to this newfangled Digital Ocean which is all SSD drives, got good reviews and has a fixed cost per month. First I decommissioned some servers and some sites that used to have multiple processes were reduced to just a single process. Now I got everything down to a steady 2+Gb.

I decided to splash out a bit and I went for the $40/month option which is 4GB, 2 core, 60GB SSD and 4TB transfer. Setting up all the servers on this new Ubuntu 14.04 was relatively easy (thank you pip freeze and rsync!).

So far, I have to say I'm wildly impressed. The interface is gorgeous. It's easy to do everything. I love that the price is fixed. That probably suits me more than it would matter to corporations, but I'm just little old me.

If you get inspired to try it out please use my referral code. Then you get $10 free credit: https://www.digitalocean.com/?refcode=9c9126b69f33

So recently, I moved home for this blog. It used to be on AWS EC2 and is now on Digital Ocean. I wanted to start from scratch so I started on a blank new Ubuntu 14.04 and later rsync'ed over all the data bit by bit (no pun intended).

When I moved this site I copied the /etc/uwsgi/apps-enabled/peterbecom.ini file and started it with /etc/init.d/uwsgi start peterbecom. The settings were the same as before:

# this is /etc/uwsgi/apps-enabled/peterbecom.ini
[uwsgi]
virtualenv = /var/lib/django/django-peterbecom/venv
pythonpath = /var/lib/django/django-peterbecom
user = django
master = true
processes = 3
env = DJANGO_SETTINGS_MODULE=peterbecom.settings
module = django_wsgi2:application

But I kept getting this error:

Traceback (most recent call last):
...
  File "/var/lib/django/django-peterbecom/venv/local/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py", line 182, in _cursor
    self.connection = Database.connect(**conn_params)
  File "/var/lib/django/django-peterbecom/venv/local/lib/python2.7/site-packages/psycopg2/__init__.py", line 164, in connect
    conn = _connect(dsn, connection_factory=connection_factory, async=async)
psycopg2.OperationalError: FATAL:  Peer authentication failed for user "django"

What the heck! I thought. I was able to connect perfectly fine with the same config on the old server and here on the new server I was able to do this:

django@peterbecom:~/django-peterbecom$ source venv/bin/activate
(venv)django@peterbecom:~/django-peterbecom$ ./manage.py shell
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from peterbecom.apps.plog.models import *
>>> BlogItem.objects.all().count()
1040

Clearly I've set the right password in the settings/local.py file. In fact, I haven't changed anything and I pg_dump'ed the data over from the old server as is.

I even edited the file psycopg2/__init__.py and added a print "DSN=", dsn and those details were indeed correct.
I'm running the uwsgi app as user django and I'm connecting to Postgres as user django.

Anyway, what I needed to do to make it work was the following change:

# this is /etc/uwsgi/apps-enabled/peterbecom.ini
[uwsgi]
virtualenv = /var/lib/django/django-peterbecom/venv
pythonpath = /var/lib/django/django-peterbecom
user = django
# THIS IS ADDED
uid = django
master = true
processes = 3
env = DJANGO_SETTINGS_MODULE=peterbecom.settings
module = django_wsgi2:application

The difference here is the added uid = django.

I guess by moving across (I'm currently on uwsgi 1.9.17.1-debian) I get a newer version of uwsgi or something that simply can't just take the user directive but needs the uid directive too. That or something else complicated to do with the users and permissions that I don't understand.
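For what it's worth, Postgres peer authentication compares the operating-system user that owns the connecting process with the database user name, so one quick thing to check is which user the uwsgi workers actually run as. A minimal sketch of that check, with helper names made up for illustration (Unix only):

```python
import os
import pwd


def effective_username():
    """Name of the OS user the current process is effectively running as."""
    return pwd.getpwuid(os.geteuid()).pw_name


def peer_auth_would_match(db_user, os_user=None):
    """Peer authentication only succeeds when the OS user name and the
    database user name are the same."""
    if os_user is None:
        os_user = effective_username()
    return os_user == db_user
```

Logging effective_username() from inside the WSGI app would show whether the workers really run as django.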

Hopefully, by having blogged about this other people might find it and get themselves a little productivity boost.

I, multiple times per day, find myself wanting to find out what headers I get back on a URL but I don't care about the response payload. The command to use then is:

curl -v http://www.peterbe.com/ > /dev/null

That'll print out all the headers sent and received. Nice and crisp.

So because I type this every day I made it into a shortcut script

cd ~/bin
echo '#!/bin/bash
> set -x
> curl -v "$@" > /dev/null
> ' > c
chmod +x c

If it's not clear what the code looks like, it's this:

#!/bin/bash
set -x
curl -v "$@" > /dev/null

Now I can just type:

c http://www.peterbe.com

Or if I want to add some extra request headers for example:

c -H 'User-Agent: foobar' http://www.peterbe.com

I just learned a really good bash trick which is something I've wanted to have but didn't really appreciate that it was possible so I never even searched for it.

set -ex

Ok, one thing at a time.

set -e

What this does, at the top of your bash script is that it exits as soon as any line in the bash script fails.
Suppose you have a script like this:

git pull origin master
find . | grep '\.pyc$' | xargs rm
./restart_server.sh

If the first line fails you don't want the second line to execute and you don't want the third line to execute either. The naive solution is to "and" them:

git pull origin master && find . | grep '\.pyc$' | xargs rm && ./restart_server.sh

but now it's just getting silly. (and is it even working?)

What set -e does is that it exits if any of the lines fail.

set -x

What this does is that it prints each command that is going to be executed with a little plus.
The output can look something like this:

+ rm -f pg_all.sql pg_all.sql.gz
+ pg_dumpall
+ apack pg_all.sql.gz pg_all.sql
++ date +%A
+ s3cmd put --reduced-redundancy pg_all.sql.gz s3://db-backups-peterbe/Sunday/
pg_all.sql.gz -> s3://db-backups-peterbe/Sunday/pg_all.sql.gz  [part 1 of 2, 15MB]
 15728640 of 15728640   100% in    0s    21.22 MB/s  done
pg_all.sql.gz -> s3://db-backups-peterbe/Sunday/pg_all.sql.gz  [part 2 of 2, 14MB]
 14729510 of 14729510   100% in    0s    21.50 MB/s  done
+ rm pg_all.sql pg_all.sql.gz

...when the script looks like this:

#!/bin/bash
set -ex
rm -f pg_all.sql pg_all.sql.gz
pg_dumpall > pg_all.sql
apack pg_all.sql.gz pg_all.sql
s3cmd put --reduced-redundancy pg_all.sql.gz s3://db-backups-peterbe/`date +%A`/
rm pg_all.sql pg_all.sql.gz

And to combine these two gems you simply put set -ex at the top of your bash script.
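If you drive your commands from Python instead of bash, the same two behaviours are easy to mimic. This is just a rough sketch of the idea (the function name is made up): echo each command with a little plus before running it, and stop at the first one that fails.

```python
import shlex
import subprocess
import sys


def run_steps(commands):
    """Run each command in order. Echo it first (like set -x) and stop at
    the first non-zero exit code (like set -e).

    Returns the commands that were started and the final exit code."""
    executed = []
    for command in commands:
        line = " ".join(shlex.quote(part) for part in command)
        print("+ " + line, file=sys.stderr)
        executed.append(command)
        returncode = subprocess.call(command)
        if returncode != 0:
            return executed, returncode
    return executed, 0
```

For example, run_steps([["git", "pull", "origin", "master"], ["./restart_server.sh"]]) would never reach the restart if the pull fails.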

Thanks @bramwelt for showing me this one.

I have and have had many sites that I run. They're all some form of side-project.

What they almost all have in common is two things

  1. They have very little traffic (thus not particularly mission critical)
  2. I run everything on one server (no need for "spinning up" new VMs here and there)

Many many years ago, when current interns I work with were mere babies, I started a very simple "procedure".

  1. On the server, in the user directory where the site is deployed, I write a script called something like upgrade_myproject.sh which is executable and does what the name of the script is: it upgrades the site.

  2. In the server's root home directory I write a script called restart_myproject.sh which also does exactly what the name of the script is: it restarts the service.

  3. On my laptop, in my ~/bin directory I create a script called UpgradeMyproject.sh (*) which runs upgrade_myproject.sh on the server and runs restart_myproject.sh also on the server.

And here is, if I may say so, the cleverness of this; I use ssh to execute these scripts remotely by simply piping the commands to ssh. For example:

#!/bin/bash
echo "./upgrade_generousfriends.sh" | ssh -A django@ec2-54-235-210-62.compute-1.amazonaws.com
echo "./restart_generousfriends.sh" | ssh root@ec2-54-235-210-62.compute-1.amazonaws.com

That's an example I use for Wish List Granted.
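The same trick works from Python too, if you ever want it: feed the script name to ssh on standard input, just like echo ... | ssh does. A minimal sketch, where the function names and the example.com host are placeholders rather than anything I actually run:

```python
import subprocess


def deployment_plan(project, host):
    """The two remote steps: upgrade as the deploy user (with agent
    forwarding, like ssh -A), then restart as root."""
    return [
        ("django@" + host, "./upgrade_%s.sh" % project, True),
        ("root@" + host, "./restart_%s.sh" % project, False),
    ]


def ssh_command(target, forward_agent):
    command = ["ssh"]
    if forward_agent:
        command.append("-A")
    command.append(target)
    return command


def deploy(project, host):
    for target, script, forward_agent in deployment_plan(project, host):
        # Pipe the script name into ssh's stdin, like `echo "..." | ssh ...`
        subprocess.run(
            ssh_command(target, forward_agent),
            input=(script + "\n").encode(),
            check=True,  # stop if a remote step fails
        )
```

deploy("generousfriends", "example.com") would run the upgrade script and then the restart script, in that order.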

This works so darn well, and has done for years, that this is why I've never really learned to use more advanced tools like Fabric, Salt, Puppet, Chef or <insert latest deployment tool name>.

This means that all I need to do to run a deployment is type UpgradeMyproject.sh[ENTER] and the simple little bash script takes care of everything else.

The reason I keep these on the server and not on my laptop is simply because that's where they naturally belong and if I'm ssh'ed in and mess around I don't have to exit out to re-run them.

Here's an example of the upgrade_generousfriends.sh I use for Wish List Granted:

#!/bin/bash
cd generousfriends
source venv/bin/activate
git pull origin master
find . | grep '\.pyc$' | xargs rm -f
pip install -r requirements/prod.txt
./manage.py syncdb --noinput
./manage.py migrate webapp.main
./manage.py collectstatic --noinput
./manage.py compress --force
echo "Restart must be done by root"

I hope that, by blogging about this, someone else sees that it doesn't really have to be that complicated. It's not rocket science and most complex tools are only really needed when you have a significantly bigger scale in terms of people- and skill-complexity.

In conclusion

Keep it simple.

(*) The reason for the capitalization of my scripts is also an old habit. I use that habit to differentiate my scripts from stuff I install from third parties.

Now that I've moved hugepic.io over to my new EC2 server, I have all my working sites on one server.

If I list all sites in /etc/nginx/sites-enabled/ I count 14 sites. This blog being one of many. More listed here.

All but one of these services are Python. One is a Node server. About half of the Python services are Django and the other half are Tornado. There are four persistent databases (Postgres, Redis, Memcache, MongoDB) and two message queues (RabbitMQ and Python RQ).

I have this little script called ps_mem.py which does a decent job summarizing how much memory all of these take. Its output currently looks like this:

 Private  +   Shared  =  RAM used   Program
 ...
  6.5 MiB +  27.3 MiB =  33.7 MiB   postgres (5)
 40.1 MiB +  58.0 KiB =  40.1 MiB   memcached
 54.7 MiB +  37.5 KiB =  54.7 MiB   redis-server
 72.2 MiB + 849.0 KiB =  73.1 MiB   mongod
 82.4 MiB +   1.5 MiB =  83.9 MiB   rqworker (10)
605.6 MiB + 350.9 MiB = 956.5 MiB   python (61)
  1.9 GiB +  51.2 MiB =   2.0 GiB   uwsgi-core (26)
---------------------------------
                          3.3 GiB                       

It's sorted by "RAM used" and I'm only showing the bottom 7 here.
Anyway, 3.3 Gb to run 14 sites isn't bad. All through one Nginx (which only uses 10Mb by the way).

The server is a Debian 7 on a reserved Large instance. I'll try to post an update later about this server with more details. I have a lot of work to do to set up all monitoring and backups for all these things.

(if you're wondering what you're doing here, jed is a hardcore text based editor for programmers)

Thanks to fellow Jed user and hacker Ullrich Horlacher I can now have local settings per directory.

I personally prefer 2 spaces in my Javascript. And thankfully most projects I work on agree with that standard. However, I have one Mozilla project I work on which uses 4 spaces for indentation. So, what I've had to get used to is editing my ~/.jedrc every time I switch to work on that particular project. I change: variable C_INDENT = 2; to variable C_INDENT = 4; and then back again when switching to another project.

No more of that. Now I just add a file into the project root like this:

$ cd dev/airmozilla
$ cat .jed.sl
variable C_INDENT = 4;

And whenever I work on any file in that tree it applies the local override setting.

Here's how you can do that too:

First, put this code into your <your jed lib>/defaults.sl: (on my OSX, the jed lib is /usr/local/Cellar/jed/0.99-19/jed/lib/)

% load .jed.sl from current or parent directories
% but only if the user is the same
define load_local_config() {
  variable dir = getcwd();
  variable uid = getuid;
  variable jsl,st;
  while (dir != "/" and strlen(dir) > 1) {
    st = stat_file(dir);
    if (st == NULL) return;
    if (st.st_uid != uid) return;
    jsl = dir + "/.jed.sl";
    st = stat_file(jsl);
    if (st != NULL) {
      if (st.st_uid == uid) {
        pop(evalfile(jsl));
        return;
      }
    }
    dir = path_dirname(dir);
  }
}
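If S-Lang isn't your thing, here is the same lookup written as a Python sketch, purely to show the logic: walk from the current directory up towards the root, give up as soon as a directory isn't owned by you, and use the first .jed.sl that is owned by you. The function name is made up and this is not part of the jed setup.

```python
import os


def find_local_config(start_dir, filename=".jed.sl", uid=None):
    """Return the path of the nearest `filename` in start_dir or one of its
    parents, or None. Only directories and files owned by `uid` count."""
    if uid is None:
        uid = os.getuid()
    directory = os.path.abspath(start_dir)
    # os.path.dirname("/") == "/", so this stops at the filesystem root
    while directory != os.path.dirname(directory):
        try:
            if os.stat(directory).st_uid != uid:
                return None
        except OSError:
            return None
        candidate = os.path.join(directory, filename)
        try:
            owner = os.stat(candidate).st_uid
        except OSError:
            owner = None  # no such file here, keep walking up
        if owner == uid:
            return candidate
        directory = os.path.dirname(directory)
    return None
```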

Then add this to the bottom of your ~/.jedrc:

define startup_hook() {
  load_local_config(); % .jed.sl
}

Now, go into a directory where you want to make local settings, create a file called .jed.sl and fill it to your heart's content!

I'm quite fond of hastebin.com. It's fast. It's reliable. And it's got nice keyboard shortcuts that work for my taste.

So, I created a little program to quickly throw things into hastebin. You can have one too:

First create ~/bin/hastebinit and paste in:

#!/usr/bin/python
# Note: this is Python 2 code (urllib2 and the print statement)

import json
import os
import sys
import urllib2

URL = 'http://hastebin.com/documents'

def run(*args):
    if args:
        # One paste per file named on the command line
        content = [open(x).read() for x in args]
        extensions = [os.path.splitext(x)[1] for x in args]
    else:
        # No file arguments: paste whatever arrives on stdin
        content = [sys.stdin.read()]
        extensions = [None]

    for i, each in enumerate(content):
        # Passing data to Request makes this a POST
        req = urllib2.Request(URL, each)
        response = urllib2.urlopen(req)
        the_page = response.read()
        key = json.loads(the_page)['key']
        url = "http://hastebin.com/%s" % key
        if extensions[i]:
            # Append the file's extension to the paste URL
            url += extensions[i]
        print url

if __name__ == '__main__':
    sys.exit(run(*sys.argv[1:]))

Then run: chmod +x ~/bin/hastebinit

Now you can do things like:

$ cat ~/myfile | hastebinit
$ hastebinit < ~/myfile
$ hastebinit ~/myfile myotherfile

Hopefully it'll one day help at least one more soul out there!