Peterbe.com

A blog and website by Peter Bengtsson


It's the old problem of "Do I seek permission or ask for forgiveness?". It's rarely easy to know which one to use in Python because working with exceptions in Python is so damn easy.

Generally I prefer neither. I.e. just do. Don't write defensive code if you don't have to. Only seek permission or ask for forgiveness if you expect failures to happen and that's considered normal.

Consider the following three functions:

import math

PI = math.pi  # PI isn't defined in the snippet; math.pi is assumed here


def f0(x):
    return PI / x


def f1(x):
    if x != 0:
        return PI / x
    else:
        return -1


def f2(x):
    try:
        return PI / x
    except ZeroDivisionError:
        return -1

Which one do you think is the fastest? If I run this 1,000,000 times and never pass in x=0, will it make any difference?

Before you look at it, what do you think the result will be?


The answer is below.


Read on.


Scroll down for the results.


Have you made a guess yet?


What do you think it's going to be?


Scroll some more.


Almost there!


Ok, the results are as follows when running each of the above-mentioned functions ~33,000,000 times on my MacBook:

f0 4.16087803245
f1 4.84187698364
f2 4.73760977387
(smaller is better)

Conclusion: the difference is minuscule. The fastest is to not do any exception handling or condition checking at all, but generally there's no big difference.

This test was done with Python 2.7.9. You can try the code for yourself.
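If you want to try it yourself, here's a rough sketch of such a timing harness (not the author's exact script) using timeit, assuming f0, f1 and f2 are defined as above:

import timeit

for name in ('f0', 'f1', 'f2'):
    seconds = timeit.timeit(
        '%s(10)' % name,  # x is never 0, so no exception is ever raised
        setup='from __main__ import %s' % name,
        number=1000000,
    )
    print('%s %.3f' % (name, seconds))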

Just one more thought

As I wrote this post I started thinking more and more about the "code style aspect" rather than the performance.

Basically, I think it boils down to the following rules:

  1. If you're working with external I/O (e.g. the network or a database) use the "ask for forgiveness" approach (aka. exception wrapping). I.e. don't do if requests.head(url).status_code == 200: stuff = requests.get(url) (see the sketch after this list).

  2. If you want to make a really user-friendly Python API, use the "seek permission" approach (aka. if-statement first). E.g. def calculate(guests): if isinstance(guests, basestring): guests = [guests]

  3. Everything else: just do. That makes the code more Pythonic. If you have a sub-routine that passes a variable of a totally wrong type into your function, don't change the function, change the sub-routine.
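Here's a minimal sketch of rules 1 and 2 (the function names and the requests usage are illustrative assumptions, not from any particular project):

import requests


def fetch(url):
    # Rule 1: external I/O -- ask for forgiveness with an exception wrapper.
    try:
        return requests.get(url, timeout=10).text
    except requests.RequestException:
        return None


def calculate(guests):
    # Rule 2: user-friendly API -- seek permission with an isinstance check.
    if isinstance(guests, basestring):  # `str` on Python 3
        guests = [guests]
    return len(guests)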

UPDATE

Here are the numbers for PyPy:

f0 0.369750552707
f1 0.321069081624
f2 0.411438703537
(smaller is better)

That's after averaging 15 runs of the script.

Note that the function with the extra if statement is faster.

And here are the numbers for Python 3.4.2:

f0 4.99579153742
f1 5.77459328515
f2 5.38382162367
(smaller is better)

That's averaging 10 rounds.

One almost interesting thing about these numbers is that their sums differ and tell a tiny story about the performance of each language:

Python 2.7.9   13.74036478996
PyPy 2.4.0     1.102258337868
Python 3.4.2   16.15420644624
(smaller is better)

UPDATE 2

Here's the node equivalent version and its times:

f0 0.215509441
f1 0.228280196357
f2 0.316222934714
(smaller is better)

That means that my Node v0.10.35 is 45% faster than PyPy. But please, don't take that seriously.

I just pushed out a new release of premailer which comes with a pretty big change.

What it means is that the way the base_url and any href= or src= get combined has changed. For example, you used to be able to set Premailer(html, base_url='http://example.com/subfolder') and combined with <img src="/images/foo.png"> it would become <img src="http://example.com/subfolder/images/foo.png">.

Not any more. The joining now works exactly like Python's built-in urljoin(). E.g.

>>> from urllib.parse import urljoin  # python 3
>>> urljoin('https://example.com', '/image.png')
'https://example.com/image.png'
>>> urljoin('https://example.com/subfolder', '/image.png')
'https://example.com/image.png'
>>> urljoin('https://example.com/subfolder/', '/image.png')
'https://example.com/image.png'
>>> urljoin('https://example.com/subfolder/', '//image.png')
'https://image.png'
>>> urljoin('https://example.com/subfolder/', '//mycdn.com/image.png')
'https://mycdn.com/image.png'
>>> urljoin('http://example.com/subfolder/', '//mycdn.com/image.png')
'http://mycdn.com/image.png'
>>> urljoin('https://example.com/subfolder', 'image.png')
'https://example.com/image.png'
>>> urljoin('https://example.com/subfolder/', 'image.png')
'https://example.com/subfolder/image.png'

So basically, if you were doing something odd with your base_url, check it over carefully when you upgrade to version 2.9.0.
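For example, a hypothetical before/after (the HTML and paths here are made up):

from premailer import Premailer

html = '<html><body><img src="/images/foo.png"></body></html>'
result = Premailer(html, base_url='http://example.com/subfolder').transform()
# With the new urljoin() semantics the src becomes
# http://example.com/images/foo.png rather than
# http://example.com/subfolder/images/foo.png as it would have before.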

Thank you @ewjoachim and @graingert for your help!

The idea with template context processors in Django is to inject some default things that are available whenever a template is rendered with a request.

I.e. instead of...:

from django.conf import settings
from django.shortcuts import render


def view1(request):
    context = {
        'name': 'View 1',
        'on_dev_server': request.get_host() in settings.DEV_HOSTNAMES,
    }
    return render(request, 'view1.html', context)


def view2(request):
    context = {
        'name': 'View 2',
        'other': 'things',
        'on_dev_server': request.get_host() in settings.DEV_HOSTNAMES,
    }
    return render(request, 'view2.html', context)

And in your nominal templates/base.html you might have something like this:

  ...
  <footer>
  <p>&copy; You 2015</p>
  {% if on_dev_server %}
    <p color="red">Note! We're currently on a dev server!</p>
  {% endif %}
  </footer>
  ...

Instead you do this trick; in your settings.py you write down the list of defaults plus the one you want to always have available:

TEMPLATE_CONTEXT_PROCESSORS = (
    "django.contrib.auth.context_processors.auth",
    "django.template.context_processors.static",
    "myproject.myapp.context_processors.debug_info",
)

And to accompany that you define your myproject/myapp/context_processors.py like so:

from django.conf import settings


def debug_info(request):
    return {
        'on_dev_server': request.get_host() in settings.DEV_HOSTNAMES,
    }

So far so good.

However, there's a problem with this. Two problems in fact.

The first problem is that when all the templates in your big complicated website render, it's quite possible that some pages don't need everything you set up in your context processors. That can mean a heck of a lot of extra computation for something that will never be displayed.

For example, I have a project where most pages have a sidebar where I show "Trending Events", which is something I compute in a context_processors.py function called def sidebar_events(request):. But the sidebar is not always shown, and on the pages where it's not shown it's a waste to compute the stuff that sidebar_events computes. Also, I have management pages which use a totally different base.html template. So there's a big chance you're wasting precious CPU.

Another problem is code readability (aka. how frustrating this is to debug for someone else, or for yourself after months away from the code). If you're skimming through your base.html and you see this "random" variable called on_dev_server, it's very hard to tell where the heck it's defined. Grepping the whole source code is one way to go. A much better way to solve that problem is sensible namespace naming.

And also, by being too liberal with globally scoped variables there's a chance you might clash with a different piece of functionality that uses the same variable names. That chance is smaller when you use namespaces.

So, to remedy this, let your template context processor functions return closures. The closure wraps the request automagically.

Let's rewrite our trivial example from above, the context_processors.py should now look like this:

def debug_info(request):
    def inner():
        return {
            'on_dev_server': request.get_host() in settings.DEV_HOSTNAMES,
        }
    return {'debug_info': inner}

Now executing that becomes more optional and more deliberate in the template instead. E.g.

  ...
  <footer>
  <p>&copy; You 2015</p>
  {% set debug_info = debug_info() %}
  {% if debug_info['on_dev_server'] %}
    <p color="red">Note! We're currently on a dev server!</p>
  {% endif %}
  </footer>
  ...

This makes it more explicit, which is a good thing. It also means the work can be skipped entirely in templates that don't need it.

After a day of pushing 9 commits to a PR trying to get Travis to build a simple Python package on Python 2.6, 2.7, 3.3 and 3.4, I finally gave up, ripped out all of httpretty and replaced it with good old mock.patch().

I was getting all sorts of strange warnings in Python 3.3, and 3.4 got stuck all the time.
This is not the first time httpretty has been causing confusion, so from now on I'm giving up on httpretty. I think it was too good to be true to work reliably. Honestly, it might be Python's fault for not making its internals better available to cool libs like httpretty.

By the way, here's one of those errors where Python 3.4 just hangs which stopped being the case once I took out httpretty. And here you can see the clear failure to deactivate the monkeypatch even after the test is complete in Python 3.3.
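For reference, here's roughly the kind of mock.patch() substitution involved (a made-up example, not the actual tests in that PR):

import unittest

import mock  # unittest.mock in Python 3
import requests


def fetch_title(url):
    return requests.get(url).json()['title']


class FetchTitleTest(unittest.TestCase):

    @mock.patch('requests.get')
    def test_fetch_title(self, rget):
        # Stub out the network call entirely; no monkeypatching of sockets.
        rget.return_value = mock.Mock(
            status_code=200,
            json=lambda: {'title': 'Hello'},
        )
        self.assertEqual(fetch_title('http://example.com/api'), 'Hello')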

First of all; hashing is hard. But fortunately it gets a little bit easier if it doesn't have to be cryptographic. A non-cryptographic hashing function is basically something that takes a string and converts it to another string in a predictable fashion, and it tries to do that with as few clashes as possible and as fast as possible.

MD5 works fine as a non-cryptographic hashing function here. Unlike SHA-256 or SHA-512 it's no longer considered cryptographically strong, but for this purpose that doesn't matter.

Now, how do you make a hashing function that yields a string that is as short as possible? The simple answer is to make the output use as many different characters as possible. If a hashing function only returns digits you only have 10 possibilities per character. If you instead use a-z, A-Z and 0-9 you have 26 + 26 + 10 = 62 possibilities per character.

A hex string on the other hand only uses 0-9 and a-f, which is only 10 + 6 = 16 possibilities per character. So you need a longer string to be sure it's unique and can't clash with another hash output. Git for example uses a 40-character-long hex string to represent a git commit. GitHub uses an abbreviated version of that in some of its web UI of only 7 characters, which it gets away with because things are often in the context of a repo name or something like that. For example github.com/peterbe/django-peterbecom/commit/462ae0c

So, what other choices do you have when it comes to returning a hash output that is sufficiently long that it's "almost guaranteed" to be unique but sufficiently short that it becomes practical in terms of storage space? I have an app for example that turns URLs into unique IDs because they're shorter that way and more space efficient to store as values in a big database. One such solution is to use a base64 encoding.

Base64 uses a-z, A-Z, 0-9 plus + and /, but you'll notice it doesn't have the "hashing" nature in that it's just a direct, reversible translation of the input. E.g.

>>> import base64
>>> base64.encodestring('peterbengtsson')
'cGV0ZXJiZW5ndHNzb24=\n'
>>> base64.encodestring('peterbengtsson2')
'cGV0ZXJiZW5ndHNzb24y\n'

I.e. these two strings are different, but if you were to take only the first 10 characters they would be the same. Basically, here's a terrible hashing function:

def hasher(s):  # this is not a good hashing function
    return base64.encodestring(s)[:10]

So, what we want is a hashing function that returns output that is short and very rarely clashing and does this as fast as possible.

To test this I wrote a script that tried a bunch of different ad-hoc hashing functions. I generate a list of 130,000+ different words with an average length of 15 characters. Then I loop over these words until a hashed output is repeated for a second time. And for each function, I take the time it takes to generate the 130,000+ hashes and multiply that by the total number of kilobytes of output. For example, if each hash output is 9 characters long that's (130,000 * 9) / 1024 ≈ 1142 Kb. And if it took 0.25 seconds to generate all of those, the combined score is 1142 * 0.25 ≈ 286 kilobyte-seconds.
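In rough terms the scoring could look like this (a sketch only; the word list, the hasher argument and the exact bookkeeping are assumptions, not the author's actual script):

import time


def score(hasher, words):
    # Returns elapsed-seconds * total-kilobytes ("kilobyte seconds");
    # smaller is better. Raises on the first clash.
    seen = set()
    t0 = time.time()
    for word in words:
        output = hasher(word)
        if output in seen:
            raise ValueError('clash on %r' % word)
        seen.add(output)
    elapsed = time.time() - t0
    total_kb = sum(len(x) for x in seen) / 1024.0
    return elapsed * total_kb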

Anyway, here are the results:

h11 100.00  0.217s  1184.4 Kb   257.52 kbs
h6  100.00  1.015s  789.6 Kb    801.52 kbs
h10 100.00  1.096s  789.6 Kb    865.75 kbs
h1  100.00  0.215s  4211.2 Kb   903.46 kbs
h4  100.00  1.017s  921.2 Kb    936.59 kbs

(kbs means "kilobytes seconds")

These are the functions that returned 0 clashes amongst 134,758 unique words. There were others too that I'm not bothering to include because they had clashes. So let's look at these functions:

import hashlib


def h11(w):
    return hashlib.md5(w).hexdigest()[:9]

def h6(w):
    h = hashlib.md5(w)
    return h.digest().encode('base64')[:6]

def h10(w):
    h = hashlib.sha256(w)
    return h.digest().encode('base64')[:6]

def h1(w):
    return hashlib.md5(w).hexdigest()

def h4(w):
    h = hashlib.md5(w)
    return h.digest().encode('base64')[:7]    

It's kinda arbitrary to say the "best" one is the one that takes the shortest time multiplied by size. Perhaps size matters more to you; in that case the h6() function is better because it returns 6-character strings instead of the 9-character strings of h11.

I'm apprehensive about publishing this blog post because I bet I'm doing this entirely wrong. Perhaps there are better ways to get a hash digest as a short string that doesn't need to be base64 encoded. I just haven't found any in the standard library yet.

In airmozilla the tests almost all derive from one base class whose tearDown deletes the automatically generated settings.MEDIA_ROOT directory and everything in it.

Then there's some code that makes sure a certain thing from the fixtures has a picture uploaded to it.

That means it has to do that shutil.rmtree(directory) and that shutil.copy(src, dst) on almost every single test. Some tests might not even need or depend on it, but it's convenient to put it there.

Anyway, I thought this is all a bit excessive and that I could probably optimize it by defining a custom test runner that first creates a clean settings.MEDIA_ROOT with the necessary file in it and then, when the test suite ends, deletes the directory.

But before I write that, let's measure how many gazillion milliseconds this is chewing up.
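A back-of-the-envelope way to measure it is to accumulate the time spent in those calls (a sketch only, not the actual airmozilla code; the media_root attribute is assumed, and _upload_media could be timed the same way):

import shutil
import time
import unittest

TIMINGS = {'teardown_seconds': 0.0, 'teardown_calls': 0}


class BaseTestCase(unittest.TestCase):

    def tearDown(self):
        t0 = time.time()
        shutil.rmtree(self.media_root, ignore_errors=True)  # self.media_root assumed
        TIMINGS['teardown_seconds'] += time.time() - t0
        TIMINGS['teardown_calls'] += 1
        super(BaseTestCase, self).tearDown()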

Basically, the tearDown was called 361 times and the _upload_media 281 times. In total, this adds up to a whopping 0.21 seconds! (out of the 69.133 seconds it takes to run the whole thing).

I think I'll cancel that optimization idea. Light shutil operations are dirt cheap.

So recently, I moved home for this blog. It used to be on AWS EC2 and is now on Digital Ocean. I wanted to start from scratch so I started on a blank new Ubuntu 14.04 and later rsync'ed over all the data bit by bit (no pun intended).

When I moved this site I copied the /etc/uwsgi/apps-enabled/peterbecom.ini file and started it with /etc/init.d/uwsgi start peterbecom. The settings were the same as before:

# this is /etc/uwsgi/apps-enabled/peterbecom.ini
[uwsgi]
virtualenv = /var/lib/django/django-peterbecom/venv
pythonpath = /var/lib/django/django-peterbecom
user = django
master = true
processes = 3
env = DJANGO_SETTINGS_MODULE=peterbecom.settings
module = django_wsgi2:application

But I kept getting this error:

Traceback (most recent call last):
...
  File "/var/lib/django/django-peterbecom/venv/local/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py", line 182, in _cursor
    self.connection = Database.connect(**conn_params)
  File "/var/lib/django/django-peterbecom/venv/local/lib/python2.7/site-packages/psycopg2/__init__.py", line 164, in connect
    conn = _connect(dsn, connection_factory=connection_factory, async=async)
psycopg2.OperationalError: FATAL:  Peer authentication failed for user "django"

What the heck! I thought. I was able to connect perfectly fine with the same config on the old server and here on the new server I was able to do this:

django@peterbecom:~/django-peterbecom$ source venv/bin/activate
(venv)django@peterbecom:~/django-peterbecom$ ./manage.py shell
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from peterbecom.apps.plog.models import *
>>> BlogItem.objects.all().count()
1040

Clearly I've set the right password in the settings/local.py file. In fact, I haven't changed anything and I pg_dump'ed the data over from the old server as is.

I edited the file psycopg2/__init__.py, added a print "DSN=", dsn, and those details were indeed correct.
I'm running the uwsgi app as user django and I'm connecting to Postgres as user django.

Anyway, what I needed to do to make it work was the following change:

# this is /etc/uwsgi/apps-enabled/peterbecom.ini
[uwsgi]
virtualenv = /var/lib/django/django-peterbecom/venv
pythonpath = /var/lib/django/django-peterbecom
user = django
uid = django   # THIS IS ADDED
master = true
processes = 3
env = DJANGO_SETTINGS_MODULE=peterbecom.settings
module = django_wsgi2:application

The difference here is the added uid = django.

I guess by moving across (I'm currently on uwsgi 1.9.17.1-debian) I get a newer version of uwsgi or something that simply can't just take the user directive but needs the uid directive too. That or something else complicated to do with the users and permissions that I don't understand.

Hopefully, by having blogged about this other people might find it and get themselves a little productivity boost.

tl;dr; It's not a competition! I'm just comparing Go and Python. So I can learn Go.

So recently I've been trying to learn Go. It's a modern programming language that started at Google but has very little to do with Google except that some of its core contributors are staff at Google.

The true strength of Go is that it's succinct and minimalistic and fast. It's not a scripting language like Python or Ruby but lots of people write scripts with it. It's growing in popularity with systems people but web developers like me have started to pay attention too.

The best way to learn a language is to do something with it. Build something. I don't disagree with that, but I just felt I needed to cover the basics first, and instead of taking notes I decided to learn by comparing it to something I know well, Python. I did this a zillion years ago when I tried to learn ZPT by comparing it to DTML, which I already knew well.

My free time is very limited so I'm taking things in small, careful baby steps. I read through An Introduction to Programming in Go by Caleb Doxsey in a couple of afternoons and then I decided to spend a couple of minutes every day with each chapter, implement something from that book and compare it to how you'd do it in Python.

I also added some slightly fuller examples, like Markdownserver, which was fun because it showed that a simple Go HTTP server that does something can be 10 times faster than the Python equivalent.

What I've learned

  • Go is very unforgiving but I kinda like it. It's like Python but with pyflakes switched on all the time.

  • Go is much more verbose than Python. It just takes so many more lines to say the same thing.

  • Goroutines are awesome. They're a million times easier to grok than Python's myriad of similar solutions.

  • In Python, the ability to write to a list and it automatically expanding at will is awesome.

  • Go doesn't have the concept of "truthy" which I already miss. I.e. in Python you can convert a list type to boolean and the language does this automatically by checking if the length of the list is 0.

  • Go gives you very few choices (e.g. there's only one type of loop and it's the for loop) but you often have a choice to pass a copy of an object or to pass a pointer. Those are different things but sometimes I feel like the computer could/should figure it out for me.

  • I love the little defer thing which means I can put "things to do when you're done" right underneath the thing I'm doing. In Python you get these try: ...20 lines... finally: ...now it's over... things.

  • The coding style rules are very different but in Go it's a no brainer because you basically don't have any choices. I like that. You just have to remember to use gofmt.

  • Everything about Go and Go tools follow the strict UNIX pattern to not output anything unless things go bad. I like that.

  • godoc.org is awesome. If you ever wonder how a built-in package works you can just type it in after godoc.org, like godoc.org/math for example.

  • You don't have to compile your Go code to run it. You can simply type go run mycode.go and it automatically compiles it and then runs it. And it's super fast.

  • go get can take a URL like github.com/russross/blackfriday and just install it. No PyPI equivalent. But it scares me to depend on people's master branches on GitHub. What if master is very different when I go get something locally compared to when I run go get weeks/months later on the server?

A couple of weeks ago we had accidentally broken our production server (for a particular report) because of broken HTML. It was an unclosed tag which rendered everything after that tag as just plain white. Our comprehensive test suite failed to notice it because it didn't look at details like that. And when it was tested manually we simply missed the conditional situation that caused it. Neither is a good excuse. So it got me thinking about how we can incorporate HTML (HTML5 in particular) validation into our test suite.

So I wrote a little gist and used it a bit on a couple of projects and was quite pleased with the results. But I thought this might be something worthwhile to keep around for future projects or for other people who can't just copy-n-paste a gist.

With that in mind I put together a little package with a README and a setup.py and now you can use it too.

There are however some caveats. Especially if you intend to run it as part of your test suite.

Caveat number 1

You can't flood html5.validator.nu. Well, you can I guess. It would be really evil of you and kittens will die. If you have a test suite that does things like response = self.client.get(reverse('myapp:myview')) and there are many tests, you might be causing an obscene amount of HTTP traffic to them. Which brings us on to...

Caveat number 2

The html5.validator.nu validator is written in Java and it's open source. You can download it and point django-html-validator to it locally. The way it works is java -jar vnu.jar myfile.html. However, it's slow. Like really slow. It takes about 2 seconds to validate just one modest HTML file. So you need to be patient.
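Calling the downloaded validator from Python boils down to something like this (file names assumed):

import subprocess

# vnu.jar exits non-zero if it finds validation errors in the file.
exit_code = subprocess.call(['java', '-jar', 'vnu.jar', 'myfile.html'])
if exit_code != 0:
    print('HTML validation problems found')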

Premailer is probably my most successful open source project in recent years. I base that on the fact that 25 different people have committed to it.

Today I merged a monster PR by Michael Jason Smith of OnlineGroups.net.

What it does, basically, is make premailer work in Python 3, PyPy and Python 2.6. Check out the tox.ini file. Test coverage is still 100%.

If you look at the patch, the core of the change is actually surprisingly small. The majority of the "secret sauce" is basically a bunch of import statements which are split by if sys.version_info >= (3, ): plus various minor changes around UTF-8 encoding. The rest of the changes are basically test sit-ups.
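The pattern is roughly this (a sketch of the idea, not the actual premailer diff):

import sys

if sys.version_info >= (3,):
    from urllib.parse import urljoin, urlparse
    text_type = str
else:
    from urlparse import urljoin, urlparse
    text_type = unicode  # noqa: F821 (Python 2 only)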

A really interesting thing that hit us was that the code had assumptions about the order of things. Basically, the tests assumed that the order of certain things in the resulting output was predictable even though it was produced using a dict. dicts are famously unreliable in terms of the order you get things out; that's deliberate and a design choice. That it worked until now is not only luck, it's quite amazing.

Anyway, check it out. Now that we have a tox.ini file it should become much easier to run tests which I hope means patches will be better checked as they come in.