Peterbe.com

A blog and website by Peter Bengtsson

Filtered home page!
Currently only showing blog entries under the category: Mozilla. Clear filter

Mozilla Symbol Server (aka. Tecken) load testing

06 September 2017 0 comments   Mozilla, Django, Web development, Python


(Thanks Miles Crabil not only for being an awesome Ops person but also for reviewing this blog post!)

My project over the summer, here at Mozilla, has been a project called Mozilla Symbol Server. It's a web service that uploads C++ symbol files, downloads C++ symbol files and symbolicates C++ crash stacktraces. It went into production last week which was fun but there's still lots of work to do on adding beyond-parity features and more optimizations.

What Is Mozilla Symbol Server?

The code name for this project is Tecken and it's written in Python (Django, Gunicorn) and uses PostgreSQL, Redis and Celery. The frontend is entirely static and developed (almost) as a separate project within. The frontend is written in React (using create-react-app and react-router). Everything is run as Docker containers. And if you ask me more details about how it's configured/deployed I'm afraid I have to defer to the awesome Mozilla CloudOps team.

One the challenges I faces developing Tecken is that symbol downloads need to be fast to handle high volumes of traffic. Today I did some load testing on our stage deployment and managed to start 14 concurrent clients that bombarded our staging server with realistic HTTPS GET queries based on log files. It's actually 7 + 1 + 4 + 2 concurrent clients. 7 of them from a m3.2xlarge EC2 node (8 vCPUs), 1 from a m3.large EC2 node (1 vCPU), 2 from two separate NYC based DigitalOcean personal servers and 2 clients here from my laptop on my home broadband. Basically, each loadtest script process got its own CPU.

Total req/s
It's hard to know how much more each client could push if it wasn't slowed down. Either way, the server managed to sustain about 330 requests per second. Our production baseline goal is to able to handle at least 40 requests per second.

After running for a while the caches started getting warm but about 1-5% of requests do have to make a boto3 roundtrip to an S3 bucket located on the other side of America in Oregon. There is also a ~5% penalty in that some requests trigger a write to a central Redis ElastiCache server. That's cheaper than the boto3 S3 call but still hefty latency costs to pay.

The ELB in our staging environment spreads the load between 2 c4.large (2 vCPUs, 3.75GB RAM) EC2 web heads. Each running with preloaded Gunicorn workers between Nginx and Django. Each web head has its own local memcached server to share memory between each worker but only local to the web head.

Is this a lot?

How long is a rope? Hard to tell. Tecken's performance is certainly more than enough and by the sheer fact that it was only just production deployed last week tells me we can probably find a lot of low-hanging fruit optimizations on the deployment side over time.

One way of answering that is to compare it with our lightest endpoint. One that involves absolutely no external resources. It's just pure Python in the form of ELB → Nginx → Gunicorn → Django. If I run hey from the same server I did the load testing I get a topline of 1,300 requests per second.

$ hey -n 10000 -c 10 https://symbols.stage.mozaws.net/__lbheartbeat__
Summary:
  Total:    7.6604 secs
  Slowest:  0.0610 secs
  Fastest:  0.0018 secs
  Average:  0.0075 secs
  Requests/sec: 1305.4199
...  

That basically means that all the extra "stuff" (memcache key prep, memcache key queries and possible other high latency network requests) it needs to do in the Django view takes up roughly 3x the time it takes the absolute minimal Django request-response rendering.

Also, if I use the same technique to bombard a single URL, but one that actually involves most code steps but is definitely able to not require any slow ElastiCache writes or boto3 S3 reads you I get 800 requests per second:

$ hey -n 10000 -c 10 https://symbols.stage.mozaws.net/advapi32.pdb/5EFB9BF42CC64024AB64802E467394642/advapi32.sy
Summary:
  Total:    12.4160 secs
  Slowest:  0.0651 secs
  Fastest:  0.0024 secs
  Average:  0.0122 secs
  Requests/sec: 805.4150
  Total data:   300000 bytes
  Size/request: 30 bytes
...

Lesson learned

Max CPU Used
It's a recurring reminder that performance is almost all about latency. If not RAM or disk it's networking. See the graph of the "Max CPU Used" which basically shows that CPU of user, system and stolen ("CPU spent waiting for the hypervisor to service another virtual CPU") never sum totalling over 50%.

crontabber now supports locking, both high- and low-level

04 March 2017 0 comments   PostgreSQL, Mozilla, Python

https://github.com/mozilla/crontabber#how-locking-works


tl;dr; In other words, you can now have multiple servers with crontabber, all talking to a central PostgreSQL for its state, and not have to worry about jobs being started more than exactly once. This will be super useful if your crontabber apps are such that they kick of stored procedures that would freak out if run more than once with the same parameters.

crontabber is an advanced Python program to run cron-like applications in a predictable way. Every app is a Python class with a run method. Example here. Until version 0.18 you had to do locking outside but now the locking has been "internalized". Meaning, if you open two terminals and run python crontabber.py --admin.conf=myconfig.ini in both you don't have to worry about it starting the same apps in parallel.

General, business logic locking

Every app has a state. It's stored in PostgreSQL. It looks like this:

# \d crontabber
              Table "public.crontabber"
    Column    |           Type           | Modifiers
--------------+--------------------------+-----------
 app_name     | text                     | not null
 next_run     | timestamp with time zone |
 first_run    | timestamp with time zone |
 last_run     | timestamp with time zone |
 last_success | timestamp with time zone |
 error_count  | integer                  | default 0
 depends_on   | text[]                   |
 last_error   | json                     |
 ongoing      | timestamp with time zone |
Indexes:
    "crontabber_unique_app_name_idx" UNIQUE, btree (app_name)

The last column, ongoing used to be just for the "curiosity". For example, in Socorro we used that to display a flashing message about which jobs are ongoing right now.

As of version 0.18, this ongoing column is actually used to NOT run apps again. Basically, when started, crontabber figures out which app to run next (assuming it's time to run it) and now the first thing it does is look up if it's ongoing already, and if it is the whole crontabber application exits with an error code of 3.

Sub-second locking

What might happen is that two separate servers which almost perfectly synchronoized clocks might have cron run crontabber at the "exact" same time. Or rather, only a few milliseconds apart. But the database is central so what might happen is that two distinct PostgreSQL connection tries to send a... UPDATE crontabber SET ongoing=now() WHERE app_name='some-app-name' at the very same time.

So how is this solved? The answer is row-level locking. The magic sauce is here. You make a select, by app_name with a suffix of FOR UPDATE WAIT. Imagine two distinct PostgreSQL connections sending this:

BEGIN;
SELECT ongoing FROM crontabber WHERE app_name = 'my-app-name'
FOR UPDATE NOWAIT;

-- do some other stuff in Python

UPDATE crontabber SET ongoing = now() WHERE app_name = 'my-app-name';
COMMIT;

One of them will succeed the other will raise an error. Now all you need to do is catch that raised error, check that it's a row-level locking error and not some other general error. Instead of worrying about the raised error you just accept it and exit the program early.

This screenshot of a test.sql script demonstrates this:

Two distinct terminals sending an UPDATE to psql. One will error.
Two terminals lined up and I start one and quickly switch and start the other one

Another way to demonstrate this is to use psycopg2 in a little script:

import threading
import psycopg2


def updater():
    connection = psycopg2.connect('dbname=crontabber_exampleapp')
    cursor = connection.cursor()
    cursor.execute("""
    SELECT ongoing FROM crontabber WHERE app_name = 'bar'
    FOR UPDATE NOWAIT
    """)
    cursor.execute("""
    UPDATE crontabber SET ongoing = now() WHERE app_name = 'bar'
    """)
    print("JOB CAN START!")
    connection.commit()


# Use threads to simulate starting two connections virtually 
# simultaneously.
threads = [
    threading.Thread(target=updater),
    threading.Thread(target=updater),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

The output of this is:

▶ python /tmp/test.py
JOB CAN START!
Exception in thread Thread-1:
Traceback (most recent call last):
...
OperationalError: could not obtain lock on row in relation "crontabber"

With threads, you never know exactly which one will work and which one will not. In this case it was Thread-1 that sent its SQL a couple of nanoseconds too late.

In conclusion...

As of version 0.18 of crontabber, all locking is now dealt with inside crontabber. You still kick off crontabber from cron or crontab but if your cron does kick it off whilst it's still in the midst of running a job, it will simply exit with an error code of 2 or 3.

In other words, you can now have multiple servers with crontabber, all talking to a central PostgreSQL for its state, and not have to worry about jobs being started more than exactly once. This will be super useful if your crontabber apps are such that they kick of stored procedures that would freak out if run more than once with the same parameters.

Using Fanout.io in Django

13 December 2016 1 comment   Python, Web development, Django, Mozilla, Javascript


Earlier this year we started using Fanout.io in Air Mozilla to enhance the experience for users awaiting content updates. Here I hope to flesh out its details a bit to inspire others to deploy a similar solution.

What It Is

First of all, Fanout.io is basically a service that handles your WebSockets. You put in some of Fanout's JavaScript into your site that handles a persistent WebSocket connection between your site and Fanout.io. And to push messages to your user you basically send them to Fanout.io from the server and they "forward" it to the WebSocket.

The HTML page looks like this:

<html>
<body>

  <h1>Web Page</h1>

<!-- replace the FANOUT_REALM_ID with the ID you get in the Fanout.io admin page -->
<script 
  src="https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux/static/faye-browser-1.1.2-fanout1-min.js"
></script>
<script src="fanout.js"></script>
</body>
</html>

And the fanout.js script looks like this:

window.onload = function() {
  // replace the FANOUT_REALM_ID with the ID you get in the Fanout.io admin page
  var client = new Faye.Client('https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux')
  client.subscribe('/mycomments', function(data) {  
     console.log('Incoming updated data from the server:', data);
  })
};

And in server it looks something like this:

from django.conf import settings
import fanout

fanout.realm = settings.FANOUT_REALM_ID
fanout.key = settings.FANOUT_REALM_KEY


def post_comment(request):
    """A django view function that saves the posted comment"""
   text = request.POST['comment']
   saved_comment = Comment.objects.create(text=text, user=request.user)
   fanout.publish('mycomments', {'new_comment': saved_comment.id})
   return http.JsonResponse({'comment_posted': True})

Note that, in the client-side code, there's no security since there's no authentication. Any client can connect to any channel. So it's important that you don't send anything sensitive. In fact, you should think of this pattern simply as a hint that something has changed. For example, here's a slightly more fleshed out example of how you'd use the subscription.

window.onload = function() {
  // replace the FANOUT_REALM_ID with the ID you get in the Fanout.io admin page
  var client = new Faye.Client('https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux')
  client.subscribe('/mycomments', function(data) {  
    if (data.new_comment) {
      // server says a new comment has been posted in the server
      $.json('/comments', function(response) {
        $('#comments .comment').remove();
        $.each(response.comments, function(comment) {        
          $('<div class="comment">')
          .append($('<p>').text(comment.text))
          .append($('<span>').text('By: ' + comment.user.name))
          .appendTo('#comments');
        });
      });
    }
  })
};

Yes, I know jQuery isn't hip but it demonstrates the pattern well. Also, in the real world you might not want to ask the server for all comments (and re-render) but instead do an AJAX query to get all new comments since some parameter or something.

Why It's Awesome

It's awesome because you can have a simple page that updates near instantly when the server's database is updated. The alternative would be to do a setInterval loop that frequently does an AJAX query to see if there's new content to update. This is cumbersome because it requires a lot heavier AJAX queries. You might want to make it secure so you engage sessions that need to be looked up each time. Or, since you're going to request it often you have to write a very optimized server-side endpoint that is cheap to query often.

And last but not least, if you rely on an AJAX loop interval, you have to pick a frequency that your server can cope with and it's likely to be in the range of several seconds or else it might overload the server. That means that updates are quite delayed.

But maybe most important, you don't need to worry about running a WebSocket server. It's not terribly hard to do one yourself on your laptop with a bit of Node Express or Tornado but now you have yet another server to maintain and it, internally, needs to be connected to a "pub-sub framework" like Redis or a full blown message queue.

Alternatives

Fanout.io is not the only service that offers this. The decision to use Fanout.io was taken about a year ago and one of the attractive things it offers is that it's got a freemium option which is ideal for doing local testing. The honest truth is that I can't remember the other justifications used to chose Fanout.io over its competitors but here are some alternatives that popped up on a quick search:

It seems they all (including Fanout.io) has freemium plans, supports authentication, REST APIs (for sending and for querying connected clients' stats).

There are also some more advanced feature packed solutions like Meteor, Firebase and GunDB that act more like databases that are connected via WebSockets or alike. For example, you can have a database as a "conduit" for pushing data to a client. Meaning, instead of sending the data from the server directly you save it in a database which syncs to the connected clients.

Lastly, I've heard that Heroku has a really neat solution that does something similar whereby it sets up something similar as an extension.

Let's Get Realistic

The solution sketched out above is very simplistic. There are a lot more fine-grained details that you'd probably want to zoom in to if you're going to do this properly.

Throttling

In Air Mozilla, we call fanout.publish(channel, message) from a post_save ORM signal. If you have a lot of saves for some reason, you might be sending too many messages to the client. A throttling solution, per channel, simply makes sure your "callback" gets called only once per channel per small time frame. Here's the solution we employed:

window.Fanout = (function() {
  var _locks = {};
  return {
    subscribe: function subscribe(channel, callback) {
      _client.subscribe(channel, function(data) {
          if (_locks[channel]) {
              // throttled
              return;
          }
          _locks[channel] = true;
          callback(data);
          setTimeout(function() {
              _locks[channel] = false;
          }, 500);
      });        
    };
  }
})();

Subresource Integrity

Subresource integrity is an important web security technique where you know in advance a hash of the remote JavaScript you include. That means that if someone hacks the result of loading https://cdn.example.com/somelib.js the browser compares the hash of that with a hash mentioned in the <script> tag and refuses to load it if the hash doesn't match.

In the example of Fanout.io it actually looks like this:

<script 
  src="https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux/static/faye-browser-1.1.2-fanout1-min.js"
  crossOrigin="anonymous"
  integrity="sha384-/9uLm3UDnP3tBHstjgZiqLa7fopVRjYmFinSBjz+FPS/ibb2C4aowhIttvYIGGt9"
></script>

The SHA you get from the Fanout.io documentation. It requires, and implies, that you need to use an exact version of the library. You can't use it like this: <script src="https://cdn.example/somelib.latest.min.js" ....

WebSockets vs. Long-polling

Fanout.io's JavaScript client follows a pattern that makes it compatible with clients that don't support WebSockets. The first technique it uses is called long-polling. With this the server basically relys on standard HTTP techniques but the responses are long lasting instead. It means the request simply takes a very long time to respond and when it does, that's when data can be passed.

This is not a problem for modern browsers. They almost all support WebSocket but you might have an application that isn't a modern browser.

Anyway, what Fanout.io does internally is that it first creates a long-polling connection but then shortly after tries to "upgrade" to WebSockets if it's supported. However, the projects I work only need to support modern browsers and there's a trick to tell Fanout to go straight to WebSockets:

var client = new Faye.Client('https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux', {
    // What this means is that we're opting to have
    // Fanout *start* with fancy-pants WebSocket and
    // if that doesn't work it **falls back** on other
    // options, such as long-polling.
    // The default behaviour is that it starts with
    // long-polling and tries to "upgrade" itself
    // to WebSocket.
    transportMode: 'fallback'
});

Fallbacks

In the case of Air Mozilla, it already had a traditional solution whereby it does a setInterval loop that does an AJAX query frequently.

Because the networks can be flaky or because something might go wrong in the client, the way we use it is like this:

var RELOAD_INTERVAL = 5;  // seconds

if (typeof window.Fanout !== 'undefined') {
    Fanout.subscribe('/' + container.data('subscription-channel-comments'), function(data) {
        // Supposedly the comments have changed.
        // For security, let's not trust the data but just take it
        // as a hint that it's worth doing an AJAX query
        // now.
        Comments.load(container, data);
    });
    // If Fanout doesn't work for some reason even though it
    // was made available, still use the regular old
    // interval. Just not as frequently.
    RELOAD_INTERVAL = 60 * 5;
}
setInterval(function() {
    Comments.reload_loop(container);
}, RELOAD_INTERVAL * 1000);

Use Fanout Selectively/Progressively

In the case of Air Mozilla, there are lots of pages. Some don't ever need a WebSocket connection. For example, it might be a simple CRUD (Create Update Delete) page. So, for that I made the whole Fanout functionality "lazy" and it only gets set up if the page has some JavaScript that knows it needs it.

This also has the benefit that the Fanout resource loading etc. is slightly delayed until more pressing things have loaded and the DOM is ready.

You can see the whole solution here. And the way you use it here.

Have Many Channels

You can have as many channels as you like. Don't create a channel called comments when you can have a channel called comments-123 where 123 is the ID of the page you're on for example.

In the case of Air Mozilla, there's a channel for every single page. If you're sitting on a page with a commenting widget, it doesn't get WebSocket messages about newly posted comments on other pages.

Conclusion

We've now used Fanout for almost a year in our little Django + jQuery app and it's been great. The management pages in Air Mozilla use AngularJS and the integration looks like this in the event manager page:

window.Fanout.subscribe('/events', function(data) {
    $scope.$apply(lookForModifiedEvents);
});

Fanout.io's been great to us. Really responsive support and very reliable. But if I were to start a fresh new project that needs a solution like this I'd try to spend a little time to investigate the competitors to see if there are some neat features I'd enjoy.

UPDATE

Fanout reached out to help explain more what's great about Fanout.io

"One of Fanout's biggest differentiators is that we use and promote open technologies/standards. For example, our service supports the open Bayeux protocol, and you can connect to it with any compatible client library, such as Faye. Nearly all competing services have proprietary protocols. This "open" aspect of Fanout aligns pretty well with Mozilla's values, and in fact you'd have a hard time finding any alternative that works the same way."

How to track Google Analytics pageviews on non-web requests (with Python)

03 May 2016 1 comment   Python, Web development, Django, Mozilla


tl;dr; Use raven's ThreadedRequestsHTTPTransport transport class to send Google Analytics pageview trackings asynchronously to Google Analytics to collect pageviews that aren't actually browser pages.

We have an API on our Django site that was not designed from the ground up. We had a bunch of internal endpoints that were used by the website. So we simply exposed those as API endpoints that anybody can query. All we did was wrap certain parts carefully as to not expose private stuff and we wrote a simple web page where you can see a list of all the endpoints and what parameters are needed. Later we added auth-by-token.

Now the problem we have is that we don't know which endpoints people use and, as equally important, which ones people don't use. If we had more stats we'd be able to confidently deprecate some (for easier maintanenace) and optimize some (to avoid resource overuse).

Our first attempt was to use statsd to collect metrics and display those with graphite. But it just didn't work out. There are just too many different "keys". Basically, each endpoint (aka URL, aka URI) is a key. And if you include the query string parameters, the number of keys just gets nuts. Statsd and graphite is better when you have about as many keys as you have fingers on one hand. For example, HTTP error codes, 200, 302, 400, 404 and 500.

Also, we already use Google Analytics to track pageviews on our website, which is basically a measure of how many people render web pages that have HTML and JavaScript. Google Analytic's UI is great and powerful. I'm sure other competing tools like Mixpanel, Piwik, Gauges, etc are great too, but Google Analytics is reliable, likely to stick around and something many people are familiar with.

So how do you simulate pageviews when you don't have JavaScript rendering? The answer; using plain HTTP POST. (HTTPS of course). And how do you prevent blocking on sending analytics without making your users have to wait? By doing it asynchronously. Either by threading or a background working message queue.

Threading or a message queue

If you have a message queue configured and confident in its running, you should probably use that. But it adds a certain element of complexity. It makes your stack more complex because now you need to maintain a consumer(s) and the central message queue thing itself. What if you don't have a message queue all set up? Use Python threading.

To do the threading, which is hard, it's always a good idea to try to stand on the shoulder of giants. Or, if you can't find a giant, find something that is mature and proven to work well over time. We found that in Raven.

Raven is the Python library, or "agent", used for Sentry, the open source error tracking software. As you can tell by the name, Raven tries to be quite agnostic of Sentry the server component. Inside it, it has a couple of good libraries for making threaded jobs whose task is to make web requests. In particuarly, the awesome ThreadedRequestsHTTPTransport. Using it basically looks like this:

import urlparse
from raven.transport.threaded_requests import ThreadedRequestsHTTPTransport

transporter = ThreadedRequestsHTTPTransport(
    urlparse.urlparse('https://ssl.google-analytics.com/collect'),
    timeout=5
)

params = {
    ...more about this later...
}

def success_cb():
    print "Yay!"

def failure_cb(exception):
    print "Boo :("

transporter.async_send(
    params,
    headers,
    success_cb,
    failure_cb
)

The call isn't very different from regular plain old requests.post.

About the parameters

This is probably the most exciting part and the place where you need some thought. It's non-trivial because you might need to put some careful thought into what you want to track.

Your friends is: This documentation page

There's also the Hit Builder tool where you can check that the values you are going to send make sense.

Some of the basic ones are easy:

"Protocol Version"

Just set to v=1

"Tracking ID"

That code thing you see in the regular chunk of JavaScript you put in the head, e.g tid=UA-1234-Z

"Data Source"

Optional word you call this type of traffic. We went with ds=api because we use it to measure the web API.

The user ones are a bit more tricky. Basically because you don't want to accidentally leak potentially sensitive information. We decided to keep this highly anonymized.

"Client ID"

A random UUID (version 4) number that identifies the user or the app. Not to be confused with "User ID" which is basically a string that identifies the user's session storage ID or something. Since in our case we don't have a user (unless they use an API token) we leave this to a new random UUID each time. E.g. cid=uuid.uuid4().hex This field is not optional.

"User ID"

Some string that identifies the user but doesn't reveal anything about the user. For example, we use the PostgreSQL primary key ID of the user as a string. It just means we can know if the same user make several API requests but we can never know who that user is. Google Analytics uses it to "lump" requests together. This field is optional.

Next we need to pass information about the hit and the "content". This is important. Especially the "Hit type" because this is where you make your manually server-side tracking act as if the user had clicked around on the website with a browser.

"Hit type"

Set this to t=pageview and it'll show up Google Analytics as if the user had just navigated to the URL in her browser. It's kinda weird to do this because clearly the user hasn't. Most likely she's used curl or something from the command line. So it's not really a pageview but, on our end, we have "views" in the webserver that produce information to the user. Some of it is HTML and some of it is JSON, in terms of output format, but either way they're sending us a URL and we respond with data.

"Document location URL"

The full absolute URL of that was used. E.g. https://www.example.com/page?foo=bar. So in our Django app we set this to dl=request.build_absolute_uri(). If you have a site where you might have multiple domains in use but want to collect them all under just 1 specific domain you need to set dh=example.com.

"Document Host Name" and "Document Path"

I actually don't know what the point of this is if you've already set the "Document location URL".

"Document Title"

In Google Analytics you can view your Content Drilldown by title instead of by URL path. In our case we set this to a string we know from the internal Python class that is used to make the API endpoint. dt='API (%s)'%api_model.__class__.__name__.

There are many more things you can set, such as the clients IP, the user agent, timings, exceptions. We chose to NOT include the user's IP. If people using the JavaScript version of Google Analytics can set their browser to NOT include the IP, we should respect that. Also, it's rarely interesting to see where the requests for a web API because it's often servers' curl or requests that makes the query, not the human.

Sample implementation

Going back to the code example mentioned above, let's demonstrate a fuller example:

import urlparse
from raven.transport.threaded_requests import ThreadedRequestsHTTPTransport

transporter = ThreadedRequestsHTTPTransport(
    urlparse.urlparse('https://ssl.google-analytics.com/collect'),
    timeout=5
)

# Remember, this is a Django, but you get the idea

domain = settings.GOOGLE_ANALYTICS_DOMAIN
if not domain or domain == 'auto':
    domain = RequestSite(request).domain

params = {
    'v': 1,
    'tid': settings.GOOGLE_ANALYTICS_ID,
    'dh': domain,
    't': 'pageview,
    'ds': 'api',
    'cid': uuid.uuid4().hext,
    'dp': request.path,
    'dl': request.build_request_uri(),
    'dt': 'API ({})'.format(model_class.__class__.__name__),
    'ua': request.META.get('HTTP_USER_AGENT'),
}

def success_cb():
    logger.info('Successfully informed Google Analytics (%s)', params)

def failure_cb(exception):
    logger.exception(exception)

transporter.async_send(
    params,
    headers,
    success_cb,
    failure_cb
)

How to unit test this

The class we're using, ThreadedRequestsHTTPTransport has, as you might have seen, a method called async_send. There's also one, with the exact same signature, called sync_send which does the same thing but in a blocking fashion. So you could make your code look someting silly like this:

def send_tracking(page_title, request, async=True):
    # ...same as example above but wrapped in a function...
    function = async and transporter.async_send or transporter.sync_send
    function(
        params,
        headers,
        success_cb,
        failure_cb
    )

And then in your tests you pass in async=False instead.
But don't do that. The code shouldn't be sub-serviant to the tests (unless it's for the sake of splitting up monster-long functions).
Instead, I recommend you mock the inner workings of that ThreadedRequestsHTTPTransport class so you can make the whole operation synchronous. For example...

import mock
from django.test import TestCase
from django.test.client import RequestFactory

from where.you.have import pageview_tracking


class TestTracking(TestCase):

    @mock.patch('raven.transport.threaded_requests.AsyncWorker')
    @mock.patch('requests.post')
    def test_pageview_tracking(self, rpost, aw):

        def mocked_queue(function, data, headers, success_cb, failure_cb):
            function(data, headers, success_cb, failure_cb)

        aw().queue.side_effect = mocked_queue

        request = RequestFactory().get('/some/page')
        with self.settings(GOOGLE_ANALYTICS_ID='XYZ-123'):
            pageview_tracking('Test page', request)

            # Now we can assert that 'requests.post' was called.
            # Left as an exercise to the reader :)
            print rpost.mock_calls       

This is synchronous now and works great. It's not finished. You might want to write a side effect for the requests.post so you can have better control of that post. That'll also give you a chance to potentially NOT return a 200 OK and make sure that your failure_cb callback function gets called.

How to manually test this

One thing I was very curious about when I started was to see how it worked if you really ran this for reals but without polluting your real Google Analytics account. For that I built a second little web server on the side, whose address I used instead of https://ssl.google-analytics.com/collect. So, change your code so that https://ssl.google-analytics.com/collect is not hardcoded but a variable you can change locally. Change it to http://localhost:5000/ and start this little Flask server:

import time
import random
from flask import Flask, abort, request

app = Flask(__name__)
app.debug = True

@app.route("/", methods=['GET', 'POST'])
def hello():
    print "- " * 40
    print request.method, request.path
    print "ARGS:", request.args
    print "FORM:", request.form
    print "DATA:", repr(request.data)
    if request.args.get('sleep'):
        sec = int(request.args['sleep'])
        print "** Sleeping for", sec, "seconds"
        time.sleep(sec)
        print "** Done sleeping."
    if random.randint(1, 5) == 1:
        abort(500)
    elif random.randint(1, 5) == 1:
        # really get it stuck now
        time.sleep(20)
    return "OK"

if __name__ == "__main__":
    app.run()

Now you get an insight into what gets posted and you can pretend that it's slow to respond. Also, you can get an insight into how your app behaves when this collection destination throws a 5xx error.

How to really test it

Google Analytics is tricky to test in that they collect all the stuff they collect then they take their time to process it and it then shows up the next day as stats. But, there's a hack! You can go into your Google Analytics account and click "Real-Time" -> "Overview" and you should see hits coming in as you're testing this. Obviously you don't want to do this on your real production account, but perhaps you have a stage/dev instance you can use. Or, just be patient :)

Whatsdeployed on only one site

26 February 2016 0 comments   Python, Web development, Mozilla

https://whatsdeployed.io/


Last year I developed a web app called "Whatsdeployed". It's one of those rare one-afternoon-hacks that turns out to be really really useful. I use it every [work]day. And I've heard many people say they use it too.

At the time I built it, it only supported comparing multiple instance. E.g. a production and a dev site. Or a test, stage and production. But oftentimes, especially for smaller projects, you might only just have your one deployed site.

So I've now made it possible so you can compare just 1 site against your github.com master branch.

For example: whatsdeployed.io/s-Sir

Or whatsdeployed.io/s-J14

What these do, is simply comparing what git sha revision is deployed on those side-projects, compared to the latest git sha on the master branch on github.com.

A quicksearch for Bugzilla using Autocompeter

27 January 2016 0 comments   Python, Web development, Mozilla, Javascript

http://codepen.io/peterbe/pen/adGNZr


Here's the final demo.

What I did was, I used the Bugzilla REST APIs to download all bugs for a specific product. Then I bulk-uploaded then to Autocompeter.com and lastly built a simply web front-end.

When you "download all" bugs with the Bugzilla REST API, it might be capped but I don't know what the limit is. The trick is to not download ALL bugs for the product in one big fat query, but to find out what all components are for that product and then download for each. The Python code is here.

Everyone's Invited to Play

So first you need to sign in on https://autocompeter.com using your GitHub account. Then you can generate a Auth-Key by picking a domain. The domain can be anything really. I picked bugzilla.mozilla.org but you can use whatever you like.

Then, when you have an Auth-Key you need to know the name of the product (or products) and run the script like this:

python download.py 7U4eFYH5cqR15m3ekuxkzaUR Socorro

Once you've done that, fork my codepen and replace the domain and any other references to the product.

Caveats

To make this really useful, you'd have to run it more often. Perhaps you can hook it up to a cron job or something and make it so that you only download, from the REST API, things that have changed since the last time you did a big download. Then you can let the cron job run frequently.

If you want really hot results, you could hook up a server-side service that consumes the Bugzfeed websocket.

Last but not least; this will never list private/secure bugs. Only publically available stuff.

The Future

If people enjoy it perhaps we can change the front-end demo so it's not hardcoded to one specific product ("Socorro" in my case). And it can be made pretty.

And the data would need to be downloaded and re-submitted more frequently. A quick Heroku app mayhaps?

Whatsdeployed

11 November 2015 4 comments   Python, Web development, Mozilla

http://whatsdeployed.io/


Whatsdeployed was a tool I developed for my work at Mozilla. I think many other organizations can benefit from using it too.

So, on many sites, what we do when deploying a site, is that we note which git sha was used and write that to a file which is then exposed via the web server. Like this for example. If you know that sha and what's at the tip of the master branch on the project's GitHub page, you can build up an interesting dashboard that allows you to see what's available and what's been deployed.

Sample Whatsdeployed screen for the Mozilla Socorro project
The other really useful case is when you have more than just one environment. For example, you might have a dev, stage and prod environment and, always lastly, the master branch on GitHub. Now you can see what code has been shipped on prod versus your staging environment for example.

This is one of those far too few projects that you build quickly one Friday afternoon and it turns out to be surprisingly useful to a lot of people. I for one, check various projects like this several times per day.

The code is on GitHub and it's basically a tiny bit of Flask with some jQuery doing a couple of AJAX requests. If you enjoy it and use it, please share.

mozjpeg installation and sample

10 October 2015 3 comments   Linux, Web development, Mozilla

https://github.com/mozilla/mozjpeg


I've written about mozjpeg before where I showed what it can do to a sample directory full of different kinds of JPEGs. But let's get more real. Let's actually install it and look at one thumbnail and one big photo.

To install, I used the pre-compiled binaries from this wonderful site. Like this:

# wget http://mozjpeg.codelove.de/bin/mozjpeg_3.1_amd64.deb
# dpkg -i mozjpeg_3.1_amd64.deb
# ls -l /opt/mozjpeg/bin/cjpeg
-rwxr-xr-x 1 root root 50784 Sep  3 19:03 /opt/mozjpeg/bin/cjpeg

I don't know why the binary executable becomes called cjpeg but that's fine. Let's put it in $PATH so other users can execute it:

# cd /usr/local/bin
# ln -s /opt/mozjpeg/bin/cjpeg

Now, let's actually use it for something. First we need a realistic lossy thumbnail that we can optimize.

$ wget http://cdn-2916.kxcdn.com/static/cache/eb/f0/ebf08e64e80170dc009e97f6f9681ceb.jpg

This was one of the thumbnails from a previous post called Panasonic Lumix from 2008 or a iPhone 5S from 2014.

Let's optimize!

$ jpeg -outfile ebf08e64e80170dc009e97f6f9681ceb.moz.jpg -optimise ebf08e64e80170dc009e97f6f9681ceb.jpg
$ ls -l ebf08e64e80170dc009e97f6f9681ceb.*
-rw-rw-r-- 1 django django 11391 Sep 26 17:04 ebf08e64e80170dc009e97f6f9681ceb.jpg
-rw-r--r-- 1 django django  9414 Oct 10 01:40 ebf08e64e80170dc009e97f6f9681ceb.moz.jpg

Yay! It's 17.4% smaller. Saving 1.93Kb.

So what do they look like? See for yourself:

I have to zoom in (⌘-+) 3 times until I can see any difference. But remember, the saving isn't massive but the usecase here is a thumbnail.

So, let's do the same with a non-thumbnail. Some huge JPEG.

$ time cjpeg -outfile Lumix-2.moz.jpg -optimise Lumix-2.jpg
real    0m3.285s
user    0m3.122s
sys     0m0.080s
$ ls -l Lumix*
-rw-rw-r-- 1 django django 4880446 Sep 26 17:20 Lumix-2.jpg
-rw-rw-r-- 1 django django 1546978 Oct 10 02:02 Lumix-2.moz.jpg
$ ls -lh Lumix*
-rw-rw-r-- 1 django django 4.7M Sep 26 17:20 Lumix-2.jpg
-rw-rw-r-- 1 django django 1.5M Oct 10 02:02 Lumix-2.moz.jpg

In other words, from 4.7Mb to 1.5Mb. It's 68.3% the size of the original. And the visual difference?

Again, I have to zoom in 3 times to be able to tell any difference and even when I've done that it's hard to tell which is which.

In conclusion, let's go ahead and use mozjpeg to optimize thumbnails.

Examples of mozjpeg savings

01 September 2015 5 comments   Web development, Mozilla


I'm currently working on a Django library that uses mozjpeg to optimize thumbnails that are generated from stored images. I first wanted to get a feel for how good mozjpeg really is.

In my ~/Downloads directory I have all sorts of "junk" from all sorts of saves and experiments. It'll work as a good testbed of relatively random JPEG images of all sorts of sizes and qualities. Without further ado, here's the results:

FILENAME                                          OPTIMIZE   ORIGINAL     SAVING  PERCENT
-----------------------------------------------------------------------------------------
180697_1836563311933_3364808_n.jpg                  45.2Kb     50.4Kb      5.1Kb    10.2%
2014-03-20 17.35.39.jpg                           2040.1Kb   2207.8Kb    167.7Kb     7.6%
2015-03-04 21.18.16.jpg                           1521.5Kb   1629.2Kb    107.7Kb     6.6%
2015-03-04 21.19.16.jpg                           1602.4Kb   1720.0Kb    117.6Kb     6.8%
2015-03-04 21.23.16.jpg                           1181.7Kb   1272.1Kb     90.4Kb     7.1%
2015-03-05 06.03.00.jpg                           1426.7Kb   1557.7Kb    131.0Kb     8.4%
20150626_200629_001.jpg                           1566.4Kb   1717.3Kb    151.0Kb     8.8%
20150626_200631.jpg                               2157.6Kb   2319.6Kb    162.0Kb     7.0%
Boba_Fett_by_RobD4E.jpg                             96.2Kb    104.3Kb      8.1Kb     7.8%
Horse_Play.jpg                                     170.4Kb    185.2Kb     14.9Kb     8.0%
Image (107).jpg                                    344.9Kb    390.6Kb     45.7Kb    11.7%
Misc Candle Holder NECA FOTR Balrog Dec2002.jpg     37.1Kb     37.7Kb      0.6Kb     1.5%
Mozilla_Lightbeam.jpg                               55.1Kb     79.7Kb     24.6Kb    30.8%
Photo on 12-17-14 at 5.55 PM.jpg                   168.5Kb    187.7Kb     19.2Kb    10.2%
dev.jpg                                             17.5Kb     30.8Kb     13.3Kb    43.2%
dev2.jpg                                            41.1Kb     54.3Kb     13.3Kb    24.4%
dev3.jpg                                            35.3Kb     49.0Kb     13.7Kb    28.0%
dev4.jpg                                            42.0Kb     56.0Kb     14.0Kb    25.0%
dev5.jpg                                            24.6Kb     37.9Kb     13.2Kb    35.0%
dev6.jpg                                            28.9Kb     42.8Kb     13.9Kb    32.4%
hr_0570_220_135__0570220135006.jpg                3124.3Kb   3467.8Kb    343.5Kb     9.9%
hr_0570_220_158__0570220158006.jpg                3010.0Kb   3319.1Kb    309.1Kb     9.3%
hr_0570_220_175__0570220175006.jpg                2245.5Kb   2442.6Kb    197.0Kb     8.1%
hr_0570_227_599__0570227599006.jpg                2561.7Kb   2809.8Kb    248.1Kb     8.8%
hr_0596_622_701__0596622701006.jpg                3238.8Kb   3453.6Kb    214.7Kb     6.2%
hr_0596_623_849__0596623849006.jpg                2902.9Kb   3102.1Kb    199.3Kb     6.4%
hr_0622_219_873__0622219873006.jpg                 985.3Kb   1066.9Kb     81.7Kb     7.7%
logo.jpg                                            43.5Kb     51.2Kb      7.7Kb    15.1%
mvm-header.jpg                                       8.5Kb     12.4Kb      3.9Kb    31.6%
mvm-postcard-picture.jpg                            72.2Kb     73.4Kb      1.3Kb     1.7%
overhang_pixels.jpg                               3014.3Kb   3370.8Kb    356.4Kb    10.6%
peterbe copy.jpg                                     4.2Kb     10.4Kb      6.2Kb    59.7%
peterbe.jpg                                         36.7Kb     44.3Kb      7.5Kb    17.0%
pjt-mcguinty-2.jpg                                  96.8Kb    101.6Kb      4.8Kb     4.8%
sl1.jpg                                             28.7Kb     35.4Kb      6.7Kb    18.9%

That's an median of 9.3% (average of 15.3%) savings.

It's not very fast though. Some of the large files take more than a second. In total it took 23.7 seconds to create all of those optimized files. Do what you want with that fact, bear in mind that these are hopefully "once in a lifetime" operations (depending on the ephemerality of your thumbnail storage). Mind you, the really large JPEGs skew that since the median is 72.1 milliseconds and average is 527.0 milliseconds. Also, when I look through the numbers I find that the large JPGs take the longest but had the least benefit in terms of byte savings.

UPDATE

Chris Adams, in the comment below, inspired me to compare my trials with jpegoptim and jpegrescan. So, I took my script that generated a directory of 45 JPEGs and changed it to use jpegoptim and jpegrescan.

The mozjpeg total size of that output directory is 34.1Mb and it took a total of 23.3 seconds (median 76.4 milliseconds).

The jpegoptim & jpegrescan total size of that output directory is 35.6Mb and it took a total of 4.6 seconds (median 32.1 milliseconds).

In other words, roughly speaking mozjpeg is 4.2% more space effective and 58% slower than jpegoptim & jpegrescan.

Crash-stats just became a whole lot faster

25 August 2015 0 comments   Web development, Mozilla


tl;dr Crash-stats is Mozilla's crash reporter dashboard. Simply fixing the static assets made the site 25% faster.

Before http://www.webpagetest.org/result/150820_X5_V5T/

After http://www.webpagetest.org/result/150824_7F_1C3Q/

(The "First Byte Time" is still terrible but that's for another discussion. We're working on a re-write of the underlying data model for that particular report.)

The only thing we changed was a long overdue correction of static asset headers and Gzip compression. Now, files with unique URLs (e.g. /static/CACHE/css/23a811f100bc.css) have maximum aggressive cache headers. And now all .js, .css and text/html is Gzipped.

Was it easy to do? Hell no!
Does it matter? Hell yeah! We don't have a lot of users or traffic on these reports but the people who use them do this for a living and making the site feel snappier for them would make their lives more productive.