About 2 years ago I launched Autocompeter.com. It was two parts:
2) A REST API where you can submit titles with an HTTP header key, and a fancy autocomplete search.
The second part has now been completely re-written. The server was originally written in Go and used Redis. Now it's Django and ElasticSearch.
The ultimate reason for this was that Redis was, by far, the biggest memory consumer on my shared DigitalOcean server. The way it worked was that every prefix of every word in every title was indexed as a key. For example the words
peter$ are all keys and they point to an array of IDs that you then look up to get the distinct set of titles and their URLs. This makes it really fast, but since Redis doesn't support namespaces or multiple columns, every prefix key needs a prefix of its own for the domain it belongs to. So the hash for
eb9f747 so the strings to store are instead
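To make the Redis scheme described above concrete, here is a minimal sketch of the prefix indexing in Python (the original server was written in Go), with a plain dict standing in for Redis. The function names, the use of MD5 and the 7-character hash length are assumptions for illustration only; the real key layout and hash function may well differ.

```python
import hashlib
from collections import defaultdict


def domain_prefix(domain):
    # Assumption: a short hash of the domain namespaces every key.
    # The hash function and length used by the real server are not known.
    return hashlib.md5(domain.encode("utf-8")).hexdigest()[:7]


def word_prefixes(title):
    # Yield every prefix of every word, then the whole word with a
    # trailing "$" that marks an exact word match.
    for word in title.lower().split():
        for end in range(1, len(word) + 1):
            yield word[:end]
        yield word + "$"


def index_title(index, domain, title_id, title):
    # Store the title ID under every namespaced prefix key.
    ns = domain_prefix(domain)
    for prefix in word_prefixes(title):
        index[ns + prefix].add(title_id)


def lookup(index, domain, term):
    # Return the IDs stored under one namespaced prefix key.
    return index.get(domain_prefix(domain) + term.lower(), set())
```

Every title contributes one key per prefix per word, and each of those keys carries the domain hash again, which is why the memory use grows so quickly.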
ElasticSearch on the other hand has ALL of this built in deep in Lucene. AND you can filter. So the way it's queried now instead is something like this:
    search = TitleDoc.search()
    search = search.filter('term', domain=domain.name)
    search = search.query(Q('match_phrase', title=request.GET['q']))
    search = search.sort('-popularity', '_score')
    search = search[:size]
    response = search.execute()
    ...
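For readers who think in raw ElasticSearch queries, the chained calls above correspond roughly to a request body like the one built below. This is a hand-written approximation, not the literal output of elasticsearch-dsl, and `build_search_body` is a hypothetical helper name.

```python
def build_search_body(domain_name, q, size=10):
    # Approximate request body for the filter + match_phrase + sort query.
    return {
        "query": {
            "bool": {
                # The filter restricts results to one domain without
                # affecting the relevance score.
                "filter": [{"term": {"domain": domain_name}}],
                # The scored part: phrase match against the title field.
                "must": [{"match_phrase": {"title": q}}],
            }
        },
        # Most popular first, then ties broken by relevance score.
        "sort": [{"popularity": {"order": "desc"}}, "_score"],
        "from": 0,
        "size": size,
    }
```

The domain filter does the job the per-domain key prefix did in Redis, but as a field on each document instead of a string prepended to every key.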
And here's how the mapping is defined:
    from elasticsearch_dsl import (
        DocType,
        Float,
        Text,
        Index,
        analyzer,
        Keyword,
        token_filter,
    )

    edge_ngram_analyzer = analyzer(
        'edge_ngram_analyzer',
        type='custom',
        tokenizer='standard',
        filter=[
            'lowercase',
            token_filter(
                'edge_ngram_filter', type='edgeNGram',
                min_gram=1, max_gram=20
            )
        ]
    )


    class TitleDoc(DocType):
        id = Keyword()
        domain = Keyword(required=True)
        url = Keyword(required=True, index=False)
        title = Text(
            required=True,
            analyzer=edge_ngram_analyzer,
            search_analyzer='standard'
        )
        popularity = Float()
        group = Keyword()
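To see why the title field uses an edge n-gram analyzer at index time but the standard analyzer at search time, here is a small pure-Python imitation of what gets indexed. It is only a sketch: the regex is a crude stand-in for Lucene's standard tokenizer, the function names are made up, and `matches` ignores word order and positions, which a real `match_phrase` query does not.

```python
import re


def edge_ngrams(token, min_gram=1, max_gram=20):
    # All leading substrings of the token between min_gram and max_gram long.
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def index_terms(title):
    # Index-time analysis: tokenize, lowercase, then expand to edge n-grams.
    terms = []
    for token in re.findall(r"\w+", title.lower()):
        terms.extend(edge_ngrams(token))
    return terms


def matches(title, query):
    # Search-time analysis: tokenize and lowercase only, no n-grams.
    # Each query token must equal one of the indexed prefixes.
    indexed = set(index_terms(title))
    query_tokens = re.findall(r"\w+", query.lower())
    return bool(query_tokens) and all(t in indexed for t in query_tokens)
```

Because the prefixes are generated once at index time, a search for "pet" is a plain term lookup, and "eter" does not match "Peter" since only leading substrings are stored.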
I'm learning ElasticSearch rapidly but I still feel like I have so much to learn. This solution I have here is quite good and I'm pretty happy with the results but I bet there's a lot of things I can learn to make it even better.
I actually had a lot of fun building the first server version of Autocompeter in Go but Django is just so many times more convenient. It's got management commands, ORM, authentication system, CSRF protection, awesome error reporting, etc. All built in! With Go I had to build everything from scratch.
elasticsearch-dsl I think it wouldn't be too hard to re-write the critical query API in Go or in something like Sanic for maximum performance.
Oh, one of the reasons I wanted to do this new server in Python is because I want to learn Docker better and in particular Docker with Python projects.
The project is now entirely contained in Docker so you can start PostgreSQL, ElasticSearch 5.1.1 and Django with docker-compose up. There might be a couple of things I've forgotten to document about how to configure things, but this is actually the first time I've developed something entirely in Docker.