ElasticSearch, snowball analyzer and stop words

Friday, Sep 25, 2015
1 comment Python

Disclaimer: I'm an ElasticSearch noob. Go easy on me

I have an application that uses ElasticSearch's more_like_this query to find related content. It basically works like this:

>>> index(index, doc_type, {'id': 1, 'title': 'Your cool title is here'})
>>> index(index, doc_type, {'id': 2, 'title': 'About is a cool headline'})
>>> index(index, doc_type, {'id': 3, 'title': 'Titles are your big thing'})

Then you can pick one ID (1, 2 or 3) and find related ones.
We can tell by looking at these three silly examples, the 1 and 2 have the words "is" and "cool" in common. 1 and 3 have "title" (stemming taken into account) and "your" in common. However, is there much value in connected these documents on the words "is" and "your"? I think not. Those are stop words. E.g. words like "the", "this", "from", "she" etc. Basically words that are commonly used as "glue" between more unique and specific words.

Anyway, if you index something in ElasticSearch as a text field you get, by default, the "standard" analyzer to analyze the incoming stuff to be indexed. The standard analyzer just splits the words on whitespace. A more compelling analyzer is the Snowball analyzer (original here) which supports intelligent stemming (turning "wife" ~= "wives") and stop words.

The problem is that the snowball analyzer has a very different set of stop words. We did some digging and thought this was the list it bases its English stop words on. But this was wrong. Note that that list has words like "your" and "about" listed there.

The way to find out how your analyzer treats a string and turns it into token is to the the _analyze tool. For example:

curl -XGET 'localhost:9200/{myindexname}/_analyze?analyzer=snowball' -d 'about your special is a the word' | json_print
{
  "tokens": [
    {
      "end_offset": 5,
      "token": "about",
      "type": "<ALPHANUM>",
      "start_offset": 0,
      "position": 1
    },
    {
      "end_offset": 10,
      "token": "your",
      "type": "<ALPHANUM>",
      "start_offset": 6,
      "position": 2
    },
    {
      "end_offset": 18,
      "token": "special",
      "type": "<ALPHANUM>",
      "start_offset": 11,
      "position": 3
    },
    {
      "end_offset": 32,
      "token": "word",
      "type": "<ALPHANUM>",
      "start_offset": 28,
      "position": 7
    }
  ]
}

So what you can see is that it finds the tokens "about", "your", "special" and "word". But it stop word ignored "is", "a" and "the". Hmm... I'm not happy with that. I don't think "about" and "your" are particularly helpful words.

So, how do you define your own stop words and override the one in the Snowball analyzer? Well, let me show you.

In code, I use pyelasticsearch so the index creation is done in Python.


STOPWORDS = (
    "a able about across after all almost also am among an and "
    "any are as at be because been but by can cannot could dear "
    "did do does either else ever every for from get got had has "
    "have he her hers him his how however i if in into is it its "
    "just least let like likely may me might most must my "
    "neither no nor not of off often on only or other our own "
    "rather said say says she should since so some than that the "
    "their them then there these they this tis to too twas us "
    "wants was we were what when where which while who whom why "
    "will with would yet you your".split()
)

def create():
    es = get_connection()
    index = get_index()
    es.create_index(index, settings={
        'settings': {
            'analysis': {
                'analyzer': {
                    'extended_snowball_analyzer': {
                        'type': 'snowball',
                        'stopwords': STOPWORDS,
                    },
                },
            },
        },
        'mappings': {
            doc_type: {
                'properties': {
                    'title': {
                        'type': 'string',
                        'analyzer': 'extended_snowball_analyzer',
                    },
                }
            }
        }
    })

With that in place, now delete your index and re-create it. Now you can use the _analyze tool again to see how it analyzes text on this particular field. But note, to do this we need to know the name of the index we used. (so replace {myindexname} in the URL):

$ curl -XGET 'localhost:9200/{myindexname}/_analyze?field=title' -d 'about your special is a the word' | json_print
{
  "tokens": [
    {
      "end_offset": 18,
      "token": "special",
      "type": "<ALPHANUM>",
      "start_offset": 11,
      "position": 3
    },
    {
      "end_offset": 32,
      "token": "word",
      "type": "<ALPHANUM>",
      "start_offset": 28,
      "position": 7
    }
  ]
}

Cool! Now we see that it considers "about" and "your" as stop words. Much better. This is handy too because you might have certain words that are globally not very common but within your application it's very repeated and not very useful.

Thank you willkg and Erik Rose for your support in tracking this down!

Comments

Anonymous October 4, 2017

This was helpful. Thank you for sharing!

Previous:: django-semanticui-form September 14, 2015 Python, Django
Next:: Panasonic Lumix from 2008 or a iPhone 5S from 2014 September 26, 2015 Photos

Related by category:: Using AI to rewrite blog post comments November 12, 2025 Python; Comparison of speed between gpt-5, gpt-5-mini, and gpt-5-nano December 15, 2025 Python; Autocomplete using PostgreSQL instead of Elasticsearch December 18, 2025 Python; logger.error or logger.exception in Python March 6, 2026 Python

Related by keyword:: Elasticsearch memory usage December 11, 2025 Linux, Elasticsearch; When Docker is too slow, use your host January 11, 2018 Web development, Django, Docker, macOS; Synonyms with elasticsearch-dsl December 5, 2017 Python, Web development, PostgreSQL; Podcasttime.io - How Much Time Do Your Podcasts Take To Listen To? February 13, 2017 Python, Web development, Django, React, JavaScript

ElasticSearch, snowball analyzer and stop words

Comments

Related posts