Peter Bengtsson: Autocomplete using PostgreSQL instead of Elasticsearch

Here on my blog I have a site search. Before you search, there's autocomplete. The autocomplete is solved by usingdownshiftin React and on the backend, there's an API/api/v1/typeahead?q=bla. Up until today, that backend was powered by Elasticsearch. Now it's powered by PostgreSQL. Here's how I implemented it.

Indexing

A cron job loops over all titles in all blog posts and finds portions of the words in the titles as singles, doubles, and triples. For each one, the popularity of the blog post is accumulated to the extracted keywords and combos.

These are then inserted into a Django ORM model that looks like this:

classSearchTerm(models.Model): term = models.CharField(max_length=100, db_index=True) popularity = models.FloatField(default=0.0) add_date = models.DateTimeField(auto_now=True) index_version = models.IntegerField(default=0)classMeta: unique_together = ("term","index_version") indexes = [ GinIndex( name="plog_searchterm_term_gin_idx", fields=["term"], opclasses=["gin_trgm_ops"], ), ]

Theindex_versionis used like this, in the indexing code:

current_index_version = ( SearchTerm.objects.aggregate(Max("index_version"))["index_version__max"]or0) index_version = current_index_version +1... SearchTerm.objects.bulk_create(bulk) SearchTerm.objects.filter(index_version__lt=index_version).delete()

That means that I don't have to delete previous entries until new ones have been created. So if something goes wrong during the indexing, it doesn't break the API.
Essentially, there are about 13k entries in that model. For a very brief moment there are 2x13k entries and then back to 13k entries when the whole task is done.

Search

The search is done with theLIKEoperator.

peterbecom=# select term from plog_searchterm where term like 'za%'; term ----------------------------- zahid zappa zappa biography zappa biography barry zappa biography barry miles zappa blog (6 rows)

In Python, it's as simple as:

base_qs = SearchTerm.objects.all() qs = base_qa.filter(term__startswith=term.lower())

But suppose someone searches forbiowe want it to match things likefrank zappa biographyso what it actually does is:

fromdjango.db.modelsimportQ qs = base_qs.filter( Q(term__startswith=term.lower()) | Q(term__contains=f"{term.lower()}") )

Typo tolerance

This is done with the%operator.

peterbecom=# select term from plog_searchterm where term % 'frenk'; term -------- free frank freeze french (4 rows)

In the Django ORM it looks like this:

base_qs = SearchTerm.objects.all() qs = base_qs.filter(term__trigram_similar=term.lower())

And if that doesn't work, it gets even more desperate. It does this using thesimilarity()function. Looks like this in SQL:

peterbecom=# select term from plog_searchterm where similarity(term, 'zuppa') > 0.14; term ------------------- frank zappa zappa zappa biography radio frank zappa frank zappa blog zappa blog zurich (7 rows)

Note on typo tolerance

Most of the time, the most basic query works and yields results. I.e. the.filter(term__startswith=term.lower())query.
It's only if it yields fewer results than the pagination size. That's why the fault tolerance query is only-if-needed. This means, it might send 2 SQL select queries from Python to PostgreSQL. In Elasticsearch, you usually don't do this. You send multiple queries and boost the differently.

Itcanbe done with PostgreSQL too using anUNIONoperator so that you sendonebutmore complexquery.

Speed

It's hard to measure the true performance of these things because they're so fast that it's more about the network speed.

On my fast MacBook Pro M4, I ran about 50 realistic queries and measured the time it took each with this new PostgreSQL-based solution versus the previous Elasticsearch solution. They both take about 4ms per query. I suspect, that 90% of that 4ms is serialization & transmission, and not much time inside the database itself.

The number of rows it searches is only, at the time of writing, 13,000+ so it's hard to get a feel for how much faster Elasticsearch would be than PostgreSQL. But with a GIN index in PostgreSQL, it would have to scale much much larger to feel too slow.

About Elasticsearch

Elasticsearch is better than PostgreSQL at full-text search, including n-grams. Elasticsearch is highly optimized for these kinds of things and has powerful ways that you can make a query be aproductof how well it matched with each entry's popularity. With PostgreSQL that gets difficult.

But PostgreSQL is simple. It's solid and it doesn't take up nearlyas much memory as Elasticsearch.

http://www.peterbe.com/plog/autocomplete-using-postgresql-instead-of-elasticsearch

Follow us:

Applications

COMPANY

PRODUCTS

COMMUNITY

CHOOSE LANGUAGE