Here on my blog I have a site search. Before you search, there's autocomplete. The autocomplete is implemented with downshift in React, and on the backend there's an API: /api/v1/typeahead?q=bla. Up until today, that backend was powered by Elasticsearch. Now it's powered by PostgreSQL. Here's how I implemented it.
Indexing
A cron job loops over the titles of all blog posts and extracts combinations of the words in each title as singles, doubles, and triples. For each extracted keyword and combo, the popularity of the blog post is accumulated onto it.
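Roughly speaking, the extraction works like this. The sketch below is only to illustrate the idea; the helper name and the example popularity value are made up, not the actual cron job code:

from collections import defaultdict

def extract_terms(title, popularity, accumulator):
    # Break the title into word combinations of length 1, 2, and 3
    words = title.lower().split()
    for n in (1, 2, 3):
        for i in range(len(words) - n + 1):
            combo = " ".join(words[i : i + n])
            # The blog post's popularity is added to every extracted combo
            accumulator[combo] += popularity

popularities = defaultdict(float)
extract_terms("Frank Zappa Biography", 3.7, popularities)
# popularities now maps e.g. "zappa" and "zappa biography" to 3.7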
These are then inserted into a Django ORM model that looks like this:
from django.contrib.postgres.indexes import GinIndex
from django.db import models


class SearchTerm(models.Model):
    term = models.CharField(max_length=100, db_index=True)
    popularity = models.FloatField(default=0.0)
    add_date = models.DateTimeField(auto_now=True)
    index_version = models.IntegerField(default=0)

    class Meta:
        unique_together = ("term", "index_version")
        indexes = [
            # GIN trigram index to speed up LIKE and % (similarity) queries
            GinIndex(
                name="plog_searchterm_term_gin_idx",
                fields=["term"],
                opclasses=["gin_trgm_ops"],
            ),
        ]
The index_version is used like this, in the indexing code:
from django.db.models import Max

# Each run of the indexing task creates a new, higher index_version
current_index_version = (
    SearchTerm.objects.aggregate(Max("index_version"))["index_version__max"]
    or 0
)
index_version = current_index_version + 1
...
SearchTerm.objects.bulk_create(bulk)
SearchTerm.objects.filter(index_version__lt=index_version).delete()
That means I don't delete the previous entries until the new ones have been created. So if something goes wrong during the indexing, it doesn't break the API.
In total, there are about 13k entries in that model. For a very brief moment there are 2x13k entries, and then it's back to 13k entries when the whole task is done.
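For illustration, here's a rough sketch of the whole swap, including how the bulk list might be built from the accumulated popularities. The function name and structure are my guesses, not the actual indexing code:

from django.db.models import Max

def reindex(popularities):
    # popularities: dict mapping each extracted term/combo to its accumulated popularity
    current_index_version = (
        SearchTerm.objects.aggregate(Max("index_version"))["index_version__max"] or 0
    )
    index_version = current_index_version + 1

    bulk = [
        SearchTerm(term=term, popularity=popularity, index_version=index_version)
        for term, popularity in popularities.items()
    ]
    SearchTerm.objects.bulk_create(bulk)

    # The old generation is only deleted once the new one is safely in place
    SearchTerm.objects.filter(index_version__lt=index_version).delete()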
Search
The search is done with the LIKE operator.
peterbecom=# select term from plog_searchterm where term like 'za%';
term
-----------------------------
zahid
zappa
zappa biography
zappa biography barry
zappa biography barry miles
zappa blog
(6 rows)
In Python, it's as simple as:
base_qs = SearchTerm.objects.all()
qs = base_qs.filter(term__startswith=term.lower())
But suppose someone searches for bio; we want it to match things like frank zappa biography, so what it actually does is:
from django.db.models import Q

qs = base_qs.filter(
    Q(term__startswith=term.lower()) | Q(term__contains=f" {term.lower()}")
)
Typo tolerance
This is done with PostgreSQL's % (trigram similarity) operator.
peterbecom=# select term from plog_searchterm where term % 'frenk';
term
--------
free
frank
freeze
french
(4 rows)
In the Django ORM it looks like this:
base_qs = SearchTerm.objects.all()
qs = base_qs.filter(term__trigram_similar=term.lower())
And if that doesn't work, it gets even more desperate and uses the similarity() function. In SQL it looks like this:
peterbecom=# select term from plog_searchterm where similarity(term, 'zuppa') > 0.14;
term
-------------------
frank zappa
zappa
zappa biography
radio frank zappa
frank zappa blog
zappa blog
zurich
(7 rows)
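In the Django ORM, one way to express that last-resort query is with TrigramSimilarity from django.contrib.postgres.search. This is a sketch of an equivalent query using the same 0.14 threshold as above, not necessarily the exact code:

from django.contrib.postgres.search import TrigramSimilarity

qs = (
    SearchTerm.objects.annotate(similarity=TrigramSimilarity("term", term.lower()))
    .filter(similarity__gt=0.14)
    .order_by("-similarity")
)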
Note on typo tolerance
Most of the time, the most basic query, i.e. the .filter(term__startswith=term.lower()) query, works and yields results.
It's only when that yields fewer results than the pagination size that the typo tolerance kicks in. That's why the typo tolerance query is only-if-needed. This means it might send 2 SQL SELECT queries from Python to PostgreSQL. In Elasticsearch, you usually don't do this; you send multiple queries in a single request and boost them differently.
It can be done with PostgreSQL too, using a UNION operator, so that you send one, but more complex, query.
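To make that only-if-needed flow concrete, here's a minimal sketch. The function name, the page size, and the ordering by popularity are my assumptions, not the actual view code:

from django.db.models import Q

def typeahead(term, page_size=10):
    term = term.lower()
    base = SearchTerm.objects.filter(
        Q(term__startswith=term) | Q(term__contains=f" {term}")
    ).order_by("-popularity")
    results = list(base[:page_size])

    if len(results) < page_size:
        # Only when the basic query comes up short, run the typo-tolerant query
        fuzzy = (
            SearchTerm.objects.filter(term__trigram_similar=term)
            .exclude(term__in=[r.term for r in results])
            .order_by("-popularity")
        )
        results += list(fuzzy[: page_size - len(results)])

    return results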
Speed
It's hard to measure the true performance of these things because they're so fast that it's more about the network speed.
On my fast MacBook Pro M4, I ran about 50 realistic queries and measured how long each took with this new PostgreSQL-based solution versus the previous Elasticsearch solution. They both take about 4ms per query. I suspect that 90% of that 4ms is serialization & transmission, and not much time inside the database itself.
The number of rows it searches is only, at the time of writing, 13,000+, so it's hard to get a feel for how much faster Elasticsearch would be than PostgreSQL. But with a GIN index in PostgreSQL, it would have to scale much, much larger before it starts to feel slow.
About Elasticsearch
Elasticsearch is better than PostgreSQL at full-text search, including n-grams. Elasticsearch is highly optimized for these kinds of things and has powerful ways to make a query's score a product of how well it matched and each entry's popularity. With PostgreSQL, that gets difficult.
But PostgreSQL is simple. It's solid and it doesn't take up nearly as much memory as Elasticsearch.