6923/feature/implementing fuzzy search #8873

benbdeitch · 2024-03-05T23:10:01Z

This PR implements fuzzy searching on Solr for fieldless searches, as well as for searches on the 'title', 'alternative_title', 'author', and 'text' fields. Notably, this is currently active only for work searches (and related edition queries), but it can easily be extended to author searches, etc., if requested.

Technical

Queries of this sort will match single-word terms that can be transformed into one another by up to two operations of insertion, substitution, or deletion. This functionality is built into Solr, so changing the way we write queries was the only change needed.
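To make the "up to two operations" rule concrete, here is a minimal sketch of plain Levenshtein edit distance. It is an illustration only, not how Solr evaluates fuzzy queries internally; the function names `edit_distance` and `is_fuzzy_match` are hypothetical and do not appear in this PR. In Solr's standard query syntax the same idea is expressed by appending `~` to a term (for example `hobit~`), which allows up to two edits by default.

```python
def edit_distance(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    # previous[j] holds the distance between the first i-1 chars of `a`
    # and the first j chars of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or keep a match)
                )
            )
        previous = current
    return previous[-1]


def is_fuzzy_match(term: str, candidate: str, max_edits: int = 2) -> bool:
    """Hypothetical helper: True when `candidate` is within `max_edits` of `term`."""
    return edit_distance(term, candidate) <= max_edits
```

Under this definition, `hobit` and `hobbit` are one insertion apart, so they would match, while `kitten` and `sitting` are three edits apart and would not.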

Unfortunately, there are some limitations to fuzzy searching: it does not work with phrases, that is, a series of words wrapped in quotation marks. Fortunately, this preserves the behavior expected from many search engines, where quoting is a means of requesting only exact matches.

To make this easier to adjust, a new constant was introduced in the Work SearchScheme to control which fields are fuzzy when queried.
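The behavior described above can be sketched roughly as follows. This is a simplified, string-based illustration and not the PR's implementation, which operates on a parsed luqum tree; the names `FUZZY_FIELDS`, `TOKEN_RE`, and `make_fuzzy` are hypothetical stand-ins. The sketch appends `~` to single-word terms that are fieldless or belong to a configured field, and leaves quoted phrases, boolean operators, and other fields untouched.

```python
import re

# Hypothetical stand-in for the constant in the Work SearchScheme;
# the real name and contents may differ.
FUZZY_FIELDS = frozenset({"title", "alternative_title", "author", "text"})
BOOLEAN_OPERATORS = frozenset({"AND", "OR", "NOT"})

# A token is either an (optionally fielded) quoted phrase,
# or a run of non-whitespace characters.
TOKEN_RE = re.compile(r'(?:\w+:)?"[^"]*"|\S+')


def make_fuzzy(query: str, fields: frozenset = FUZZY_FIELDS) -> str:
    """Append '~' to single-word terms that are fieldless or in `fields`."""
    result = []
    for token in TOKEN_RE.findall(query):
        field, separator, term = token.partition(":")
        if not separator:
            # No "field:" prefix, so the whole token is the term.
            field, term = None, token
        is_phrase = '"' in token
        is_operator = token in BOOLEAN_OPERATORS
        in_scope = field is None or field in fields
        # Only plain alphanumeric words are fuzzed; wildcards, ranges,
        # and grouping syntax are passed through unchanged.
        if is_phrase or is_operator or not in_scope or not term.isalnum():
            result.append(token)
        else:
            result.append(token + "~")
    return " ".join(result)
```

For example, `make_fuzzy('title:hobit author:"j r r tolkien"')` fuzzes only the unquoted title term, which mirrors the exact-match behavior of quoted phrases. Keeping the field list in one constant means a field can be opted in or out without touching the rewriting logic.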

Testing

Simply search on Solr, under the books tag. Terms without a field will be affected, as will terms in the fields mentioned above.

Screenshot

Stakeholders

@cdrini

natedunn · 2024-03-16T23:04:09Z

Thanks so much for working on this. This would honestly save me such a headache if this got merged in soon. Wish I could contribute more, but I am just a mere consumer of the JSON API.

benbdeitch · 2024-03-18T17:06:56Z

It was honestly my pleasure, and I'm glad to hear that it'll help you out! I think there are currently some Solr performance issues being investigated, but I imagine it'll get merged in relatively soon.

cdrini

Looks good! A few code changes. Once #8821 is merged, I'll put this up on testing and A/B compare against prod using https://docs.google.com/spreadsheets/d/1BN5I7-OkTPaoTr2Es6jQ4O9ICWFmH0q9CP6kEgolCgg/edit#gid=1006480604 to see how we're doing!

(Note to self: Depending on results of testing, we might want to only fuzz strings that are greater than length 2)

openlibrary/solr/query_utils.py

…which will cause the traversal to not iterate further past objects of that type. This is needed in order to properly fuzzy out only the desired search fields, and should not affect other functions, due to it being initialized to 'None' by default.

cdrini · 2024-06-18T21:27:08Z

Ah shoot, I put this on testing, but it's causing a lot of timeouts, likely because of #5480, since the search result sets are much larger now. I think this is blocked until that issue is resolved.

RomanKlimov · 2024-11-24T19:40:17Z

What's the status of this one?

tfmorris · 2026-03-03T23:53:34Z

I followed a link here from a blog post which said this was done, but it seems to have been abandoned a couple of years ago. What's the actual status?

This would be much better than the current fallback to full-text search, which is almost never what the user wants when they've simply made a typo.

cdrini · 2026-03-05T16:07:04Z

Ah, that was an error in the post; I corrected it. My last comment here likely still stands: when we last tested it, it had a significant performance impact, so I'm not sure if fuzzy search is a tenable solution to this problem. Closing this PR for now, since there isn't really a step forward with this approach. Our best bet would likely be to evaluate Solr's "Did you mean" support and compare its performance, or to investigate solutions to #5480.

benbdeitch and others added 17 commits December 8, 2023 16:47

Merge branch 'internetarchive:master' into master

df7b08a

Merge branch 'internetarchive:master' into master

c441caf

Merge branch 'master' of github.com:benbdeitch/openlibraryfork

bf3e39a

Merge branch 'master' of github.com:benbdeitch/openlibraryfork

50b1dc9

Merge branch 'internetarchive:master' into master

eb494cd

Merge branch 'master' of github.com:benbdeitch/openlibraryfork

fd02352

Added luqum_make_fuzzy, and adjusted it to function only on specified…

952c3ea

… search fields, specified on call.

Adjusted 'make fuzzy' function to properly obey all accepted behaviors.

28f162e

Implemented fuzzy search for worksearch.schemes.works.py, using a con…

60c380a

…stant field introduced in worksearch, and a new function in openlibrary.solr.query_utils.py.

Adjusted 'make fuzzy' function to properly obey all accepted behaviors.

b29100e

Implemented fuzzy search for worksearch.schemes.works.py, using a con…

e5f6372

…stant field introduced in worksearch, and a new function in openlibrary.solr.query_utils.py.

Merge branch 'internetarchive:master' into master

012030f

Added luqum_make_fuzzy, and adjusted it to function only on specified…

df91a5b

… search fields, specified on call.

Merge branch '6923/feature/implementing-fuzzy-search' of github.com:b…

e37f6ed

…enbdeitch/openlibraryfork into 6923/feature/implementing-fuzzy-search

Adjusted query parsing tests to account for new parsing rules.

4fcd7fb

Fixed additional error introduced into tests.

7593b77

Fixed error caused by accidentally changing format that luqum_replace…

90579b3

…_field returns

mekarpeles assigned cdrini Mar 11, 2024

mekarpeles added the Priority: 2 Important, as time permits. [managed] label Mar 18, 2024

cdrini requested changes Apr 3, 2024

benbdeitch added 7 commits April 4, 2024 11:48

Added additional test case.

5e4cd96

Renamed 'exclsion' to 'stop_at', and removed the deprecated luqum_fie…

6fe043e

…ld_traverse function.

Renamed search_fields argument in luqum_make_fuzzy to fields, and imp…

0aab3c1

…lemented some unit testing.

Altered the 'stop_at' argument to accept a set of types to halt at, a…

efb94ca

…nd refactored accordingly.

Fixed typo in the doctest strings.

3e1e4bf

Fixed slight error with the doctest string.

58ab4a1

mekarpeles mentioned this pull request May 7, 2024

Improve Search Results Experience when 0 results #1001

Closed

mekarpeles mentioned this pull request May 7, 2024

Improve Search Results Experience #9232

Open

14 tasks

cdrini added Priority: 1 Do this week, receiving emails, time sensitive, . [managed] and removed Priority: 2 Important, as time permits. [managed] labels Jun 17, 2024

cdrini added Priority: 2 Important, as time permits. [managed] State: Blocked Work has stopped, waiting for something (Info, Dependent fix, etc. See comments). [managed] and removed Priority: 1 Do this week, receiving emails, time sensitive, . [managed] labels Jun 18, 2024

mekarpeles removed the Priority: 2 Important, as time permits. [managed] label Jun 9, 2025

github-actions bot added the Needs: Response Issues which require feedback from lead label Mar 4, 2026

cdrini closed this Mar 5, 2026
