6923/feature/implementing fuzzy search #8873

Closed
benbdeitch wants to merge 24 commits into internetarchive:master from benbdeitch:6923/feature/implementing-fuzzy-search

Conversation

@benbdeitch
Collaborator

Closes #6923

This PR implements fuzzy searching on Solr for fieldless searches, as well as for searches on the 'title', 'alternative_title', 'author', and 'text' fields. It is currently only active for work searches (and related edition queries), but it can easily be extended to author searches, etc., if requested.

Technical

Queries of this sort match single-word terms that can be transformed into one another by up to two insertions, substitutions, or deletions. This functionality is built into Solr, so the only change needed was to the way we write queries.
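To make the "up to two operations" rule concrete, here is a minimal sketch of plain Levenshtein distance in Python. It is an illustration only and not code from this PR; the function names are hypothetical. Solr's own fuzzy matching is documented as Damerau-Levenshtein, so it may also count a swap of two adjacent characters as a single edit, which this sketch does not do.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete ca
                    curr[j - 1] + 1,     # insert cb
                    prev[j - 1] + cost,  # substitute (or keep) the character
                )
            )
        prev = curr
    return prev[-1]


def within_fuzzy_range(term: str, candidate: str, max_edits: int = 2) -> bool:
    """True if candidate is reachable from term in at most max_edits edits."""
    return edit_distance(term, candidate) <= max_edits
```

Under this definition, "hobit" is one edit away from "hobbit", so a fuzzy query for the misspelling would still reach the intended term.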

Unfortunately, fuzzy searching has a limitation: it does not work with phrases, i.e. a series of words wrapped in quotation marks. Fortunately, this preserves the behaviour users expect from many search engines, where quoting a phrase requests exact matches only.
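A rough sketch of the idea, assuming a simple regex tokenizer rather than the query-tree traversal this PR actually uses: bare words get Solr's `~` fuzzy operator appended, while quoted phrases and boolean operators pass through untouched. The names `fuzzify`, `TOKEN_RE`, and `OPERATORS` are hypothetical.

```python
import re

# A token is either a double-quoted phrase or a run of non-whitespace characters.
TOKEN_RE = re.compile(r'"[^"]*"|\S+')

# Boolean operators must not be turned into fuzzy terms.
OPERATORS = {"AND", "OR", "NOT"}


def fuzzify(query: str) -> str:
    """Append '~' to bare words; leave quoted phrases and operators as-is."""
    out = []
    for token in TOKEN_RE.findall(query):
        if token.startswith('"') or token in OPERATORS:
            out.append(token)
        else:
            out.append(token + "~")
    return " ".join(out)
```

For example, `fuzzify('hobit "lord of the rings"')` fuzzes only the first word, so the quoted phrase remains an exact-match request. This sketch ignores field prefixes, parentheses, and escaping, which a real query parser has to handle.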

To make this easy to adjust, a new constant was introduced in WorkSearchScheme to control which fields are fuzzy when queried.
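As an illustration of how such a constant might gate the behaviour, here is a minimal sketch. The constant name `FUZZY_FIELDS` and the helper `render_term` are assumptions made for this example and are not taken from the PR's code; only the four field names come from the description above.

```python
from typing import Optional

# Hypothetical constant: the fields whose terms are allowed to be fuzzy.
FUZZY_FIELDS = frozenset({"title", "alternative_title", "author", "text"})


def render_term(field: Optional[str], term: str) -> str:
    """Render one query term, adding '~' only where fuzziness is allowed."""
    is_phrase = term.startswith('"')
    # Fieldless terms and whitelisted fields are fuzzy; phrases never are.
    fuzzy = (field is None or field in FUZZY_FIELDS) and not is_phrase
    rendered = term + "~" if fuzzy else term
    return rendered if field is None else f"{field}:{rendered}"
```

With this shape, turning fuzziness on or off for a field is a one-line change to the set, and identifier-like fields such as `isbn` stay exact by default.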

Testing

Simply search on Solr, under the books tag. Terms without a field will be affected, as will terms in the fields mentioned above.

Screenshot

image

Stakeholders

@cdrini

benbdeitch and others added 17 commits December 8, 2023 16:47
…stant field introduced in worksearch, and a new function in openlibrary.solr.query_utils.py.
…stant field introduced in worksearch, and a new function in openlibrary.solr.query_utils.py.
…enbdeitch/openlibraryfork into 6923/feature/implementing-fuzzy-search
@natedunn

Thanks so much for working on this. This would honestly save me such a headache if this got merged in soon. Wish I could contribute more, but I am just a mere consumer of the JSON API.

@benbdeitch
Collaborator Author

It was honestly my pleasure, and I'm glad to hear that it'll help you out! I think there are currently some Solr performance issues being investigated, but I imagine it'll get merged in relatively soon.

@mekarpeles mekarpeles added the Priority: 2 Important, as time permits. [managed] label Mar 18, 2024
Collaborator

@cdrini cdrini left a comment

Looks good! A few code changes. Once #8821 is merged, I'll put this up on testing and A/B compare against prod using https://docs.google.com/spreadsheets/d/1BN5I7-OkTPaoTr2Es6jQ4O9ICWFmH0q9CP6kEgolCgg/edit#gid=1006480604 to see how we're doing!

(Note to self: Depending on results of testing, we might want to only fuzz strings that are greater than length 2)

…which will cause the traversal to not iterate further past objects of that type. This is needed in order to properly fuzzy out only the desired search fields, and should not affect other functions, due to it being initialized to 'None' by default.
@cdrini cdrini added Priority: 1 Do this week, receiving emails, time sensitive, . [managed] and removed Priority: 2 Important, as time permits. [managed] labels Jun 17, 2024
@cdrini
Collaborator

cdrini commented Jun 18, 2024

Ah shoot, I put this on testing but it's causing a lot of timeouts, likely because of #5480, since the search result sets are much larger now. I think this is blocked until that issue is resolved.

@cdrini cdrini added Priority: 2 Important, as time permits. [managed] State: Blocked Work has stopped, waiting for something (Info, Dependent fix, etc. See comments). [managed] and removed Priority: 1 Do this week, receiving emails, time sensitive, . [managed] labels Jun 18, 2024
@RomanKlimov

what's the status of this one?

@mekarpeles mekarpeles removed the Priority: 2 Important, as time permits. [managed] label Jun 9, 2025
@tfmorris
Contributor

tfmorris commented Mar 3, 2026

I followed a link here from a blog post which said this was done, but it seems to have been abandoned a couple of years ago. What's the actual status?

This would be much better than the current fallback to full text search which is almost never what the user wants when they've simply made a typo.

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Mar 4, 2026
@cdrini
Collaborator

cdrini commented Mar 5, 2026

Ah, that was an error in the post; I corrected it. My last comment here likely still stands: when we last tested it, it had a significant performance impact, so I'm not sure fuzzy search is a tenable solution to this problem. Closing this PR for now, since there isn't really a step forward with this approach. Our best bet would likely be to evaluate Solr's "Did you mean" support and compare its performance, or to investigate solutions to #5480.

@cdrini cdrini closed this Mar 5, 2026

Labels

Needs: Response (Issues which require feedback from lead)
State: Blocked (Work has stopped, waiting for something (Info, Dependent fix, etc. See comments). [managed])

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add support for solr spell checking ("Did you mean?")

6 participants