6923/feature/implementing fuzzy search #8873
benbdeitch wants to merge 24 commits into internetarchive:master
… search fields, specified on call.
…stant field introduced in worksearch, and a new function in openlibrary.solr.query_utils.py.
…stant field introduced in worksearch, and a new function in openlibrary.solr.query_utils.py.
… search fields, specified on call.
…enbdeitch/openlibraryfork into 6923/feature/implementing-fuzzy-search
Thanks so much for working on this. This would honestly save me such a headache if this got merged in soon. Wish I could contribute more, but I am just a mere consumer of the JSON API.
It was honestly my pleasure, and I'm glad to hear that it'll help you out! I think there are currently some Solr performance issues being investigated, but I imagine it'll get merged in relatively soon.
cdrini
left a comment
Looks good! A few code changes. Once #8821 is merged, I'll put this up on testing and A/B compare against prod using https://docs.google.com/spreadsheets/d/1BN5I7-OkTPaoTr2Es6jQ4O9ICWFmH0q9CP6kEgolCgg/edit#gid=1006480604 to see how we're doing!
(Note to self: Depending on results of testing, we might want to only fuzz strings that are greater than length 2)
…which will cause the traversal to not iterate further past objects of that type. This is needed in order to properly fuzzy out only the desired search fields, and should not affect other functions, due to it being initialized to 'None' by default.
…ld_traverse function.
…lemented some unit testing.
…nd refactored accordingly.
Ah shoot, I put this on testing but it's causing a lot of timeouts, likely because of #5480, since the search result sets are much larger now. I think this is blocked until that issue is resolved.
What's the status of this one?
I followed a link here from a blog post which said this was done, but it seems to have been abandoned a couple of years ago. What's the actual status? This would be much better than the current fallback to full text search which is almost never what the user wants when they've simply made a typo.
Ah, that was an error in the post; I corrected it. My last comment here likely still stands: when we last tested it, it had a significant performance impact, so I'm not sure if fuzzy search is a tenable solution to this problem. Closing this PR for now, since there isn't really a step forward with this approach. Our best bet would likely be to evaluate Solr's "Did you mean" support and compare its performance, or to investigate solutions to #5480.
Closes #6923
This PR implements fuzzy searching on Solr for fieldless searches, as well as for searches on the 'title', 'alternative_title', 'author', and 'text' fields. Notably, this is currently only active for work searches (and related edition queries), but it can easily be extended to author searches, etc., if requested.
Technical
Queries of this sort will match single word terms that can be transformed into one another by up to two operations of insertion, substitution, or deletion. This sort of functionality is built into Solr, and as such, editing the way in which we write queries was the only change needed.
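To make "up to two operations" concrete, here is a minimal sketch of the edit distance described above, counting insertions, substitutions, and deletions. It only illustrates the metric; it is not how Solr evaluates fuzzy queries internally, and the function names `edit_distance` and `within_fuzzy_range` are made up for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    # prev[j] holds the distance between the first i-1 chars of `a`
    # and the first j chars of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


def within_fuzzy_range(a: str, b: str, max_edits: int = 2) -> bool:
    """True if `a` and `b` are at most `max_edits` operations apart."""
    return edit_distance(a, b) <= max_edits
```

Under this metric, a typo such as "hobit" is one edit away from "hobbit", so a fuzzy term would still match it, while "kitten" and "sitting" are three edits apart and would not.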
Unfortunately, there are some limitations to fuzzy searching: it does not work on phrases, that is, a series of words wrapped in quotation marks. Fortunately, this preserves the behavior users expect from many search engines, where quoting is a means of requesting only exact matches.
In order to ease editing, a new constant in the Work SearchScheme was introduced to control which fields will be fuzzy when queried.
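As a rough illustration of the idea (not the code in this PR), the sketch below appends Solr's `~` fuzzy operator to single-word terms, skips quoted phrases, and only touches fields listed in a constant. The names `FUZZY_FIELDS` and `fuzzify` are hypothetical stand-ins, and the regex tokenizer is a simplification; the actual change works on the parsed query tree in `openlibrary.solr.query_utils`.

```python
import re

# Hypothetical stand-in for the constant described above.
FUZZY_FIELDS = {"title", "alternative_title", "author", "text"}
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}

# A token is an optional `field:` prefix followed by either a quoted
# phrase or a run of non-whitespace characters.
TOKEN_RE = re.compile(r'(?:(?P<field>\w+):)?(?P<value>"[^"]*"|\S+)')


def fuzzify(query: str) -> str:
    """Append `~` to single-word terms that are fieldless or in FUZZY_FIELDS."""
    out = []
    for match in TOKEN_RE.finditer(query):
        field = match.group("field")
        value = match.group("value")
        token = match.group(0)
        is_phrase = value.startswith('"')  # phrases stay exact
        is_operator = field is None and value in BOOLEAN_OPERATORS
        field_is_fuzzy = field is None or field in FUZZY_FIELDS
        already_fuzzy = value.endswith("~")
        if field_is_fuzzy and not (is_phrase or is_operator or already_fuzzy):
            token += "~"
        out.append(token)
    return " ".join(out)


print(fuzzify('title:hobit author:"J R R Tolkien" subject:fantasy tolkein'))
```

In this sketch the `title` term and the fieldless term gain a `~`, while the quoted author phrase and the `subject` term (not in the constant) are left untouched. Keeping the field list in one constant means extending fuzziness to another field is a one-line change.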
Testing
Simply search on Solr, under the books tag. Terms without a field will be affected, as will terms in the fields mentioned above.
Screenshot
Stakeholders
@cdrini