[feature](inverted index) Implement es-like boolean query #58545
airborne12 merged 1 commit into apache:master
zclllyybb left a comment:
Do we need some tests?

Temporary commit, code is not yet complete.
48331af to 431d2bf
run buildall
TPC-H: Total hot run time: 36138 ms
TPC-DS: Total hot run time: 179060 ms
ClickBench: Total hot run time: 27.34 s

run buildall
TPC-H: Total hot run time: 35414 ms
TPC-DS: Total hot run time: 178707 ms
ClickBench: Total hot run time: 27.18 s

run buildall
TPC-H: Total hot run time: 35988 ms
TPC-DS: Total hot run time: 178006 ms
ClickBench: Total hot run time: 27.47 s
8feba37 to f864ddb
run buildall
TPC-H: Total hot run time: 35117 ms
TPC-DS: Total hot run time: 177988 ms
ClickBench: Total hot run time: 27.57 s
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #58545

Problem Summary:

This PR introduces two new features for the SEARCH function:

#### 1. Lucene Boolean Mode

Adds a `mode` option to enable Lucene/Elasticsearch-style query parsing:

```sql
-- Enable Lucene mode via JSON options
SELECT * FROM docs WHERE search('apple AND banana', '{"default_field":"title","mode":"lucene"}');

-- With minimum_should_match
SELECT * FROM docs WHERE search('apple AND banana OR cherry', '{"default_field":"title","mode":"lucene","minimum_should_match":1}');
```

**Key differences from standard mode:**
- AND/OR/NOT work as left-to-right modifiers (not traditional boolean algebra)
- Uses MUST/SHOULD/MUST_NOT internally (like Lucene's Occur enum)
- Pure NOT queries return empty results (need positive clause)

**Behavior comparison:**

| Query | Standard Mode | Lucene Mode |
|-------|--------------|-------------|
| `a AND b` | a ∩ b | +a +b (both MUST) |
| `a OR b` | a ∪ b | a b (both SHOULD, min=1) |
| `NOT a` | ¬a | Empty (no positive clause) |
| `a AND NOT b` | a ∩ ¬b | +a -b (MUST a, MUST_NOT b) |
| `a AND b OR c` | (a ∩ b) ∪ c | +a b c (only a is MUST) |

#### 2. Escape Characters in DSL

Support for escaping special characters using backslash:

| Escape | Description | Example |
|--------|-------------|---------|
| `\ ` | Literal space | `title:First\ Value` matches "First Value" |
| `\(` `\)` | Literal parentheses | `title:hello\(world\)` matches "hello(world)" |
| `\:` | Literal colon | `title:key\:value` matches "key:value" |
| `\\` | Literal backslash | `title:path\\to\\file` matches "path\to\file" |
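The "left-to-right modifier" behavior and the backslash escapes described above can be illustrated with a small Python sketch. It is not the Doris parser: the names `Occur`, `tokenize`, `assign_occurs`, and `render` are hypothetical, and the rewrite rules (AND promotes the clause to its left to MUST, OR demotes it to SHOULD, a prohibited clause is never rewritten) are assumptions chosen only because they reproduce the five Lucene Mode rows of the behavior table. Field names, parentheses, and quoted phrases are deliberately left out.

```python
from enum import Enum


class Occur(Enum):
    MUST = "+"
    SHOULD = ""
    MUST_NOT = "-"


def tokenize(query: str) -> list[tuple[str, bool]]:
    """Split on unescaped whitespace; a backslash makes the next char literal.

    Returns (text, had_escape) pairs so that a token containing an escape
    is never treated as an operator keyword.
    """
    tokens, buf, escaped, i = [], [], False, 0
    while i < len(query):
        ch = query[i]
        if ch == "\\" and i + 1 < len(query):
            buf.append(query[i + 1])  # keep the escaped character verbatim
            escaped = True
            i += 2
            continue
        if ch.isspace():
            if buf:  # an unescaped space ends the current token
                tokens.append(("".join(buf), escaped))
                buf, escaped = [], False
        else:
            buf.append(ch)
        i += 1
    if buf:
        tokens.append(("".join(buf), escaped))
    return tokens


def assign_occurs(query: str) -> list[tuple[Occur, str]]:
    """Assign MUST/SHOULD/MUST_NOT to each term in a single left-to-right pass."""
    clauses: list[tuple[Occur, str]] = []
    conj, negate = None, False
    for text, had_escape in tokenize(query):
        if not had_escape and text in ("AND", "OR"):
            conj = text
            continue
        if not had_escape and text == "NOT":
            negate = True
            continue
        # Assumed rule: the conjunction rewrites the clause to its left,
        # unless that clause is already prohibited.
        if clauses and conj is not None and clauses[-1][0] is not Occur.MUST_NOT:
            prev = Occur.MUST if conj == "AND" else Occur.SHOULD
            clauses[-1] = (prev, clauses[-1][1])
        if negate:
            occur = Occur.MUST_NOT
        elif conj == "AND":
            occur = Occur.MUST
        else:
            occur = Occur.SHOULD
        clauses.append((occur, text))
        conj, negate = None, False
    return clauses


def render(clauses: list[tuple[Occur, str]]) -> str:
    """Print clauses in the +a -b notation used by the behavior table."""
    return " ".join(occur.value + text for occur, text in clauses)
```

For example, `render(assign_occurs("a AND b OR c"))` gives `+a b c`: the OR demotes `b` after the AND had promoted it, which is why only `a` stays MUST. A real parser would also split `field:value` on an unescaped colon, which this sketch does not attempt.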
…feature

This commit adds the necessary dependency files from PR #58545 to fix compilation errors in the cherry-picked PR #59394 (lucene bool mode for search function).

Changes include:
- Updated clucene submodule to include skipToBlock/nextDeltaPosition methods
- Added OccurBooleanQuery and related classes (occur.h, occur_boolean_query.h, occur_boolean_weight.h/cpp, boolean_query_builder.h)
- Moved operator.h to boolean_query/ directory and fixed include paths
- Updated function_search.h/cpp to use correct include paths
- Various query_v2 file updates for compatibility

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…uery to branch-4.0

Cherry-pick the full implementation and unit tests from PR #58545 to branch-4.0. Most source code was already added in previous commits as dependencies for PR #59394. This commit completes the cherry-pick by adding:

Source file fixes:
- regexp_weight.cpp: Fixed to use make_segment_postings() helper

Unit test files (new):
- boolean_query/boolean_query_builder_test.cpp: Tests for query builders
- buffered_union_test.cpp: Tests for BufferedUnion scorer
- disjunction_scorer_test.cpp: Tests for DisjunctionScorer
- exclude_scorer_test.cpp: Tests for ExcludeScorer
- occur_boolean_query_test.cpp: Tests for OccurBooleanQuery
- reqopt_scorer_test.cpp: Tests for ReqOptScorer

Unit test files (updated to PR version):
- boolean_query_test.cpp: Updated to use OperatorBooleanQueryBuilder
- intersection_test.cpp: Updated API calls
- segment_postings_test.cpp: Updated to PR version

All tests compile and pass verification.

Related PR: #58545

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Cherry-pick PR #58545 to branch-4.0.

This PR implements ES-like boolean query for inverted index, including:

Source files:
- OccurBooleanQuery and related classes for ES-style MUST/SHOULD/MUST_NOT
- OperatorBooleanQuery refactored to separate file
- DisjunctionScorer, ExcludeScorer, ReqOptScorer implementations
- BufferedUnion for efficient union operations
- Updated intersection and segment_postings APIs

Unit tests:
- boolean_query_builder_test.cpp
- buffered_union_test.cpp
- disjunction_scorer_test.cpp
- exclude_scorer_test.cpp
- occur_boolean_query_test.cpp
- reqopt_scorer_test.cpp
- Updated existing test files

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…58545 (#59766)

## Summary

Cherry-pick PR #58545 to branch-4.0.

This PR implements ES-like boolean query for inverted index, including:

**Source files:**
- OccurBooleanQuery and related classes for ES-style MUST/SHOULD/MUST_NOT
- OperatorBooleanQuery refactored to separate file
- DisjunctionScorer, ExcludeScorer, ReqOptScorer implementations
- BufferedUnion for efficient union operations
- Updated intersection and segment_postings APIs

**Unit tests:**
- boolean_query_builder_test.cpp
- buffered_union_test.cpp
- disjunction_scorer_test.cpp
- exclude_scorer_test.cpp
- occur_boolean_query_test.cpp
- reqopt_scorer_test.cpp
- Updated existing test files

## Test plan

- [x] BE unit tests compile successfully
- [x] Unit tests pass verification

Related PR: #58545

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
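The source names DisjunctionScorer, ExcludeScorer, and ReqOptScorer but does not describe them. Assuming they play the roles their names suggest in Lucene-style engines (union of optional clauses, required-minus-excluded, required-plus-optional), the way MUST/SHOULD/MUST_NOT clauses combine can be sketched with plain Python sets of document ids. The function `evaluate` and its parameters are hypothetical. The treatment of `minimum_should_match` follows Lucene's BooleanQuery semantics, which is an assumption about Doris. Real scorers stream postings and compute scores instead of materializing sets.

```python
def evaluate(must, should, must_not, minimum_should_match=0):
    """Combine clause results; each argument is a list of sets of doc ids."""
    if not must and not should:
        # No positive clause: a pure NOT query matches nothing.
        return set()

    # With no MUST clause, at least one SHOULD clause has to match.
    required_should = max(minimum_should_match, 0 if must else 1)

    if must:
        candidates = set.intersection(*must)  # every MUST clause matches
    else:
        candidates = set.union(*should)       # any SHOULD clause matches

    excluded = set().union(*must_not)         # any MUST_NOT clause matches

    result = set()
    for doc in candidates - excluded:
        hits = sum(1 for docs in should if doc in docs)
        if hits >= required_should:
            result.add(doc)
    return result
```

With `a = {1, 2, 3}`, `b = {2, 3, 4}`, and `c = {3, 5}`, the clause list `+a b c` evaluates to `{1, 2, 3}` when `minimum_should_match` is 0 and to `{2, 3}` when it is 1, since document 1 matches neither optional clause.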
…unction #59394 (#59745)

Cherry-picked from #59394

**Note:** This PR depends on #59766 (cherry-pick of #58545) being merged first.

## Summary

Introduce lucene bool mode for search function.

## Test plan

- [ ] Regression tests (after dependency PR merged)

Related PRs: #59394

Depends on: #59766

Co-authored-by: Jack <jiangkai@selectdb.com>
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)