[feature](search) introduce lucene bool mode for search function#59394
airborne12 merged 1 commit into apache:master
Performance results for the three `run buildall` invocations:

| Run | TPC-H total hot run time | TPC-DS total hot run time | ClickBench total hot run time |
|-----|--------------------------|---------------------------|-------------------------------|
| 1 | 36862 ms | 179700 ms | 28.42 s |
| 2 | 35186 ms | 179364 ms | 28.6 s |
| 3 | 34694 ms | 179520 ms | 28.45 s |

PR approved by at least one committer and no changes requested.

PR approved by anyone and no changes requested.
)

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #58545

Problem Summary:

This PR introduces two new features for the SEARCH function:

#### 1. Lucene Boolean Mode

Adds a `mode` option to enable Lucene/Elasticsearch-style query parsing:

```sql
-- Enable Lucene mode via JSON options
SELECT * FROM docs WHERE search('apple AND banana', '{"default_field":"title","mode":"lucene"}');

-- With minimum_should_match
SELECT * FROM docs WHERE search('apple AND banana OR cherry', '{"default_field":"title","mode":"lucene","minimum_should_match":1}');
```

**Key differences from standard mode:**

- AND/OR/NOT work as left-to-right modifiers (not traditional boolean algebra)
- Uses MUST/SHOULD/MUST_NOT internally (like Lucene's Occur enum)
- Pure NOT queries return empty results (need positive clause)

**Behavior comparison:**

| Query | Standard Mode | Lucene Mode |
|-------|--------------|-------------|
| `a AND b` | a ∩ b | +a +b (both MUST) |
| `a OR b` | a ∪ b | a b (both SHOULD, min=1) |
| `NOT a` | ¬a | Empty (no positive clause) |
| `a AND NOT b` | a ∩ ¬b | +a -b (MUST a, MUST_NOT b) |
| `a AND b OR c` | (a ∩ b) ∪ c | +a b c (only a is MUST) |

#### 2. Escape Characters in DSL

Support for escaping special characters using backslash:

| Escape | Description | Example |
|--------|-------------|---------|
| `\ ` | Literal space | `title:First\ Value` matches "First Value" |
| `\(` `\)` | Literal parentheses | `title:hello\(world\)` matches "hello(world)" |
| `\:` | Literal colon | `title:key\:value` matches "key:value" |
| `\\` | Literal backslash | `title:path\\to\\file` matches "path\to\file" |
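The Lucene-mode column of the behavior comparison table can be reproduced with a small left-to-right pass. The following is a minimal Python sketch, not the Doris parser: the names `Occur`, `assign_occurs`, `render`, and `matches` are hypothetical, and the rewrite rules (AND promotes the previous clause to MUST, OR demotes it to SHOULD, NOT prohibits the next term) are assumptions chosen so that the output agrees with the table. The default for `minimum_should_match` (0 when a MUST clause exists, otherwise 1) is likewise an assumption borrowed from Lucene's usual behavior.

```python
from enum import Enum


class Occur(Enum):
    MUST = "+"
    SHOULD = ""
    MUST_NOT = "-"


def assign_occurs(query):
    """Turn 'a AND b OR c' into [(term, Occur), ...], scanning left to right."""
    clauses = []
    conj = None      # operator seen since the previous term: "AND", "OR", or None
    negated = False  # True when NOT precedes the next term
    for token in query.split():
        if token in ("AND", "OR"):
            conj = token
            continue
        if token == "NOT":
            negated = True
            continue
        # Assumed rule: the operator also rewrites the previous clause,
        # unless that clause is already prohibited.
        if clauses and conj is not None and clauses[-1][1] is not Occur.MUST_NOT:
            prev_term = clauses[-1][0]
            clauses[-1] = (prev_term, Occur.MUST if conj == "AND" else Occur.SHOULD)
        if negated:
            occur = Occur.MUST_NOT
        elif conj == "AND":
            occur = Occur.MUST
        else:
            occur = Occur.SHOULD
        clauses.append((token, occur))
        conj, negated = None, False
    return clauses


def render(clauses):
    """Format clauses in the +a b -c notation used in the table."""
    return " ".join(occur.value + term for term, occur in clauses)


def matches(doc_terms, clauses, minimum_should_match=None):
    """Evaluate the clauses against a set of terms present in one document."""
    must = [t for t, o in clauses if o is Occur.MUST]
    should = [t for t, o in clauses if o is Occur.SHOULD]
    must_not = [t for t, o in clauses if o is Occur.MUST_NOT]
    if not must and not should:
        return False  # pure NOT query: no positive clause, so nothing matches
    if minimum_should_match is None:
        minimum_should_match = 0 if must else 1  # assumed default
    if any(t in doc_terms for t in must_not):
        return False
    if not all(t in doc_terms for t in must):
        return False
    return sum(t in doc_terms for t in should) >= minimum_should_match


if __name__ == "__main__":
    for q in ["a AND b", "a OR b", "NOT a", "a AND NOT b", "a AND b OR c"]:
        print(q, "->", render(assign_occurs(q)))
```

Under these assumptions, a document containing only `a` satisfies `a AND b OR c` in Lucene mode because `b` and `c` are optional once `a` is required, whereas standard mode would need either both `a` and `b`, or `c`.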
…feature

This commit adds the necessary dependency files from PR #58545 to fix compilation errors in the cherry-picked PR #59394 (lucene bool mode for search function).

Changes include:

- Updated clucene submodule to include skipToBlock/nextDeltaPosition methods
- Added OccurBooleanQuery and related classes (occur.h, occur_boolean_query.h, occur_boolean_weight.h/cpp, boolean_query_builder.h)
- Moved operator.h to boolean_query/ directory and fixed include paths
- Updated function_search.h/cpp to use correct include paths
- Various query_v2 file updates for compatibility

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…uery to branch-4.0

Cherry-pick the full implementation and unit tests from PR #58545 to branch-4.0. Most source code was already added in previous commits as dependencies for PR #59394. This commit completes the cherry-pick by adding:

Source file fixes:

- regexp_weight.cpp: Fixed to use make_segment_postings() helper

Unit test files (new):

- boolean_query/boolean_query_builder_test.cpp: Tests for query builders
- buffered_union_test.cpp: Tests for BufferedUnion scorer
- disjunction_scorer_test.cpp: Tests for DisjunctionScorer
- exclude_scorer_test.cpp: Tests for ExcludeScorer
- occur_boolean_query_test.cpp: Tests for OccurBooleanQuery
- reqopt_scorer_test.cpp: Tests for ReqOptScorer

Unit test files (updated to PR version):

- boolean_query_test.cpp: Updated to use OperatorBooleanQueryBuilder
- intersection_test.cpp: Updated API calls
- segment_postings_test.cpp: Updated to PR version

All tests compile and pass verification.

Related PR: #58545

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…unction #59394 (#59745)

Cherry-picked from #59394

**Note:** This PR depends on #59766 (cherry-pick of #58545) being merged first.

## Summary

Introduce lucene bool mode for search function.

## Test plan

- [ ] Regression tests (after dependency PR merged)

Related PRs: #59394

Depends on: #59766

Co-authored-by: Jack <jiangkai@selectdb.com>
… search function

Add documentation for two new features in the SEARCH function:

1. Lucene Boolean Mode:
   - JSON-based options parameter (mode, minimum_should_match)
   - Left-to-right modifier parsing (MUST/SHOULD/MUST_NOT)
   - Behavior comparison table with standard mode
2. Escape Characters:
   - Support for escaping special characters in DSL
   - Backslash escapes for space, parentheses, colon, backslash

Updated both English and Chinese versions of search-function.md.

Related PR: apache/doris#59394

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
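The backslash escapes this documentation commit covers (space, parentheses, colon, backslash) can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not the Doris DSL lexer: `unescape` and `split_field` are hypothetical helpers, and the rule "a backslash makes the next character literal" is inferred from the escape table in the PR description.

```python
def unescape(text):
    """Drop each escaping backslash and keep the character that follows it."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            out.append(text[i + 1])  # escaped character is taken literally
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def split_field(clause):
    """Split 'field:value' at the first colon that is not escaped."""
    i = 0
    while i < len(clause):
        if clause[i] == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if clause[i] == ":":
            return clause[:i], unescape(clause[i + 1:])
        i += 1
    return None, unescape(clause)  # no field prefix present


if __name__ == "__main__":
    # Raw strings keep the backslashes exactly as they would appear in the DSL.
    for clause in [r"title:First\ Value", r"title:hello\(world\)",
                   r"title:key\:value", r"title:path\\to\\file"]:
        print(clause, "->", split_field(clause))
```

Scanning for the first unescaped colon is what lets `title:key\:value` keep `key:value` as a single value instead of being read as a nested field reference.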
#59845)

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #59394

Problem Summary:

This PR adds `fields` and `type` parameters to the SEARCH function, allowing queries to search across multiple fields with a single query term. This is similar to Elasticsearch's multi_match query with `best_fields` and `cross_fields` types.

#### Multi-Field Search Support

```sql
-- Single term across multiple fields (best_fields mode - default)
SELECT * FROM docs WHERE search('hello', '{"fields":["title","content"]}');
-- Equivalent to: (title:hello) OR (content:hello)

-- Multi-term with AND operator (best_fields mode - default)
SELECT * FROM docs WHERE search('hello world', '{"fields":["title","content"],"default_operator":"and"}');
-- Equivalent to: (title:hello AND title:world) OR (content:hello AND content:world)

-- Multi-term with cross_fields mode
SELECT * FROM docs WHERE search('hello world', '{"fields":["title","content"],"default_operator":"and","type":"cross_fields"}');
-- Equivalent to: (title:hello OR content:hello) AND (title:world OR content:world)

-- Combined with Lucene mode
SELECT * FROM docs WHERE search('machine AND learning', '{"fields":["title","content"],"mode":"lucene","minimum_should_match":0}');
```

#### Type Parameter Options

| Type | Description | Behavior |
|------|-------------|----------|
| `best_fields` (default) | All terms must match within the **SAME** field | `"hello world"` → `(title:hello AND title:world) OR (content:hello AND content:world)` |
| `cross_fields` | Terms can match across **DIFFERENT** fields | `"hello world"` → `(title:hello OR content:hello) AND (title:world OR content:world)` |

**Key features:**

- `type` parameter controls how terms are matched across fields
- `best_fields` (default): Finds documents where all terms appear in the same field - ideal for relevance ranking
- `cross_fields`: Treats multiple fields as one big field - ideal for name searches across first_name/last_name
- Compatible with both standard mode and Lucene boolean mode
- `fields` and `default_field` are mutually exclusive
- Supports functions (EXACT, ANY, ALL) across fields
- Supports wildcard queries across fields

**Behavior examples:**

| Query | Fields | Type | Expanded DSL |
|-------|--------|------|--------------|
| `hello` | `["title","content"]` | best_fields | `(title:hello) OR (content:hello)` |
| `hello world` (AND) | `["title","content"]` | best_fields | `(title:hello AND title:world) OR (content:hello AND content:world)` |
| `hello world` (AND) | `["title","content"]` | cross_fields | `(title:hello OR content:hello) AND (title:world OR content:world)` |
| `EXACT(foo bar)` | `["title","content"]` | any | `(title:EXACT(foo bar) OR content:EXACT(foo bar))` |
| `hello AND category:tech` | `["title","content"]` | any | `(title:hello OR content:hello) AND category:tech` |

**Use case examples:**

- **Product search**: Use `best_fields` when searching product name and description - prefer products where query terms appear together
- **Person name search**: Use `cross_fields` when searching first_name and last_name - "John Smith" should match documents with `first_name:John` and `last_name:Smith`

### Release note

- Add multi-field search support for SEARCH function (`fields` parameter)
- Add `type` parameter with `best_fields` (default) and `cross_fields` modes
  - `best_fields`: All terms must match within the same field (default, matches Elasticsearch behavior)
  - `cross_fields`: Terms can match across different fields
- Compatible with Lucene mode for MUST/SHOULD/MUST_NOT semantics
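The difference between `best_fields` and `cross_fields` comes down to which loop is on the outside when the query is expanded. The Python sketch below reproduces the "Equivalent to" expansions quoted above for plain terms; `expand` is a hypothetical helper rather than Doris code, and it deliberately ignores functions such as EXACT, explicit `field:term` clauses, and wildcards.

```python
def expand(terms, fields, operator, match_type="best_fields"):
    """Expand plain terms over several fields into a single-field DSL string.

    operator is the value of default_operator ("and" or "or"); it joins the
    terms, while alternatives across fields are always joined with OR.
    """
    op = " AND " if operator.lower() == "and" else " OR "
    if match_type == "best_fields":
        # Outer loop over fields: every term is checked inside the same field.
        groups = [op.join(f"{field}:{term}" for term in terms) for field in fields]
        return " OR ".join(f"({group})" for group in groups)
    if match_type == "cross_fields":
        # Outer loop over terms: each term may be found in any of the fields.
        groups = [" OR ".join(f"{field}:{term}" for field in fields) for term in terms]
        return op.join(f"({group})" for group in groups)
    raise ValueError(f"unsupported type: {match_type}")


if __name__ == "__main__":
    fields = ["title", "content"]
    print(expand(["hello"], fields, "and"))
    print(expand(["hello", "world"], fields, "and"))
    print(expand(["hello", "world"], fields, "and", "cross_fields"))
```

With a single term both types produce the same set of matches, which is why the distinction only shows up in the multi-term rows of the behavior table.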
… search function (#3276)

Add documentation for two new features in the SEARCH function:

1. Lucene Boolean Mode:
   - JSON-based options parameter (mode, minimum_should_match)
   - Left-to-right modifier parsing (MUST/SHOULD/MUST_NOT)
   - Behavior comparison table with standard mode
2. Escape Characters:
   - Support for escaping special characters in DSL
   - Backslash escapes for space, parentheses, colon, backslash

Updated both English and Chinese versions of search-function.md.

Related PR: apache/doris#59394

## Versions

- [ ] dev
- [ ] 4.x
- [ ] 3.x
- [ ] 2.1

## Languages

- [ ] Chinese
- [ ] English

## Docs Checklist

- [ ] Checked by AI
- [ ] Test Cases Built

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…59747)

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #59394

Problem Summary:

The search DSL should only recognize uppercase `AND`, `OR`, `NOT` as boolean operators in the search function's Lucene boolean mode. Previously, lowercase `and`, `or`, `not` were also treated as operators, which does not conform to the specification.

This PR makes the boolean operators case-sensitive:

- Only uppercase `AND`, `OR`, `NOT` are recognized as operators
- Lowercase `and`, `or`, `not` are now treated as regular search terms
- Using lowercase operators in DSL will result in a parse error

### Release note

Make search DSL boolean operators (AND/OR/NOT) case-sensitive in Lucene boolean mode.
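The case-sensitivity rule amounts to an exact string comparison when tokens are classified. A minimal Python sketch follows, with `OPERATORS` and `tokenize` as hypothetical names; it only shows the operator-versus-term decision and does not model the parse error mentioned in the last bullet.

```python
# Only these exact spellings count as boolean operators.
OPERATORS = frozenset({"AND", "OR", "NOT"})


def tokenize(query):
    """Label each whitespace-separated token as an operator or a search term."""
    return [("OP" if token in OPERATORS else "TERM", token)
            for token in query.split()]


if __name__ == "__main__":
    print(tokenize("apple AND banana"))
    print(tokenize("apple and banana"))
```

Because the membership test is exact, `and` falls through to the term branch, so a query such as `apple and banana` contains three terms and no operator.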
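### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #58545

Problem Summary:

This PR introduces two new features for the SEARCH function:

#### 1. Lucene Boolean Mode

Adds a `mode` option to enable Lucene/Elasticsearch-style query parsing:

```sql
-- Enable Lucene mode via JSON options
SELECT * FROM docs WHERE search('apple AND banana', '{"default_field":"title","mode":"lucene"}');

-- With minimum_should_match
SELECT * FROM docs WHERE search('apple AND banana OR cherry', '{"default_field":"title","mode":"lucene","minimum_should_match":1}');
```

**Key differences from standard mode:**

- AND/OR/NOT work as left-to-right modifiers (not traditional boolean algebra)
- Uses MUST/SHOULD/MUST_NOT internally (like Lucene's Occur enum)
- Pure NOT queries return empty results (need positive clause)

**Behavior comparison:**

| Query | Standard Mode | Lucene Mode |
|-------|--------------|-------------|
| `a AND b` | a ∩ b | +a +b (both MUST) |
| `a OR b` | a ∪ b | a b (both SHOULD, min=1) |
| `NOT a` | ¬a | Empty (no positive clause) |
| `a AND NOT b` | a ∩ ¬b | +a -b (MUST a, MUST_NOT b) |
| `a AND b OR c` | (a ∩ b) ∪ c | +a b c (only a is MUST) |

#### 2. Escape Characters in DSL

Support for escaping special characters using backslash:

| Escape | Description | Example |
|--------|-------------|---------|
| `\ ` | Literal space | `title:First\ Value` matches "First Value" |
| `\(` `\)` | Literal parentheses | `title:hello\(world\)` matches "hello(world)" |
| `\:` | Literal colon | `title:key\:value` matches "key:value" |
| `\\` | Literal backslash | `title:path\\to\\file` matches "path\to\file" |

### Release note

… (`mode: "lucene"`, `minimum_should_match`)

### Check List (For Author)

- Test
- Behavior changed:
- Does this need documentation?

### Check List (For Reviewer who merges this PR)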