[enhancement](parquet) support column predicate tree min-max filter for parquet page index. #57771
run buildall
TPC-DS: Total hot run time: 190261 ms
ClickBench: Total hot run time: 27.44 s
Force-pushed from 1988815 to 430ea3c.
run buildall
TPC-DS: Total hot run time: 189821 ms
ClickBench: Total hot run time: 27.69 s
run buildall
TPC-H: Total hot run time: 34442 ms
TPC-DS: Total hot run time: 188197 ms
ClickBench: Total hot run time: 27.22 s
run buildall
TPC-H: Total hot run time: 34480 ms
TPC-DS: Total hot run time: 187984 ms
ClickBench: Total hot run time: 27.38 s
BE UT Coverage Report: Increment line coverage, Increment coverage report
BE Regression && UT Coverage Report: Increment line coverage, Increment coverage report
Pull Request Overview
This PR refactors the Parquet page index filtering implementation by improving the handling of OR predicates and removing the colname_to_value_range parameter from various reader initialization methods. The changes optimize page index filtering to better support complex predicate combinations.
Key Changes:
- Refactored page index filtering logic to support OR predicates with a new `evaluate_and` method that works with `CachedPageIndexStat` and `RowRanges`
- Removed the `colname_to_value_range` parameter from all reader `init_reader` methods (ORC, Parquet, Iceberg, Paimon, Hudi, Hive, etc.)
- Introduced `RowRanges` as a unified structure for representing row ranges to read, replacing the previous `vector<RowRange>` approach
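To illustrate why a unified row-range structure helps with AND/OR combinations, here is a minimal sketch of range intersection and union. The `Range` and `Ranges` types and the `intersect` and `unite` functions are hypothetical stand-ins written for this example; they are not the actual `segment_v2::RowRange` / `RowRanges` API, and the half-open interval convention is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a row range: half-open interval [from, to).
struct Range {
    int64_t from;
    int64_t to;
};

// Hypothetical stand-in for a set of row ranges: sorted and non-overlapping.
using Ranges = std::vector<Range>;

// AND semantics: keep only the rows selected by both inputs.
Ranges intersect(const Ranges& a, const Ranges& b) {
    Ranges out;
    std::size_t i = 0;
    std::size_t j = 0;
    while (i < a.size() && j < b.size()) {
        int64_t lo = std::max(a[i].from, b[j].from);
        int64_t hi = std::min(a[i].to, b[j].to);
        if (lo < hi) {
            out.push_back({lo, hi});
        }
        // Advance whichever range ends first.
        if (a[i].to < b[j].to) {
            ++i;
        } else {
            ++j;
        }
    }
    return out;
}

// OR semantics: keep the rows selected by either input, merging overlaps.
Ranges unite(const Ranges& a, const Ranges& b) {
    Ranges all(a);
    all.insert(all.end(), b.begin(), b.end());
    std::sort(all.begin(), all.end(),
              [](const Range& x, const Range& y) { return x.from < y.from; });
    Ranges out;
    for (const Range& r : all) {
        if (!out.empty() && r.from <= out.back().to) {
            // Overlapping or adjacent: extend the last merged range.
            out.back().to = std::max(out.back().to, r.to);
        } else {
            out.push_back(r);
        }
    }
    return out;
}
```

With ranges like these, an AND node can narrow the candidate rows by intersecting its children's results, while an OR node widens them by taking the union.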
Reviewed Changes
Copilot reviewed 38 out of 40 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| regression-test/suites/external_table_p0/hive/test_hive_page_index.groovy | New test suite for Hive page index filtering with various predicate combinations |
| docker/thirdparties/docker-compose/hive/scripts/preinstalled_data/parquet_table/decimals_1_10/decimals_1_10.parquet | Binary test data file for decimal column tests |
| docker/thirdparties/docker-compose/hive/scripts/create_preinstalled_scripts/run82.hql | HQL script to create test table for decimals |
| be/src/vec/exec/format/parquet/vparquet_reader.{h,cpp} | Major refactoring of page index filtering, min-max-bloom filter processing, and row group iteration logic |
| be/src/vec/exec/format/parquet/vparquet_group_reader.{h,cpp} | Updated to use RowRanges instead of vector<RowRange> |
| be/src/vec/exec/format/parquet/vparquet_column_reader.{h,cpp} | Updated column readers to work with RowRanges |
| be/src/vec/exec/format/parquet/parquet_predicate.h | Added PageIndexStat and CachedPageIndexStat structures, renamed get_min_max_value to parse_min_max_value |
| be/src/vec/exec/format/parquet/vparquet_page_index.{h,cpp} | Removed unused create_skipped_row_range method, made parse methods const |
| be/src/vec/exec/format/parquet/parquet_common.h | Replaced custom RowRange with segment_v2::RowRange and RowRanges |
| be/src/vec/exec/format/orc/vorc_reader.{h,cpp} | Removed colname_to_value_range parameter from init_reader |
| be/src/vec/exec/format/table/*.{h,cpp} | Updated table format readers (Iceberg, Paimon, Hudi, Hive, TransactionalHive) to remove colname_to_value_range parameter |
| be/src/vec/exec/scan/file_scanner.cpp | Updated all reader initialization calls to remove colname_to_value_range |
| be/src/olap/rowset/segment_v2/row_ranges.h | Added get_range method to RowRanges |
| be/src/olap/column_predicate.h | Added new evaluate_and method signature for page index filtering |
| be/src/olap/block_column_predicate.{h,cpp} | Implemented page index filtering for AND/OR block predicates |
| be/src/olap/comparison_predicate.h | Added page index filtering support for comparison predicates |
| be/src/olap/in_list_predicate.h | Added page index filtering support for IN list predicates |
| be/src/olap/null_predicate.h | Added page index filtering support for NULL predicates |
| be/src/olap/push_handler.{h,cpp} | Removed unused colname_to_value_range member |
| be/test/vec/exec/*.cpp | Updated test code to remove colname_to_value_range parameter, removed obsolete test methods |
The comment "The encoded Parquet min-max value is parsed into fields" has an extra space between "into" and the backtick. It should be a single space.
Original: // The encoded Parquet min-max value is parsed into  `fields`;
Suggested: // The encoded Parquet min-max value is parsed into `fields`;
The spelling "diable" should be "disable".
Duplicate query identifiers detected. Lines 84, 85, and 86 all use the same identifier order_qt_q33, but they are executing different queries. Each query should have a unique identifier, such as order_qt_q34 and order_qt_q35 for the second and third queries respectively.
The comment "check this range contain this tow group" has a spelling error. "tow" should be "row".
Original: // check this range contain this tow group.
Suggested: // check this range contain this row group.
run buildall
TPC-H: Total hot run time: 34203 ms
TPC-DS: Total hot run time: 187676 ms
ClickBench: Total hot run time: 27.54 s
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
…ition.columns prop table cause be core. (#58532)
### What problem does this PR solve?
Related PR: #57771
Problem Summary: Fixed a core issue when reading Hudi Parquet format tables with the `hoodie.properties` `hoodie.datasource.write.drop.partition.columns=false`.
```
*** SIGSEGV address not mapped to object (@0x18) received by PID 12234 (TID 38368 OR 0x7f0bd279e640) from PID 24; stack trace: ***
11:01:31 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_master/doris/be/src/common/signal_handler.h:420
11:01:31 1# PosixSignals::chained_handler(int, siginfo*, void*) [clone .part.0] in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
11:01:31 2# JVM_handle_linux_signal in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
11:01:31 3# 0x00007F18963FB520 in /lib/x86_64-linux-gnu/libc.so.6
11:01:31 4# std::_Function_handler<bool (doris::vectorized::ParquetPredicate::PageIndexStat**, int), doris::vectorized::ParquetReader::_process_page_index_filter(tparquet::RowGroup const&, doris::vectorized::RowGroupReader::RowGroupIndex const&, std::vector<std::unique_ptr<doris::MutilColumnBlockPredicate, std::default_delete<doris::MutilColumnBlockPredicate> >, std::allocator<std::unique_ptr<doris::MutilColumnBlockPredicate, std::default_delete<doris::MutilColumnBlockPredicate> > > > const&, doris::segment_v2::RowRanges*)::$_1>::_M_invoke(std::_Any_data const&, doris::vectorized::ParquetPredicate::PageIndexStat**&&, int&&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:292
11:01:31 5# doris::InListPredicateBase<(doris::PrimitiveType)2, (doris::PredicateType)7, doris::HybridSet<(doris::PrimitiveType)2, doris::FixedContainer<bool, 1ul>, doris::vectorized::PredicateColumnType<(doris::PrimitiveType)2> > >::evaluate_and(doris::vectorized::ParquetPredicate::CachedPageIndexStat*, doris::segment_v2::RowRanges*) const at /home/zcp/repo_center/doris_master/doris/be/src/olap/in_list_predicate.h:345
11:01:31 6# doris::AndBlockColumnPredicate::evaluate_and(doris::vectorized::ParquetPredicate::CachedPageIndexStat*, doris::segment_v2::RowRanges*) const at /home/zcp/repo_center/doris_master/doris/be/src/olap/block_column_predicate.cpp:148
11:01:31 7# doris::vectorized::ParquetReader::_process_page_index_filter(tparquet::RowGroup const&, doris::vectorized::RowGroupReader::RowGroupIndex const&, std::vector<std::unique_ptr<doris::MutilColumnBlockPredicate, std::default_delete<doris::MutilColumnBlockPredicate> >, std::allocator<std::unique_ptr<doris::MutilColumnBlockPredicate, std::default_delete<doris::MutilColumnBlockPredicate> > > > const&, doris::segment_v2::RowRanges*) in /mnt/hdd01/ci/doris-deploy-master-local/be/lib/doris_be
11:01:31 8# doris::vectorized::ParquetReader::_process_min_max_bloom_filter(doris::vectorized::RowGroupReader::RowGroupIndex const&, tparquet::RowGroup const&, std::vector<std::unique_ptr<doris::MutilColumnBlockPredicate, std::default_delete<doris::MutilColumnBlockPredicate> >, std::allocator<std::unique_ptr<doris::MutilColumnBlockPredicate, std::default_delete<doris::MutilColumnBlockPredicate> > > > const&, doris::segment_v2::RowRanges*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/format/parquet/vparquet_reader.cpp:1082
11:01:31 9# doris::vectorized::ParquetReader::_next_row_group_reader() in /mnt/hdd01/ci/doris-deploy-master-local/be/lib/doris_be
11:01:31 10# doris::vectorized::ParquetReader::get_next_block(doris::vectorized::Block*, unsigned long*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/format/parquet/vparquet_reader.cpp:598
11:01:31 11# doris::vectorized::HudiReader::get_next_block_inner(doris::vectorized::Block*, unsigned long*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/format/table/hudi_reader.cpp:29
11:01:31 12# doris::vectorized::TableFormatReader::get_next_block(doris::vectorized::Block*, unsigned long*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/format/table/table_format_reader.h:82
11:01:31 13# doris::vectorized::FileScanner::_get_block_wrapped(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/scan/file_scanner.cpp:472
```
…r parquet page index. (apache#57771)
Problem Summary:
1. Previously, page index filtering could only handle SQL WHERE conditions composed solely of AND; this PR also handles conditions that contain OR.
2. Because the topn runtime filter is dynamically maintained, this PR delays applying the topn RF min-max filter until the row group reader is created.
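As a rough illustration of the first point, the sketch below evaluates a small predicate tree against per-page min-max statistics. It assumes a single integer column and only `>` and `<` leaves; `PageStat`, `Node`, `leaf`, `combine`, and `may_match` are hypothetical names made up for this example and do not reflect the actual Doris classes or the `evaluate_and` signature.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical per-page statistics for one integer column.
struct PageStat {
    int64_t min;
    int64_t max;
};

// Hypothetical predicate tree node: a comparison leaf or an AND/OR of children.
struct Node {
    enum Kind { GT, LT, AND, OR };
    Kind kind = GT;
    int64_t value = 0;           // Used only by GT/LT leaves.
    std::vector<Node> children;  // Used only by AND/OR nodes.
};

Node leaf(Node::Kind kind, int64_t value) {
    Node n;
    n.kind = kind;
    n.value = value;
    return n;
}

Node combine(Node::Kind kind, std::vector<Node> children) {
    Node n;
    n.kind = kind;
    n.children = std::move(children);
    return n;
}

// Returns, for each page, whether it may contain rows matching the predicate.
// A page is skipped only when its min-max range proves that no row can match.
std::vector<bool> may_match(const Node& n, const std::vector<PageStat>& pages) {
    // AND starts from "all pages" and narrows; OR starts from "no pages" and widens.
    std::vector<bool> keep(pages.size(), n.kind == Node::AND);
    for (std::size_t p = 0; p < pages.size(); ++p) {
        if (n.kind == Node::GT) {
            keep[p] = pages[p].max > n.value;
        } else if (n.kind == Node::LT) {
            keep[p] = pages[p].min < n.value;
        }
    }
    for (const Node& child : n.children) {
        std::vector<bool> sub = may_match(child, pages);
        for (std::size_t p = 0; p < pages.size(); ++p) {
            keep[p] = (n.kind == Node::AND) ? (keep[p] && sub[p]) : (keep[p] || sub[p]);
        }
    }
    return keep;
}
```

For example, with pages covering `[0, 9]`, `[10, 19]`, and `[20, 29]`, the condition `col < 5 OR col > 25` keeps the first and third pages and skips the second. Predicates over several columns additionally need page boundaries mapped to row ranges, since pages of different columns do not line up; this sketch leaves that out.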
…x filter for parquet page index. (#57771) (#59680)
Backport of #57771.
What problem does this PR solve?
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)