Skip to content

parquet: O(1) skip for bw=0 miniblocks in DeltaBitPackDecoder #9783

@sahuagin

Description

@sahuagin

The initial idea came in 2017 while evaluating the parquet specification for a project. In
particular I got to the Delta Encoding description and noted that the format was
<min delta> <list of bitwidths of miniblocks> <miniblocks>. Thinking about how a column
traversal for a scan would occur, the block would be expanded into memory, scanned, and then
tossed. However, there is already enough information in the header to determine if a value
falls between the min and max of the miniblocks in this block — rather than decompressing and
comparing, why not compress the predicate and compare against the compressed data?

I didn't have time to explore the idea at the time, but I continued to be curious whether it'd
been implemented as part of the predicate pushdown work. Recently I explored this with Claude,
giving it a research task to see if the idea had been implemented. Informed that it hadn't,
we worked together on a test project to see if the idea had any merit. Satisfied that it worked,
we then integrated it into the codebase. The API change (Decoder::scan_filtered) is isolated
to the final PR so it can be targeted independently.

This first PR addresses the simplest case: bw=0 miniblocks in the non-terminal skip path.


In DeltaBitPackDecoder::skip(), miniblocks with bit_width=0 currently call
get_batch and iterate over 32 or 64 values even though every delta in the
miniblock equals min_delta exactly (remainders are all zero). No bit reads are
needed; the only state update is advancing last_value by n * min_delta.

Proposed fix: When bit_width == 0 in the non-terminal skip path, replace
the get_batch call with a single wrapping_mul + wrapping_add. When
min_delta == 0 the loop body is a no-op and can be skipped entirely.

Measured improvement (arrow_reader bench, vs upstream HEAD):

  • bw=0 single-value skip: -21.6%
  • bw=0 increasing-value skip: -24.3%

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions