Skip to content

GH-41609: [Java] Initialize non empty offset buffer for variable-size layout before exporting#41610

Closed
viirya wants to merge 3 commits into
apache:mainfrom
viirya:fix_uninit_offset_buf
Closed

GH-41609: [Java] Initialize non empty offset buffer for variable-size layout before exporting#41610
viirya wants to merge 3 commits into
apache:mainfrom
viirya:fix_uninit_offset_buf

Conversation

@viirya

@viirya viirya commented May 9, 2024

Copy link
Copy Markdown
Member

Rationale for this change

This is a follow up of #40038. In #40038, we fixed null offset buffer issue for BaseVariableWidthVector. For empty vector, instead of a empty offset buffer which turns to be a null buffer through C Data Interface, we export a non-empty offset which is supposed to contain zero value.

But the initialization code has a bug in the PR #40043, so the offset buffer is not initialized. Note that this is not a regression because the exported vector never works due to null offset buffer.

What changes are included in this PR?

Fixed the uninitialized offset buffer of empty BaseVariableWidthVector.

Are these changes tested?

Added test cases.

Are there any user-facing changes?

No

@viirya viirya requested a review from lidavidm as a code owner May 9, 2024 21:52

@viirya viirya left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the tests first to run them in CI to verify.

The fix is ready and will be submitted later.

@github-actions github-actions Bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels May 9, 2024
@viirya viirya marked this pull request as draft May 9, 2024 21:58
@viirya

viirya commented May 9, 2024

Copy link
Copy Markdown
Member Author

Marked it as a draft and will make it ready for review once I submit the fix.

@github-actions github-actions Bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 9, 2024
@vibhatha

Copy link
Copy Markdown
Contributor

@viirya just to make sure I understand the change here. Are we introducing the required change in this PR itself right?

And these test cases supposed to fail without that fix?

@viirya

viirya commented May 10, 2024

Copy link
Copy Markdown
Member Author

@vibhatha I have some fix locally but I am unable to run the C module tests locally on a Mac. I am trying to write these tests that are supposed to fail in CI to verify that.

Once the tests are verified and the fix are submitted, I will make this ready for review.

@vibhatha

Copy link
Copy Markdown
Contributor

Thanks for explaining @viirya . If you need help please let me know.

@viirya

viirya commented May 10, 2024

Copy link
Copy Markdown
Member Author

Thank you @vibhatha

@viirya

viirya commented May 10, 2024

Copy link
Copy Markdown
Member Author

Hmm, I think I know why the tests are not failed in Java Arrow. This is how Java Arrow imports an Utf8 array:

try (ArrowBuf offsets = importOffsets(type, VarCharVector.OFFSET_WIDTH)) {
      final int start = offsets.getInt(0);
      final int end = offsets.getInt(fieldNode.getLength() * (long) VarCharVector.OFFSET_WIDTH);
      final int len = end - start;
      ...
}

So even the offset buffer is not initialized, for empty array with one element offset buffer, end - start is always 0 that is the length of data buffer. That is why the added roundtrip tests are passed.

But in arrow-rs, it takes the last value of the offset buffer as the length of data buffer, i.e., end. If the value is not initialized to zero, the computed length of data buffer is incorrect.

That is what I found for the first offset value from the spec:

Generally the first slot in the offsets array is 0, and the last slot is the length of the values array.
When serializing this layout, we recommend normalizing the offsets to start at 0.

It looks like the first value doesn't have to be 0, although generally it is. So seems Java Arrow's approach is correct without issue?

@viirya

viirya commented May 10, 2024

Copy link
Copy Markdown
Member Author

Let me open an issue at arrow-rs and see if it makes more sense to fix it there.

@vibhatha

Copy link
Copy Markdown
Contributor

@viirya just for my curiosity, so the error you experience comes when Arrow Java vector is imported in Rust?

@viirya

viirya commented May 10, 2024

Copy link
Copy Markdown
Member Author

@viirya just for my curiosity, so the error you experience comes when Arrow Java vector is imported in Rust?

Yes

@viirya

viirya commented May 13, 2024

Copy link
Copy Markdown
Member Author

I fixed the issue at arrow-rs by apache/arrow-rs#5756. So I am closing this.

Thanks @vibhatha for review.

@viirya viirya closed this May 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants