Skip to content

[server] Recover log and index file for unclean shutdown#1749

Merged
wuchong merged 6 commits into
apache:mainfrom
LiebingYu:fix-corrupt-index-new
Nov 10, 2025
Merged

[server] Recover log and index file for unclean shutdown#1749
wuchong merged 6 commits into
apache:mainfrom
LiebingYu:fix-corrupt-index-new

Conversation

@LiebingYu
Copy link
Copy Markdown
Contributor

Purpose

Linked issue: close #1716

Brief change log

Tests

API and Format

Documentation

@LiebingYu LiebingYu force-pushed the fix-corrupt-index-new branch 2 times, most recently from b9b5f2d to ceaf704 Compare September 24, 2025 07:50
@LiebingYu
Copy link
Copy Markdown
Contributor Author

Ready for CR @wuchong @swuferhong

Copy link
Copy Markdown
Contributor

@swuferhong swuferhong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LiebingYu Thanks for your work. I left some comments.

@LiebingYu LiebingYu force-pushed the fix-corrupt-index-new branch from ceaf704 to 1e8a582 Compare September 25, 2025 14:06
// can potentially skip over more segments.
LogTablet.rebuildWriterState(
writerStateManager, logSegments, 0, segment.getBaseOffset(), false);
int bytesTruncated = segment.recover();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to pass writerStateManager into segment.recover() to update writer state like how kafka does? cc @swuferhong

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kafka use writerStateManager in segment.recover() mainly for txn operations. I think Fluss doen't need here. What's your opinion @swuferhong?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussing offline with @LiebingYu , we confirmed that it‘s necessary to rebuild the writerState1. Although it is not used during recovery, this code path ensures that when writerStateManager.takeSnapshot()` is called, a correct snapshot up to the current segment is written to disk. This way, during recovery, each segment can be restored based on a fully accurate writerState snapshot, without requiring a full global reconstruction.

segment.getFileLogRecords().file().getAbsoluteFile());
}
recoverSegment(segment);
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we catch CorruptIndexException and recover segment for this case like how kafka does? And we should rethrow other exceptions in else branch.

Copy link
Copy Markdown
Contributor Author

@LiebingYu LiebingYu Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Kafka throw CorruptIndexException because Kafka will do sanity check for transaction file which may throw CorruptIndexException. But here in fluss we don't do sanity check of index files in segment.sanityCheck. So I think we don't need to catch CorruptIndexException here.

Comment thread fluss-server/src/test/java/org/apache/fluss/server/log/LogLoaderTest.java Outdated
Comment thread fluss-server/src/test/java/org/apache/fluss/server/log/LogLoaderTest.java Outdated
Comment thread fluss-server/src/main/java/org/apache/fluss/server/log/LogSegment.java Outdated
Comment thread fluss-server/src/main/java/org/apache/fluss/server/log/LogLoader.java Outdated
@wuchong
Copy link
Copy Markdown
Member

wuchong commented Nov 4, 2025

I added a commit to fix the IOException signature problem.

@LiebingYu LiebingYu requested a review from wuchong November 5, 2025 05:38
@wuchong wuchong merged commit d5cb521 into apache:main Nov 10, 2025
5 checks passed
LiebingYu added a commit to LiebingYu/fluss that referenced this pull request Nov 21, 2025
LiebingYu added a commit to LiebingYu/fluss that referenced this pull request Nov 24, 2025
@LiebingYu LiebingYu mentioned this pull request Nov 26, 2025
2 tasks
swuferhong added a commit to swuferhong/fluss that referenced this pull request Nov 27, 2025
swuferhong added a commit that referenced this pull request Nov 27, 2025
Ugbot pushed a commit to Ugbot/fluss that referenced this pull request Apr 26, 2026
Ugbot pushed a commit to Ugbot/fluss that referenced this pull request Apr 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ReplicaFetcherThread keeps throwing UnknownServerException because of corrupt index file

3 participants