[Bug] With massive scan tasks, BE will crash in jni_connector.close() if a scanner fails #56328

@wenzhenghu

Description

Search before asking

  • I have searched the issues and found no similar issues.

Version

branch 3, branch 3.1 and master

What's Wrong?

In the Paimon table scan case, if the BE JVM does not have enough memory, it reports errors such as:

  • OOM
  • [warning][gc,alloc] Thread-272340: Retried waiting for GCLocker too often allocating 172082 words)

jni_connector then fails in some of its logic, and the BE may crash. The stack trace is shown below:

*** Query id: 7ac049e2ed6644aa-8ae4d8cbab5119e9 *** 
 *** is nereids: 1 *** 
 *** tablet id: 0 *** 
 *** Aborted at 1755692288 (unix time) try "date -d @1755692288" if you are using GNU date *** 
 *** Current BE git commitID: 39f9074cec *** 
 *** SIGSEGV address not mapped to object (@0x238) received by PID 267221 (TID 890774 OR 0x7a081328b700) from PID 568; stack trace: *** 
  0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_release/doris/be/src/common/signal_handler.h:421 
  1# PosixSignals::chained_handler(int, siginfo*, void*) [clone .part.0] in /home/doris/dorisdb_pub/java17/lib/server/libjvm.so 
  2# JVM_handle_linux_signal in /home/doris/dorisdb_pub/java17/lib/server/libjvm.so 
  3# 0x00007F0AD94D6770 in /lib64/libc.so.6 
  4# OopStorage::Block::release_entries(unsigned long, OopStorage*) in /home/doris/dorisdb_pub/java17/lib/server/libjvm.so 
  5# OopStorage::release(oopDesc* const*) in /home/doris/dorisdb_pub/java17/lib/server/libjvm.so 
  6# jni_DeleteGlobalRef in /home/doris/dorisdb_pub/java17/lib/server/libjvm.so 
  7# doris::vectorized::JniConnector::close() at /home/zcp/repo_center/doris_release/doris/be/src/vec/exec/jni_connector.cpp:182 
  8# doris::vectorized::JniReader::close() in /home/doris/dorisdb_pub/be/lib/doris_be 
  9# doris::vectorized::VFileScanner::close(doris::RuntimeState*) at /home/zcp/repo_center/doris_release/doris/be/src/vec/exec/scan/vfile_scanner.cpp:1175 
 10# doris::vectorized::ScannerDelegate::~ScannerDelegate() at /home/zcp/repo_center/doris_release/doris/be/src/vec/exec/scan/vscan_node.h:36 
 11# doris::vectorized::ScannerScheduler::_scanner_scan(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>) at /home/zcp/repo_center/doris_release/doris/be/src/vec/exec/scan/scanner_scheduler.cpp:323 
 12# std::_Function_handler<void (), doris::vectorized::ScannerScheduler::submit(std::shared_ptr<doris::vectorized::ScannerContext>, std::shared_ptr<doris::vectorized::ScanTask>)::$_1::operator()() const::{lambda()#1}>::_M_invoke(std::_Any_data const&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:291 
 13# doris::ThreadPool::dispatch_thread() in /home/doris/dorisdb_pub/be/lib/doris_be

Code analysis shows that the BE crash is caused by repeated calls to JniConnector::close(), which lead to a double free of the related JNI memory.
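One common way to make a close method safe against repeated calls is an idempotence guard. The following is only a minimal, self-contained sketch of that idea, not the actual Doris code or the fix submitted for this issue: `FakeRefTable` and `GuardedConnector` are hypothetical names, the table merely stands in for the JVM's global references, and the sketch assumes close() is never called from two threads at the same time.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical stand-in for the JVM's global-reference storage.
// Releasing a handle that is not live models an invalid second
// DeleteGlobalRef on the same reference.
class FakeRefTable {
public:
    int create() {
        int handle = _next++;
        _live.insert(handle);
        return handle;
    }
    void release(int handle) {
        if (_live.erase(handle) == 0) {
            ++_invalid_releases;  // double release detected
        }
    }
    std::size_t live_count() const { return _live.size(); }
    int invalid_releases() const { return _invalid_releases; }

private:
    int _next = 1;
    int _invalid_releases = 0;
    std::set<int> _live;
};

// Hypothetical connector whose close() is idempotent.
class GuardedConnector {
public:
    explicit GuardedConnector(FakeRefTable* table) : _table(table) {
        assert(_table != nullptr);
        _ref = _table->create();
    }
    GuardedConnector(const GuardedConnector&) = delete;
    GuardedConnector& operator=(const GuardedConnector&) = delete;
    ~GuardedConnector() { close(); }

    // Only the first call releases the reference; later calls are no-ops.
    void close() {
        if (_closed) return;
        _closed = true;
        _table->release(_ref);
        _ref = 0;
    }
    bool closed() const { return _closed; }

private:
    FakeRefTable* _table;
    int _ref = 0;
    bool _closed = false;  // assumption: not accessed concurrently
};

// Closes the same connector twice (plus once more in the destructor)
// and reports how many invalid releases reached the table.
int double_close_invalid_releases() {
    FakeRefTable table;
    {
        GuardedConnector connector(&table);
        connector.close();
        connector.close();
    }
    return table.invalid_releases();
}
```

With the guard in place, the explicit second close() and the destructor's close() both return early, so the table never sees a second release of the same handle. If close() could race between threads, the flag would need to be an atomic or protected by a lock.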

This issue is intermittent; failures in massively parallel scan tasks significantly raise the probability of hitting it.

What You Expected?

The BE should not crash; the SQL query should just return an error.

How to Reproduce?

  1. Set a very small BE JVM memory size.
  2. Create a Paimon table with a massive number of small files.
  3. Run a SELECT on this table.
    The BE may crash.

Anything Else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
