[BUG] BE may intermittently trigger a segmentation fault on exit #5213

@stdpain

Describe the bug
BE may intermittently crash with a segmentation fault while it is exiting.
The crash does not affect normal functionality, but it can make later troubleshooting harder, for example heap profiling.

Here is a core dump (master branch, debug build):

Core was generated by `/home/users/stdpain/opt/doris-deploy/be/lib/palo_be'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007ff97bb7d09c in __gnu_cxx::__normal_iterator<doris::TabletManager::tablets_shard*, std::vector<doris::TabletManager::tablets_shard, std::allocator<doris::TabletManager::tablets_shard> > >::__normal_iterator (this=0x7ff904d442b8, __i=<error reading variable>)
    at /ssd1/opt/stdpain/workspace/doris/workspace/doris-toolchain/gcc730/include/c++/7.3.0/bits/stl_iterator.h:780
780           : _M_current(__i) { }
[Current thread is 1 (LWP 38702)]
warning: File "/ssd1/opt/fenghaoasuch/workspace/doris/workspace/doris-toolchain/gcc730/lib64/libstdc++.so.6.0.24-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
(gdb) bt
#0  0x00007ff97bb7d09c in __gnu_cxx::__normal_iterator<doris::TabletManager::tablets_shard*, std::vector<doris::TabletManager::tablets_shard, std::allocator<doris::TabletManager::tablets_shard> > >::__normal_iterator (this=0x7ff904d442b8, __i=<error reading variable>)
    at /ssd1/opt/stdpain/workspace/doris/workspace/doris-toolchain/gcc730/include/c++/7.3.0/bits/stl_iterator.h:780
#1  0x00007ff97bb7b3df in std::vector<doris::TabletManager::tablets_shard, std::allocator<doris::TabletManager::tablets_shard> >::begin (this=0x8)
    at /ssd1/opt/stdpain/workspace/doris/workspace/doris-toolchain/gcc730/include/c++/7.3.0/bits/stl_vector.h:564
#2  0x00007ff97bb6ed37 in doris::TabletManager::find_best_tablet_to_compaction (this=0x0,
    compaction_type=doris::CUMULATIVE_COMPACTION, data_dir=0x55fca00,
    tablet_submitted_compaction=std::vector of length 0, capacity 0)
    at /home/users/stdpain/doris/core/be/src/olap/tablet_manager.cpp:681
#3  0x00007ff97ba76a83 in doris::StorageEngine::_compaction_tasks_generator (this=0x558cc00,
    compaction_type=doris::CUMULATIVE_COMPACTION,
    data_dirs=std::vector of length 1, capacity 1 = {...})
    at /home/users/stdpain/doris/core/be/src/olap/olap_server.cpp:397
#4  0x00007ff97ba764d5 in doris::StorageEngine::_compaction_tasks_producer_callback (this=0x558cc00)
    at /home/users/stdpain/doris/core/be/src/olap/olap_server.cpp:337
#5  0x00007ff97ba73d39 in doris::StorageEngine::<lambda()>::operator()(void) const (
    __closure=0x6fb8f18) at /home/users/stdpain/doris/core/be/src/olap/olap_server.cpp:78
#6  0x00007ff97ba77ae1 in std::_Function_handler<void(), doris::StorageEngine::start_bg_threads()::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /ssd1/opt/stdpain/workspace/doris/workspace/doris-toolchain/gcc730/include/c++/7.3.0/bits/std_function.h:316
#7  0x00007ff97cf14b7c in std::function<void ()>::operator()() const (this=0x6fb8f18)
    at /ssd1/opt/stdpain/workspace/doris/workspace/doris-toolchain/gcc730/include/c++/7.3.0/bits/std_function.h:706
#8  0x00007ff97a7163ce in doris::Thread::supervise_thread (arg=0x6fb8f00)
    at /home/users/stdpain/doris/core/be/src/util/thread.cpp:386
#9  0x00007ff978cd21c3 in start_thread () from /opt/compiler/gcc-4.8.2/lib64/libpthread.so.0
#10 0x00007ff9782f512d in clone () from /opt/compiler/gcc-4.8.2/lib64/libc.so.6

Here is be.out after rebuilding with ASAN:

=================================================================
==54102==ERROR: AddressSanitizer: heap-use-after-free on address 0x6190000cddc8 at pc 0x000001d36929 bp 0x7fcbbb572b70 sp 0x7fcbbb572b68
READ of size 8 at 0x6190000cddc8 thread T233 (compaction_task)
    #0 0x1d36928 in std::_Rb_tree<doris::DataDir*, std::pair<doris::DataDir* const, std::vector<long, std::allocator<long> > >, std::_Select1st<std::pair<doris::DataDir* const, std::vector<long, std::allocator<long> > > >, std::less<doris::DataDir*>, std::allocator<std::pair<doris::DataDir* const, std::vector<long, std::allocator<long> > > > >::_M_begin() /ssd1/opt/stdpain/workspace/doris/workspace/doris-toolchain/gcc730/include/c++/7.3.0/bits/stl_tree.h:737
    ...

To Reproduce
The bug is hard to hit by chance, but I found a way to reproduce it reliably.

Modify be/service/doris_main.cpp as follows:

    heartbeat_thrift_server = nullptr;
    sleep(20); // modify here
    doris::ExecEnv::destroy(exec_env);
    return 0;

Then:

  1. Run ./bin/start_be.sh
  2. Kill the BE process

It seems that StorageEngine is deleted while a background thread is still running. When that thread then tries to access StorageEngine, BE crashes. This matches the backtrace above, where frame #2 calls find_best_tablet_to_compaction with this=0x0.

Expected behavior
BE shouldn't exit with a segmentation fault.

Desktop (please complete the following information):

  • OS: CentOS 6

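Possible solutions
Make StorageEngine inherit from std::enable_shared_from_this, so that background threads can hold a shared reference to it,
or
wait for the background threads to exit before StorageEngine is destroyed.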
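
The second option could look roughly like the sketch below. It is only an illustration of the join-before-destroy idea, not Doris code: Engine, ShardTable, produce_tasks and run_and_shutdown are hypothetical names standing in for StorageEngine, TabletManager and the compaction producer thread.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical stand-in for TabletManager: state owned by the engine.
struct ShardTable {
    std::vector<int> shards{1, 2, 3};
};

// Hypothetical stand-in for StorageEngine (not the real Doris class).
class Engine {
public:
    Engine() : table_(std::make_unique<ShardTable>()) {
        // Start the worker last, once every member it reads is constructed.
        worker_ = std::thread([this] { produce_tasks(); });
    }

    // The destructor body runs before any member is destroyed, so joining
    // here guarantees the worker never sees a destroyed table_.
    ~Engine() { stop(); }

    // Ask the worker to finish and wait for it. Safe to call more than once.
    void stop() {
        stopped_.store(true);
        if (worker_.joinable()) {
            worker_.join();
        }
    }

    long iterations() const { return iterations_.load(); }

private:
    void produce_tasks() {
        while (!stopped_.load()) {
            // table_ is still alive here because the destructor joins first.
            iterations_.fetch_add(static_cast<long>(table_->shards.size()));
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    std::atomic<bool> stopped_{false};
    std::atomic<long> iterations_{0};
    std::unique_ptr<ShardTable> table_;
    std::thread worker_;
};

// Runs an engine until the worker has made progress, then shuts it down.
long run_and_shutdown() {
    Engine engine;
    while (engine.iterations() == 0) {
        std::this_thread::yield();  // wait for the worker's first pass
    }
    engine.stop();
    const long after_stop = engine.iterations();
    // The worker has been joined, so the counter can no longer change.
    assert(engine.iterations() == after_stop);
    return after_stop;
}
```

In this sketch the worker only reads members of the object that started it, and the destructor joins before any member is torn down. Whether the same ordering can be applied directly in StorageEngine depends on how its background threads are created and stopped, which this example does not model.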
