This repository was archived by the owner on Nov 17, 2023. It is now read-only.
CI CentOS-GPU tests failed #20494
Closed
Description
The CentOS GPU tests on the master branch look flaky.
Error Message
[2021-08-06T14:49:24.760Z] E mxnet.base.MXNetError: Traceback (most recent call last):
[2021-08-06T14:49:24.760Z] E [bt] (13) /usr/lib64/libc.so.6(clone+0x6d) [0x7fd07b25f9fd]
[2021-08-06T14:49:24.760Z] E [bt] (12) /usr/lib64/libpthread.so.0(+0x7ea5) [0x7fd07bc3fea5]
[2021-08-06T14:49:24.760Z] E [bt] (11) /work/mxnet/python/mxnet/../../build/libmxnet.so(+0xc64573f) [0x7fd023a2973f]
[2021-08-06T14:49:24.760Z] E [bt] (10) /work/mxnet/python/mxnet/../../build/libmxnet.so(std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run()+0x3a) [0x7fd018f00a4a]
[2021-08-06T14:49:24.760Z] E [bt] (9) /work/mxnet/python/mxnet/../../build/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x3e) [0x7fd018f04a6e]
[2021-08-06T14:49:24.760Z] E [bt] (8) /work/mxnet/python/mxnet/../../build/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x14d) [0x7fd018f048fd]
[2021-08-06T14:49:24.760Z] E [bt] (7) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x10e) [0x7fd018f0371e]
[2021-08-06T14:49:24.760Z] E [bt] (6) /work/mxnet/python/mxnet/../../build/libmxnet.so(+0x1b1341e) [0x7fd018ef741e]
[2021-08-06T14:49:24.760Z] E [bt] (5) /work/mxnet/python/mxnet/../../build/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x17) [0x7fd018f79427]
[2021-08-06T14:49:24.760Z] E [bt] (4) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x8d6) [0x7fd018f78906]
[2021-08-06T14:49:24.760Z] E [bt] (3) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::op::BinaryScalarRTCCompute::operator()(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x57e) [0x7fd01d07d63e]
[2021-08-06T14:49:24.760Z] E [bt] (2) /work/mxnet/python/mxnet/../../build/libmxnet.so(void mxnet::common::cuda::rtc::VectorizedKernelRTCLauncher<mxnet::op::binary_scalar_kernel_params>(std::string const&, std::string const&, std::string const&, int, int, int, mshadow::Stream<mshadow::gpu>*, mxnet::op::binary_scalar_kernel_params, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, int, int, int)+0x611) [0x7fd01d081a71]
[2021-08-06T14:49:24.760Z] E [bt] (1) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::common::cuda::rtc::get_function(std::string const&, std::string const&, std::string const&, int)+0x21c2) [0x7fd018e91e72]
[2021-08-06T14:49:24.760Z] E [bt] (0) /work/mxnet/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6d) [0x7fd018c0c90d]
[2021-08-06T14:49:24.760Z] E File "/work/mxnet/src/common/cuda/rtc.cc", line 258
[2021-08-06T14:49:24.760Z] E MXNetError: CUDA Driver: Unknown error -1