This repository was archived by the owner on Nov 17, 2023. It is now read-only.

CI CentOS-GPU tests failed #20494

@barry-jin

Description

The CentOS GPU tests on the master branch look flaky.

https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-20491/4/pipeline

Error Message

[2021-08-06T14:49:24.760Z] E           mxnet.base.MXNetError: Traceback (most recent call last):
[2021-08-06T14:49:24.760Z] E             [bt] (13) /usr/lib64/libc.so.6(clone+0x6d) [0x7fd07b25f9fd]
[2021-08-06T14:49:24.760Z] E             [bt] (12) /usr/lib64/libpthread.so.0(+0x7ea5) [0x7fd07bc3fea5]
[2021-08-06T14:49:24.760Z] E             [bt] (11) /work/mxnet/python/mxnet/../../build/libmxnet.so(+0xc64573f) [0x7fd023a2973f]
[2021-08-06T14:49:24.760Z] E             [bt] (10) /work/mxnet/python/mxnet/../../build/libmxnet.so(std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run()+0x3a) [0x7fd018f00a4a]
[2021-08-06T14:49:24.760Z] E             [bt] (9) /work/mxnet/python/mxnet/../../build/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x3e) [0x7fd018f04a6e]
[2021-08-06T14:49:24.760Z] E             [bt] (8) /work/mxnet/python/mxnet/../../build/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x14d) [0x7fd018f048fd]
[2021-08-06T14:49:24.760Z] E             [bt] (7) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x10e) [0x7fd018f0371e]
[2021-08-06T14:49:24.760Z] E             [bt] (6) /work/mxnet/python/mxnet/../../build/libmxnet.so(+0x1b1341e) [0x7fd018ef741e]
[2021-08-06T14:49:24.760Z] E             [bt] (5) /work/mxnet/python/mxnet/../../build/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x17) [0x7fd018f79427]
[2021-08-06T14:49:24.760Z] E             [bt] (4) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x8d6) [0x7fd018f78906]
[2021-08-06T14:49:24.760Z] E             [bt] (3) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::op::BinaryScalarRTCCompute::operator()(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x57e) [0x7fd01d07d63e]
[2021-08-06T14:49:24.760Z] E             [bt] (2) /work/mxnet/python/mxnet/../../build/libmxnet.so(void mxnet::common::cuda::rtc::VectorizedKernelRTCLauncher<mxnet::op::binary_scalar_kernel_params>(std::string const&, std::string const&, std::string const&, int, int, int, mshadow::Stream<mshadow::gpu>*, mxnet::op::binary_scalar_kernel_params, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, int, int, int)+0x611) [0x7fd01d081a71]
[2021-08-06T14:49:24.760Z] E             [bt] (1) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::common::cuda::rtc::get_function(std::string const&, std::string const&, std::string const&, int)+0x21c2) [0x7fd018e91e72]
[2021-08-06T14:49:24.760Z] E             [bt] (0) /work/mxnet/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6d) [0x7fd018c0c90d]
[2021-08-06T14:49:24.760Z] E             File "/work/mxnet/src/common/cuda/rtc.cc", line 258
[2021-08-06T14:49:24.760Z] E           MXNetError: CUDA Driver: Unknown error -1
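
For triage, a log like the one above can be reduced to its backtrace frames and trailing messages. The following is a minimal sketch, not part of the original report or of MXNet: `parse_trace`, `LINE_RE`, and `FRAME_RE` are hypothetical names, and the line layout it assumes (Jenkins timestamp in brackets, pytest's `E` marker, then `[bt] (N)` for frames) is inferred only from the lines quoted here.

```python
import re

# Assumed layout of a Jenkins-prefixed pytest error line:
#   "[<timestamp>] E   <payload>"
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+E\s+(?P<body>.*)$")
# Assumed layout of a backtrace frame inside the payload:
#   "[bt] (<index>) <binary>(<symbol>+0x<offset>) [0x<address>]"
FRAME_RE = re.compile(r"^\[bt\] \((?P<idx>\d+)\) (?P<rest>.*)$")


def parse_trace(log_text):
    """Split a log into (frames, messages).

    frames   -> list of (index, text) sorted so frame 0 (innermost) is first
    messages -> non-frame payload lines, in their original order
    """
    frames = []
    messages = []
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip anything that is not a pytest "E" line
        body = match.group("body").strip()
        frame = FRAME_RE.match(body)
        if frame:
            frames.append((int(frame.group("idx")), frame.group("rest")))
        else:
            messages.append(body)
    frames.sort(key=lambda item: item[0])
    return frames, messages


# Last four lines of the log in this issue, copied verbatim.
SAMPLE = """\
[2021-08-06T14:49:24.760Z] E             [bt] (1) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::common::cuda::rtc::get_function(std::string const&, std::string const&, std::string const&, int)+0x21c2) [0x7fd018e91e72]
[2021-08-06T14:49:24.760Z] E             [bt] (0) /work/mxnet/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6d) [0x7fd018c0c90d]
[2021-08-06T14:49:24.760Z] E             File "/work/mxnet/src/common/cuda/rtc.cc", line 258
[2021-08-06T14:49:24.760Z] E           MXNetError: CUDA Driver: Unknown error -1
"""

frames, messages = parse_trace(SAMPLE)
for index, text in frames:
    print(index, text)
for message in messages:
    print(message)
```

Read this way, the innermost named frames are `dmlc::LogMessageFatal` and `mxnet::common::cuda::rtc::get_function`, and the fatal check is reported at `src/common/cuda/rtc.cc`, line 258.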
