Recent RETURNN:
> RETURNN starting up, version 20200706.123141--git-1bbf93a4,… date/time 2020-07-12-09-57-02 (UTC+0200), pid 1594, cwd /work/asr4/zeyer/setups-data/switchboard/2020-06-09--e2e-multi-gpu/data-train/base2.conv2l.specaug4a.wdrop03.adrop01.l2a_1e_4.ctc.devtrain.lrwa.lrt_0005.mgpu4.htd100, Python /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/bin/python3
Horovod via SGE `-pe mpi 4` and then `mpirun`:
```
cluster-cn-275-pid1594: use_horovod, CUDA_VISIBLE_DEVICES: 1
cluster-cn-220-pid5101: use_horovod, CUDA_VISIBLE_DEVICES: 1
cluster-cn-238-pid14531: use_horovod, CUDA_VISIBLE_DEVICES: 1
cluster-cn-224-pid12931: use_horovod, CUDA_VISIBLE_DEVICES: 3
New maximum RSS usage: 4.5 GB
Horovod initialized. Hostname cluster-cn-238, pid 14531, rank 2 / size 4, local rank 0 / local size 1.
Horovod initialized. Hostname cluster-cn-275, pid 1594, rank 3 / size 4, local rank 0 / local size 1.
cluster-cn-238-pid14531: Local rank/size: 0 1
cluster-cn-275-pid1594: Local rank/size: 0 1
Horovod initialized. Hostname cluster-cn-224, pid 12931, rank 0 / size 4, local rank 0 / local size 1.
cluster-cn-224-pid12931: Local rank/size: 0 1
Horovod initialized. Hostname cluster-cn-220, pid 5101, rank 1 / size 4, local rank 0 / local size 1.
cluster-cn-220-pid5101: Local rank/size: 0 1
```
Horovod settings:
```
horovod_dataset_distribution = "random_seed_offset"
horovod_reduce_type = "param"
horovod_param_sync_time_diff = 100.
```
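For context, my reading of these settings (the sketch below is only illustrative, not RETURNN's actual code): `"random_seed_offset"` lets every rank shuffle the full dataset with a different seed offset instead of sharding it, and `horovod_reduce_type = "param"` with `horovod_param_sync_time_diff = 100.` means the ranks train independently and only average their parameters roughly every 100 seconds, i.e. the normal train step contains no Horovod op:
```python
import time

# Illustrative sketch of time-based parameter averaging, as I understand
# horovod_reduce_type = "param" + horovod_param_sync_time_diff = 100.
# (param_avg_op is a hypothetical op that allreduce-averages all trainable
# variables across ranks; the regular train step itself has no Horovod op).
_last_param_sync = time.time()

def maybe_sync_params(session, param_avg_op):
    global _last_param_sync
    if time.time() - _last_param_sync >= 100.0:  # horovod_param_sync_time_diff
        session.run(param_avg_op)
        _last_param_sync = time.time()
```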
Training started fine in epoch 1 and continued for many epochs and many hours, up until the end of epoch 101:
```
...
train epoch 101, step 2276, cost:ctc 0.38683144498930844, cost:ctc:exp 1.4723083054150774, cost:output/output_prob 0.2457886053112546, cost:output/output_prob:exp 1.2786292495379887, error:ctc 0.09923664154484868, error:output/output_prob 0.052434457466006286, loss 246.3717, max_size:bpe 61, max_size:bpe0 60, max_size:data 1372, mem_usage:GPU:0 9.8GB, num_seqs 5, 1.203 sec/step, elapsed 0:34:37, exp. remaining 0:00:00, complete 99.99%
train epoch 101, step 2277, cost:ctc 0.5404130568496726, cost:ctc:exp 1.7167158169826067, cost:output/output_prob 0.3057799498138394, cost:output/output_prob:exp 1.3576835151642892, error:ctc 0.14335664175450802, error:output/output_prob 0.06872852332890034, loss 322.9364, max_size:bpe 75, max_size:bpe0 74, max_size:data 1434, mem_usage:GPU:0 9.8GB, num_seqs 5, 1.341 sec/step, elapsed 0:34:38, exp. remaining 0:00:00, complete 99.99%
[2020-07-14 18:54:00.136244: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Stalled ranks:
3: [global_tensor_horovod_sum_have_data/HorovodAllreduce_globals_horovod_have_more_data_placeholder_0, global_tensor_horovod_sum_have_error/HorovodAllreduce_globals_horovod_have_error_placeholder_0]
[2020-07-14 18:55:00.137283: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
...
```
This last message then repeats every 60 seconds, and nothing else happens anymore:
```
[2020-07-14 22:32:00.724578: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Stalled ranks:
3: [global_tensor_horovod_sum_have_data/HorovodAllreduce_globals_horovod_have_more_data_placeholder_0, global_tensor_horovod_sum_have_error/HorovodAllreduce_globals_horovod_have_error_placeholder_0]
```
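As an aside: the 60-second warning interval comes from Horovod's stall inspector. If I read `horovod/common/stall_inspector.cc` correctly, it can be tuned via environment variables (set before Horovod initializes), e.g. to make a stalled job abort instead of hanging for hours:
```python
import os

# Assumed Horovod env vars (from horovod/common/stall_inspector.cc); they
# must be set before hvd.init() spawns the background thread:
os.environ["HOROVOD_STALL_CHECK_TIME_SECONDS"] = "60"       # warning interval
os.environ["HOROVOD_STALL_SHUTDOWN_TIME_SECONDS"] = "1800"  # abort after 30 min stall

import horovod.tensorflow as hvd
hvd.init()
```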
When I log in to the node of rank 3 and send a SIGUSR1, I get this traceback (via `faulthandler`; other threads omitted as irrelevant), i.e. we can see that it hangs in `sess.run`:
```
Current thread 0x00007f4cf0377700 (most recent call first):
File "/work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429 in _call_tf_sessionrun
File "/work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341 in _run_fn
File "/work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356 in _do_call
File "/work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350 in _do_run
File "/work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173 in _run
File "/work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950 in run
File "/u/zeyer/setups/switchboard/2020-06-09--e2e-multi-gpu/crnn/TFEngine.py", line 680 in run
File "/u/zeyer/setups/switchboard/2020-06-09--e2e-multi-gpu/crnn/TFEngine.py", line 1535 in train_epoch
File "/u/zeyer/setups/switchboard/2020-06-09--e2e-multi-gpu/crnn/TFEngine.py", line 1427 in train
File "crnn/rnn.py", line 449 in execute_main_task
File "crnn/rnn.py", line 639 in main
File "crnn/rnn.py", line 651 in <module>
```
Note that this is the standard train-step `sess.run`, i.e. nothing Horovod-specific. In fact, with these settings, there should be no Horovod op involved in this call at all.
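For reference, the SIGUSR1 dump uses the standard `faulthandler` module, registered at startup along these lines:
```python
import faulthandler
import signal

# On `kill -USR1 <pid>`, dump the tracebacks of all Python threads to stderr:
faulthandler.register(signal.SIGUSR1)
```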
The C traceback (via `gdb -p 1594 -ex 'thread apply all bt' -ex="set confirm off" -ex quit > gdblog.p1594.txt`) is
[here](https://gist.github.com/albertz/8d8ba27d27513b73cfad4865ebc8b13b). Some of the maybe-interesting threads (`WaitForWork` loops and Python threads excluded):
```
Thread 52 (Thread 0x7f4b857fe700 (LWP 1758)):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x00007f4cbc79ed73 in nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_*, timespec) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2 0x00007f4cbc79e491 in nsync::nsync_sem_wait_with_cancel_(nsync::waiter*, timespec, nsync::nsync_note_s_*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3 0x00007f4cbc79b752 in nsync::nsync_cv_wait_with_deadline_generic(nsync::nsync_cv_s_*, void*, void (*)(void*), void (*)(void*), timespec, nsync::nsync_note_s_*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4 0x00007f4cbc79bc63 in nsync::nsync_cv_wait_with_deadline(nsync::nsync_cv_s_*, nsync::nsync_mu_s_*, timespec, nsync::nsync_note_s_*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5 0x00007f4cbb2359d2 in tensorflow::EventMgr::PollLoop() ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6 0x00007f4cb22088ca in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()
Thread 48 (Thread 0x7f4bb37fe700 (LWP 1752)):
#0 0x00007ffe58bddb6d in clock_gettime ()
#1 0x00007f4cef511936 in __GI___clock_gettime (clock_id=4, tp=0x7f4bb37fb070) at ../sysdeps/unix/clock_gettime.c:115
#2 0x00007f4bfa8a1d0e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007f4bfa95cc77 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007f4bfa85e437 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5 0x00007f4bfa7800c6 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6 0x00007f4bfa8f0e60 in cuEventSynchronize () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7 0x00007f4cbcc8e144 in stream_executor::gpu::GpuDriver::GetEventElapsedTime(stream_executor::gpu::GpuContext*, float*, CUevent_st*, CUevent_st*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#8 0x00007f4cbb395b77 in stream_executor::gpu::GpuTimer::GetElapsedMilliseconds() const ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#9 0x00007f4cb26d62c8 in stream_executor::gpu::CudnnSupport::DoConvolve(stream_executor::dnn::ConvolutionKind, stream_executor::dnn::DataType, stream_executor::Stream*, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemoryBase, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemoryBase, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemoryBase, stream_executor::dnn::ConvolutionDescriptor const&, stream_executor::dnn::AlgorithmDesc, stream_executor::DeviceMemory<unsigned char>, stream_executor::dnn::ProfileResult*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.1
#10 0x00007f4cbcf46506 in stream_executor::Stream::ThenConvolveWithAlgorithm(stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float> const&, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemory<float> const&, stream_executor::dnn::ConvolutionDescriptor const&, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>*, stream_executor::ScratchAllocator*, stream_executor::dnn::AlgorithmConfig const&, stream_executor::dnn::ProfileResult*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#11 0x00007f4cba29d180 in tensorflow::LaunchConv2DOp<Eigen::GpuDevice, float>::operator()(tensorflow::OpKernelContext*, bool, bool, tensorflow::Tensor const&, tensorflow::Tensor const&, int, int, int, int, tensorflow::Padding const&, std::vector<long long, std::allocator<long long> > const&, tensorflow::Tensor*, tensorflow::TensorFormat) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#12 0x00007f4cba29ddd4 in tensorflow::Conv2DOp<Eigen::GpuDevice, float>::Compute(tensorflow::OpKernelContext*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#13 0x00007f4cb2114a0a in tensorflow::BaseGPUDevice::ComputeHelper(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.1
#14 0x00007f4cb2115605 in tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.1
#15 0x00007f4cb216f2c1 in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.1
#16 0x00007f4cb216f37f in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(absl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8ul, std::allocator<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode> > const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.1
#17 0x00007f4cb22088ca in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()
Thread 46 (Thread 0x7f4bb3fff700 (LWP 1750)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1 0x00007f4bfa8a57d7 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007f4bfa84af27 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007f4bfa8a4a58 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007f4cef7cd6ba in start_thread (arg=0x7f4bb3fff700) at pthread_create.c:333
#5 0x00007f4cef5034dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 42 (Thread 0x7f4c00ff9700 (LWP 1744)):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x00007f4cbcf76361 in absl::synchronization_internal::Waiter::Wait(absl::synchronization_internal::KernelTimeout) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2 0x00007f4cbcf761c1 in AbslInternalPerThreadSemWait ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3 0x00007f4cbcf778e5 in absl::Mutex::Block(absl::base_internal::PerThreadSynch*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4 0x00007f4cbcf7887c in absl::Mutex::AwaitCommon(absl::Condition const&, absl::synchronization_internal::KernelTimeout) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5 0x00007f4cbcf7890d in absl::Mutex::Await(absl::Condition const&) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6 0x00007f4cbce5d75c in stream_executor::host::HostStream::WorkLoop() ()
Thread 3 (Thread 0x7f4c8e1da700 (LWP 1626)):
#0 0x00007f4cef4f780d in poll () at ../sysdeps/unix/syscall-template.S:84
#1 0x00007f4c902e9e58 in ?? () from /usr/lib/libopen-pal.so.13
#2 0x00007f4c902e06fb in opal_libevent2021_event_base_loop () from /usr/lib/libopen-pal.so.13
#3 0x00007f4c9055bb8e in ?? () from /usr/lib/libopen-rte.so.12
#4 0x00007f4cef7cd6ba in start_thread (arg=0x7f4c8e1da700) at pthread_create.c:333
Thread 2 (Thread 0x7f4c90208700 (LWP 1625)):
#0 0x00007f4cef4f780d in poll () at ../sysdeps/unix/syscall-template.S:84
#1 0x00007f4c902e9e58 in ?? () from /usr/lib/libopen-pal.so.13
#2 0x00007f4c902e06fb in opal_libevent2021_event_base_loop () from /usr/lib/libopen-pal.so.13
#3 0x00007f4c902aa238 in opal_progress () from /usr/lib/libopen-pal.so.13
#4 0x00007f4c909eef65 in ompi_request_default_wait_all () from /usr/lib/libmpi.so.12
#5 0x00007f4c843f9426 in ompi_coll_tuned_allreduce_intra_recursivedoubling ()
from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#6 0x00007f4c909fef23 in PMPI_Allreduce () from /usr/lib/libmpi.so.12
#7 0x00007f4c90f3eadc in horovod::common::MPIController::CrossRankBitwiseAnd (this=<optimized out>, bitvector=...,
count=<optimized out>) at horovod/common/mpi/mpi_controller.cc:90
#8 0x00007f4c90f08cd2 in horovod::common::CacheCoordinator::sync (this=this@entry=0x7f4c90206fe0, controller=
std::shared_ptr (count 2, weak 1) 0x26158e0, timeline_enabled=<optimized out>) at horovod/common/response_cache.cc:390
#9 0x00007f4c90ed8c6b in horovod::common::Controller::CoordinateCacheAndState (this=this@entry=0x26158e0,
cache_coordinator=...) at horovod/common/controller.cc:615
#10 0x00007f4c90ee002a in horovod::common::Controller::ComputeResponseList (this=0x26158e0, shut_down=..., state=...)
at horovod/common/controller.cc:137
#11 0x00007f4c90ef949b in horovod::common::(anonymous namespace)::RunLoopOnce (state=...)
at horovod/common/operations.cc:568
#12 horovod::common::(anonymous namespace)::BackgroundThreadLoop (state=...) at horovod/common/operations.cc:509
Thread 1 (Thread 0x7f4cf0377700 (LWP 1594)):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x00007f4cbc79ed73 in nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_*, timespec) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2 0x00007f4cbc79e491 in nsync::nsync_sem_wait_with_cancel_(nsync::waiter*, timespec, nsync::nsync_note_s_*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3 0x00007f4cbc79b752 in nsync::nsync_cv_wait_with_deadline_generic(nsync::nsync_cv_s_*, void*, void (*)(void*), void (*)(void*), timespec, nsync::nsync_note_s_*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4 0x00007f4cbc79bc63 in nsync::nsync_cv_wait_with_deadline(nsync::nsync_cv_s_*, nsync::nsync_mu_s_*, timespec, nsync::nsync_note_s_*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5 0x00007f4cb7c1026b in tensorflow::DirectSession::WaitForNotification(tensorflow::Notification*, long long) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6 0x00007f4cb7c102cb in tensorflow::DirectSession::WaitForNotification(tensorflow::DirectSession::RunState*, tensorflow::CancellationManager*, long long) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7 0x00007f4cb7c18351 in tensorflow::DirectSession::RunInternal(long long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#8 0x00007f4cb7c2395b in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#9 0x00007f4cb54383aa in tensorflow::SessionRef::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#10 0x00007f4cb7c55a69 in TF_Run_Helper(tensorflow::Session*, char const*, TF_Buffer const*, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, TF_Tensor**, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, TF_Buffer*, TF_Status*) [clone .constprop.672] ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#11 0x00007f4cb7c562ed in TF_SessionRun ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#12 0x00007f4cb5434301 in tensorflow::TF_SessionRun_wrapper_helper(TF_Session*, char const*, TF_Buffer const*, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<_object*, std::allocator<_object*> > const&, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<TF_Operation*, std::allocator<TF_Operation*> > const&, TF_Buffer*, TF_Status*, std::vector<_object*, std::allocator<_object*> >*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#13 0x00007f4cb54343a2 in tensorflow::TF_SessionRun_wrapper(TF_Session*, TF_Buffer const*, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<_object*, std::allocator<_object*> > const&, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<TF_Operation*, std::allocator<TF_Operation*> > const&, TF_Buffer*, TF_Status*, std::vector<_object*, std::allocator<_object*> >*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#14 0x00007f4cb53f2923 in _wrap_TF_SessionRun_wrapper ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#15 0x00007f4cefafa1b5 in _PyMethodDef_RawFastCallKeywords () at ../Objects/call.c:694
```
* I wonder why it hangs in `CudnnSupport::DoConvolve`, or more specifically in `GpuDriver::GetEventElapsedTime`, or even more specifically in `cuEventSynchronize`, which looks like a CUDA issue. Or maybe I misunderstand how this [`GpuTimer`](https://github.com/tensorflow/tensorflow/blob/fa2324cc4860028ceb2bf8b2ed8f99af635b3e31/tensorflow/stream_executor/gpu/gpu_timer.cc) works, or how it is used in [`DoConvolve`](https://github.com/tensorflow/tensorflow/blob/fe9b27b306652108c6bbc505b3d98c89315cb93b/tensorflow/stream_executor/cuda/cuda_dnn.cc#L3087). Maybe it is [this issue](https://stackoverflow.com/questions/25979764/cuda-hangs-on-cudadevicesynchronize-randomly), or [this one](https://forums.developer.nvidia.com/t/synchronization-hangs-sporadically-after-kernel-launch/38993)? Also, if this really is a CUDA-related issue, why has it never occurred without Horovod? (See the autotuning note after this list.)
* I also wonder why `opal_libevent2021_event_base_loop` shows up in two different threads (threads 2 and 3).
* As mentioned above, with the given settings the `sess.run` of the main thread should not involve any Horovod op, so no MPI communication should be running at this point. However, thread 2 is inside `PMPI_Allreduce` (via Horovod's `BackgroundThreadLoop` → `ComputeResponseList` → `CacheCoordinator::sync`). Why?
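Regarding the first point: `DoConvolve` only uses the `GpuTimer` when it is called with a `ProfileResult`, i.e. during cuDNN autotuning. If the hang is really in that timer path, one experiment would be to disable cuDNN autotuning altogether (TensorFlow reads `TF_CUDNN_USE_AUTOTUNE` in `tensorflow/core/util/use_cudnn.cc`), so that path is never taken:
```python
import os

# Disable cuDNN convolution autotuning; with autotuning off, DoConvolve
# should never run algorithms under a GpuTimer / GetEventElapsedTime.
# Must be set before TensorFlow is imported/initialized.
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"

import tensorflow as tf
```
This would of course only narrow the problem down, not explain why it happens only with Horovod.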
Note that this is not the first time I'm seeing this; I've already seen it a couple of times. Originally I hoped it was some temporary issue in our cluster, but there seems to be a real problem or bug. I don't know whether it is on our side, in OpenMPI (we have a quite old version), or in Horovod.