Recent RETURNN:
> RETURNN starting up, version 20200706.123141--git-1bbf93a4,… date/time 2020-07-12-09-57-02 (UTC+0200), pid 1594, cwd /work/asr4/zeyer/setups-data/switchboard/2020-06-09--e2e-multi-gpu/data-train/base2.conv2l.specaug4a.wdrop03.adrop01.l2a_1e_4.ctc.devtrain.lrwa.lrt_0005.mgpu4.htd100, Python /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/bin/python3
Horovod via SGE `-pe mpi 4` and then `mpirun`:
```
cluster-cn-275-pid1594: use_horovod, CUDA_VISIBLE_DEVICES: 1
cluster-cn-220-pid5101: use_horovod, CUDA_VISIBLE_DEVICES: 1
cluster-cn-238-pid14531: use_horovod, CUDA_VISIBLE_DEVICES: 1
cluster-cn-224-pid12931: use_horovod, CUDA_VISIBLE_DEVICES: 3
New maximum RSS usage: 4.5 GB
Horovod initialized. Hostname cluster-cn-238, pid 14531, rank 2 / size 4, local rank 0 / local size 1.
Horovod initialized. Hostname cluster-cn-275, pid 1594, rank 3 / size 4, local rank 0 / local size 1.
cluster-cn-238-pid14531: Local rank/size: 0 1
cluster-cn-275-pid1594: Local rank/size: 0 1
Horovod initialized. Hostname cluster-cn-224, pid 12931, rank 0 / size 4, local rank 0 / local size 1.
cluster-cn-224-pid12931: Local rank/size: 0 1
Horovod initialized. Hostname cluster-cn-220, pid 5101, rank 1 / size 4, local rank 0 / local size 1.
cluster-cn-220-pid5101: Local rank/size: 0 1
```
Horovod settings:
```
horovod_dataset_distribution = "random_seed_offset"
horovod_reduce_type = "param"
horovod_param_sync_time_diff = 100.
```
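For context, my reading of these settings (the sketch below is only illustrative, not RETURNN's actual code): `"random_seed_offset"` lets every rank shuffle the full dataset with a different seed offset instead of sharding it, and `horovod_reduce_type = "param"` with `horovod_param_sync_time_diff = 100.` means the ranks train independently and only average their parameters roughly every 100 seconds, i.e. the normal train step contains no Horovod op:
```python
import time

# Illustrative sketch of time-based parameter averaging, as I understand
# horovod_reduce_type = "param" + horovod_param_sync_time_diff = 100.
# (param_avg_op is a hypothetical op that allreduce-averages all trainable
# variables across ranks; the regular train step itself has no Horovod op).
_last_param_sync = time.time()

def maybe_sync_params(session, param_avg_op):
    global _last_param_sync
    if time.time() - _last_param_sync >= 100.0:  # horovod_param_sync_time_diff
        session.run(param_avg_op)
        _last_param_sync = time.time()
```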
Training started fine in epoch 1 and continued for many epochs and many hours, up until the end of epoch 101:
```
...
train epoch 101, step 2276, cost:ctc 0.38683144498930844, cost:ctc:exp 1.4723083054150774, cost:output/output_prob 0.2457886053112546, cost:output/output_prob:exp 1.2786292495379887, error:ctc 0.09923664154484868, error:output/output_prob 0.052434457466006286, loss 246.3717, max_size:bpe 61, max_size:bpe0 60, max_size:data 1372, mem_usage:GPU:0 9.8GB, num_seqs 5, 1.203 sec/step, elapsed 0:34:37, exp. remaining 0:00:00, complete 99.99%
train epoch 101, step 2277, cost:ctc 0.5404130568496726, cost:ctc:exp 1.7167158169826067, cost:output/output_prob 0.3057799498138394, cost:output/output_prob:exp 1.3576835151642892, error:ctc 0.14335664175450802, error:output/output_prob 0.06872852332890034, loss 322.9364, max_size:bpe 75, max_size:bpe0 74, max_size:data 1434, mem_usage:GPU:0 9.8GB, num_seqs 5, 1.341 sec/step, elapsed 0:34:38, exp. remaining 0:00:00, complete 99.99%
[2020-07-14 18:54:00.136244: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Stalled ranks:
3: [global_tensor_horovod_sum_have_data/HorovodAllreduce_globals_horovod_have_more_data_placeholder_0, global_tensor_horovod_sum_have_error/HorovodAllreduce_globals_horovod_have_error_placeholder_0]
[2020-07-14 18:55:00.137283: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
...
```
This last message then repeats every 60 seconds, and nothing else happens anymore:
```
[2020-07-14 22:32:00.724578: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Stalled ranks:
3: [global_tensor_horovod_sum_have_data/HorovodAllreduce_globals_horovod_have_more_data_placeholder_0, global_tensor_horovod_sum_have_error/HorovodAllreduce_globals_horovod_have_error_placeholder_0]
```
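As an aside: the 60-second warning interval comes from Horovod's stall inspector. If I read `horovod/common/stall_inspector.cc` correctly, it can be tuned via environment variables (set before Horovod initializes), e.g. to make a stalled job abort instead of hanging for hours:
```python
import os

# Assumed Horovod env vars (from horovod/common/stall_inspector.cc); they
# must be set before hvd.init() spawns the background thread:
os.environ["HOROVOD_STALL_CHECK_TIME_SECONDS"] = "60"       # warning interval
os.environ["HOROVOD_STALL_SHUTDOWN_TIME_SECONDS"] = "1800"  # abort after 30 min stall

import horovod.tensorflow as hvd
hvd.init()
```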
When I log in to the node of rank 3 and send a SIGUSR1, I get this traceback (via `faulthandler`; other threads omitted as irrelevant), i.e. we can see that it hangs in `sess.run`:
```
Current thread 0x00007f4cf0377700 (most recent call first):
File "/work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429 in _call_tf_sessionrun
File "/work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341 in _run_fn
File "/work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356 in _do_call
File "/work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350 in _do_run
File "/work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173 in _run
File "/work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950 in run
File "/u/zeyer/setups/switchboard/2020-06-09--e2e-multi-gpu/crnn/TFEngine.py", line 680 in run
File "/u/zeyer/setups/switchboard/2020-06-09--e2e-multi-gpu/crnn/TFEngine.py", line 1535 in train_epoch
File "/u/zeyer/setups/switchboard/2020-06-09--e2e-multi-gpu/crnn/TFEngine.py", line 1427 in train
File "crnn/rnn.py", line 449 in execute_main_task
File "crnn/rnn.py", line 639 in main
File "crnn/rnn.py", line 651 in <module>
```
Note that this is the standard train-step `sess.run`, i.e. nothing Horovod-specific. In fact, with these settings, there should be no Horovod op involved in this call at all.
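For reference, the SIGUSR1 dump uses the standard `faulthandler` module, registered at startup along these lines:
```python
import faulthandler
import signal

# On `kill -USR1 <pid>`, dump the tracebacks of all Python threads to stderr:
faulthandler.register(signal.SIGUSR1)
```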
The C traceback (via `gdb -p 1594 -ex 'thread apply all bt' -ex="set confirm off" -ex quit > gdblog.p1594.txt`) is
[here](https://gist.github.com/albertz/8d8ba27d27513b73cfad4865ebc8b13b). Some of the maybe-interesting threads (`WaitForWork` loops and Python threads excluded):
```
Thread 52 (Thread 0x7f4b857fe700 (LWP 1758)):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x00007f4cbc79ed73 in nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_*, timespec) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2 0x00007f4cbc79e491 in nsync::nsync_sem_wait_with_cancel_(nsync::waiter*, timespec, nsync::nsync_note_s_*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3 0x00007f4cbc79b752 in nsync::nsync_cv_wait_with_deadline_generic(nsync::nsync_cv_s_*, void*, void (*)(void*), void (*)(void*), timespec, nsync::nsync_note_s_*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4 0x00007f4cbc79bc63 in nsync::nsync_cv_wait_with_deadline(nsync::nsync_cv_s_*, nsync::nsync_mu_s_*, timespec, nsync::nsync_note_s_*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5 0x00007f4cbb2359d2 in tensorflow::EventMgr::PollLoop() ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6 0x00007f4cb22088ca in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()
Thread 48 (Thread 0x7f4bb37fe700 (LWP 1752)):
#0 0x00007ffe58bddb6d in clock_gettime ()
#1 0x00007f4cef511936 in __GI___clock_gettime (clock_id=4, tp=0x7f4bb37fb070) at ../sysdeps/unix/clock_gettime.c:115
#2 0x00007f4bfa8a1d0e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007f4bfa95cc77 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007f4bfa85e437 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5 0x00007f4bfa7800c6 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6 0x00007f4bfa8f0e60 in cuEventSynchronize () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7 0x00007f4cbcc8e144 in stream_executor::gpu::GpuDriver::GetEventElapsedTime(stream_executor::gpu::GpuContext*, float*, CUevent_st*, CUevent_st*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#8 0x00007f4cbb395b77 in stream_executor::gpu::GpuTimer::GetElapsedMilliseconds() const ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#9 0x00007f4cb26d62c8 in stream_executor::gpu::CudnnSupport::DoConvolve(stream_executor::dnn::ConvolutionKind, stream_executor::dnn::DataType, stream_executor::Stream*, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemoryBase, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemoryBase, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemoryBase, stream_executor::dnn::ConvolutionDescriptor const&, stream_executor::dnn::AlgorithmDesc, stream_executor::DeviceMemory<unsigned char>, stream_executor::dnn::ProfileResult*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.1
#10 0x00007f4cbcf46506 in stream_executor::Stream::ThenConvolveWithAlgorithm(stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float> const&, stream_executor::dnn::FilterDescriptor const&, stream_executor::DeviceMemory<float> const&, stream_executor::dnn::ConvolutionDescriptor const&, stream_executor::dnn::BatchDescriptor const&, stream_executor::DeviceMemory<float>*, stream_executor::ScratchAllocator*, stream_executor::dnn::AlgorithmConfig const&, stream_executor::dnn::ProfileResult*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#11 0x00007f4cba29d180 in tensorflow::LaunchConv2DOp<Eigen::GpuDevice, float>::operator()(tensorflow::OpKernelContext*, bool, bool, tensorflow::Tensor const&, tensorflow::Tensor const&, int, int, int, int, tensorflow::Padding const&, std::vector<long long, std::allocator<long long> > const&, tensorflow::Tensor*, tensorflow::TensorFormat) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#12 0x00007f4cba29ddd4 in tensorflow::Conv2DOp<Eigen::GpuDevice, float>::Compute(tensorflow::OpKernelContext*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#13 0x00007f4cb2114a0a in tensorflow::BaseGPUDevice::ComputeHelper(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.1
#14 0x00007f4cb2115605 in tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.1
#15 0x00007f4cb216f2c1 in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.1
#16 0x00007f4cb216f37f in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(absl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8ul, std::allocator<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode> > const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/../libtensorflow_framework.so.1
#17 0x00007f4cb22088ca in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()
Thread 46 (Thread 0x7f4bb3fff700 (LWP 1750)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1 0x00007f4bfa8a57d7 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007f4bfa84af27 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007f4bfa8a4a58 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007f4cef7cd6ba in start_thread (arg=0x7f4bb3fff700) at pthread_create.c:333
#5 0x00007f4cef5034dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 42 (Thread 0x7f4c00ff9700 (LWP 1744)):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x00007f4cbcf76361 in absl::synchronization_internal::Waiter::Wait(absl::synchronization_internal::KernelTimeout) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2 0x00007f4cbcf761c1 in AbslInternalPerThreadSemWait ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3 0x00007f4cbcf778e5 in absl::Mutex::Block(absl::base_internal::PerThreadSynch*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4 0x00007f4cbcf7887c in absl::Mutex::AwaitCommon(absl::Condition const&, absl::synchronization_internal::KernelTimeout) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5 0x00007f4cbcf7890d in absl::Mutex::Await(absl::Condition const&) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6 0x00007f4cbce5d75c in stream_executor::host::HostStream::WorkLoop() ()
Thread 3 (Thread 0x7f4c8e1da700 (LWP 1626)):
#0 0x00007f4cef4f780d in poll () at ../sysdeps/unix/syscall-template.S:84
#1 0x00007f4c902e9e58 in ?? () from /usr/lib/libopen-pal.so.13
#2 0x00007f4c902e06fb in opal_libevent2021_event_base_loop () from /usr/lib/libopen-pal.so.13
#3 0x00007f4c9055bb8e in ?? () from /usr/lib/libopen-rte.so.12
#4 0x00007f4cef7cd6ba in start_thread (arg=0x7f4c8e1da700) at pthread_create.c:333
Thread 2 (Thread 0x7f4c90208700 (LWP 1625)):
#0 0x00007f4cef4f780d in poll () at ../sysdeps/unix/syscall-template.S:84
#1 0x00007f4c902e9e58 in ?? () from /usr/lib/libopen-pal.so.13
#2 0x00007f4c902e06fb in opal_libevent2021_event_base_loop () from /usr/lib/libopen-pal.so.13
#3 0x00007f4c902aa238 in opal_progress () from /usr/lib/libopen-pal.so.13
#4 0x00007f4c909eef65 in ompi_request_default_wait_all () from /usr/lib/libmpi.so.12
#5 0x00007f4c843f9426 in ompi_coll_tuned_allreduce_intra_recursivedoubling ()
from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#6 0x00007f4c909fef23 in PMPI_Allreduce () from /usr/lib/libmpi.so.12
#7 0x00007f4c90f3eadc in horovod::common::MPIController::CrossRankBitwiseAnd (this=<optimized out>, bitvector=...,
count=<optimized out>) at horovod/common/mpi/mpi_controller.cc:90
#8 0x00007f4c90f08cd2 in horovod::common::CacheCoordinator::sync (this=this@entry=0x7f4c90206fe0, controller=
std::shared_ptr (count 2, weak 1) 0x26158e0, timeline_enabled=<optimized out>) at horovod/common/response_cache.cc:390
#9 0x00007f4c90ed8c6b in horovod::common::Controller::CoordinateCacheAndState (this=this@entry=0x26158e0,
cache_coordinator=...) at horovod/common/controller.cc:615
#10 0x00007f4c90ee002a in horovod::common::Controller::ComputeResponseList (this=0x26158e0, shut_down=..., state=...)
at horovod/common/controller.cc:137
#11 0x00007f4c90ef949b in horovod::common::(anonymous namespace)::RunLoopOnce (state=...)
at horovod/common/operations.cc:568
#12 horovod::common::(anonymous namespace)::BackgroundThreadLoop (state=...) at horovod/common/operations.cc:509
Thread 1 (Thread 0x7f4cf0377700 (LWP 1594)):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x00007f4cbc79ed73 in nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_*, timespec) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2 0x00007f4cbc79e491 in nsync::nsync_sem_wait_with_cancel_(nsync::waiter*, timespec, nsync::nsync_note_s_*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3 0x00007f4cbc79b752 in nsync::nsync_cv_wait_with_deadline_generic(nsync::nsync_cv_s_*, void*, void (*)(void*), void (*)(void*), timespec, nsync::nsync_note_s_*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4 0x00007f4cbc79bc63 in nsync::nsync_cv_wait_with_deadline(nsync::nsync_cv_s_*, nsync::nsync_mu_s_*, timespec, nsync::nsync_note_s_*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5 0x00007f4cb7c1026b in tensorflow::DirectSession::WaitForNotification(tensorflow::Notification*, long long) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6 0x00007f4cb7c102cb in tensorflow::DirectSession::WaitForNotification(tensorflow::DirectSession::RunState*, tensorflow::CancellationManager*, long long) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7 0x00007f4cb7c18351 in tensorflow::DirectSession::RunInternal(long long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#8 0x00007f4cb7c2395b in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#9 0x00007f4cb54383aa in tensorflow::SessionRef::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#10 0x00007f4cb7c55a69 in TF_Run_Helper(tensorflow::Session*, char const*, TF_Buffer const*, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorflow::Tensor> > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, TF_Tensor**, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, TF_Buffer*, TF_Status*) [clone .constprop.672] ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#11 0x00007f4cb7c562ed in TF_SessionRun ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#12 0x00007f4cb5434301 in tensorflow::TF_SessionRun_wrapper_helper(TF_Session*, char const*, TF_Buffer const*, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<_object*, std::allocator<_object*> > const&, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<TF_Operation*, std::allocator<TF_Operation*> > const&, TF_Buffer*, TF_Status*, std::vector<_object*, std::allocator<_object*> >*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#13 0x00007f4cb54343a2 in tensorflow::TF_SessionRun_wrapper(TF_Session*, TF_Buffer const*, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<_object*, std::allocator<_object*> > const&, std::vector<TF_Output, std::allocator<TF_Output> > const&, std::vector<TF_Operation*, std::allocator<TF_Operation*> > const&, TF_Buffer*, TF_Status*, std::vector<_object*, std::allocator<_object*> >*) ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#14 0x00007f4cb53f2923 in _wrap_TF_SessionRun_wrapper ()
from /work/tools/asr/python/3.7.1_tf_1.14-generic+cuda10.1/lib/python3.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#15 0x00007f4cefafa1b5 in _PyMethodDef_RawFastCallKeywords () at ../Objects/call.c:694
```
* I wonder why it hangs in `CudnnSupport::DoConvolve`, or more specifically in `GpuDriver::GetEventElapsedTime`, or even more specifically in `cuEventSynchronize`, which looks like a CUDA issue. Or maybe I misunderstand how this [`GpuTimer`](https://github.com/tensorflow/tensorflow/blob/fa2324cc4860028ceb2bf8b2ed8f99af635b3e31/tensorflow/stream_executor/gpu/gpu_timer.cc) works, or how it is used in [`DoConvolve`](https://github.com/tensorflow/tensorflow/blob/fe9b27b306652108c6bbc505b3d98c89315cb93b/tensorflow/stream_executor/cuda/cuda_dnn.cc#L3087). Maybe it is [this issue](https://stackoverflow.com/questions/25979764/cuda-hangs-on-cudadevicesynchronize-randomly), or [this one](https://forums.developer.nvidia.com/t/synchronization-hangs-sporadically-after-kernel-launch/38993)? Also, if this really is a CUDA-related issue, why has it never occurred without Horovod? (See the autotuning note after this list.)
* I also wonder why `opal_libevent2021_event_base_loop` shows up in two different threads (threads 2 and 3).
* As mentioned above, with the given settings the `sess.run` of the main thread should not involve any Horovod op, so no MPI communication should be running at this point. However, thread 2 is inside `PMPI_Allreduce` (via Horovod's `BackgroundThreadLoop` → `ComputeResponseList` → `CacheCoordinator::sync`). Why?
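Regarding the first point: `DoConvolve` only uses the `GpuTimer` when it is called with a `ProfileResult`, i.e. during cuDNN autotuning. If the hang is really in that timer path, one experiment would be to disable cuDNN autotuning altogether (TensorFlow reads `TF_CUDNN_USE_AUTOTUNE` in `tensorflow/core/util/use_cudnn.cc`), so that path is never taken:
```python
import os

# Disable cuDNN convolution autotuning; with autotuning off, DoConvolve
# should never run algorithms under a GpuTimer / GetEventElapsedTime.
# Must be set before TensorFlow is imported/initialized.
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"

import tensorflow as tf
```
This would of course only narrow the problem down, not explain why it happens only with Horovod.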
Note that this is not the first time I'm seeing this; I've already seen it a couple of times. Originally I hoped it was some temporary issue in our cluster, but there seems to be a real problem or bug. I don't know whether it is on our side, in OpenMPI (we have a quite old version), or in Horovod.