Profiling hangs in cuda/cupti .so

Hi, I am using the TensorFlow profiler to profile the training of my model; I start it roughly as shown in the sketch below. Before the training starts I get the following lines:

2023-12-10 16:31:15.256811: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2023-12-10 16:31:15.257177: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1614] Profiler found 8 GPUs
2023-12-10 16:31:15.669753: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2023-12-10 16:31:15.670454: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1748] CUPTI activity buffer flushed
[2023-12-10 16:31:15,691 | trajectory_predictor.neural.models | 33727 | INFO] Tensorboard logs will be available in /tmp/tmp9vfwctuu_tb_logs
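
For reference, this is roughly how I start the profiler (a minimal sketch with a stand-in model; our real model is much larger and the log directory is a tempdir created at runtime):

import numpy as np
import tensorflow as tf

logdir = "/tmp/tb_logs"  # placeholder path; ours is generated at runtime

# Trivial stand-in model and data, just to make the sketch runnable.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")
x = np.random.rand(256, 8).astype("float32")
y = np.random.rand(256, 1).astype("float32")

# TensorBoard callback with on-the-fly profiling of batches 2..4.
tb = tf.keras.callbacks.TensorBoard(log_dir=logdir, profile_batch=(2, 4))
model.fit(x, y, epochs=1, callbacks=[tb])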

However, when the real training starts, it hangs:

2023-12-10 16:31:18.057702: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2023-12-10 16:31:18.058010: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.

I got the following backtrace:

#0  futex_abstimed_wait (private=0, abstime=0x0, clockid=0, expected=2, futex_word=<optimized out>) at ../sysdeps/nptl/futex-internal.h:284
#1  __pthread_rwlock_wrlock_full (abstime=0x0, clockid=0, rwlock=0x895a2a0) at pthread_rwlock_common.c:830
#2  __GI___pthread_rwlock_wrlock (rwlock=0x895a2a0) at pthread_rwlock_wrlock.c:27
#3  0x00007fdbd52fa258 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fdbd523fcc1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007fd95a8bc01a in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#6  0x00007fd95a8ba35c in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#7  0x00007fd95a89ae62 in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#8  0x00007fd95a8979b2 in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#9  0x00007fd95a89891b in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#10 0x00007fd95a86aa86 in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#11 0x00007fd95a86acf8 in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#12 0x00007fd95a86be6c in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#13 0x00007fdbd5058b5b in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#14 0x00007fdbd52ff6a0 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#15 0x00007fdbd502c7a6 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#16 0x00007fdbd502e792 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#17 0x00007fdbd512f2ca in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#18 0x00007fdc6a1281cb in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudart.so.11.0
#19 0x00007fdc6a16b7e6 in cudaLaunchKernel () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudart.so.11.0
#20 0x00007fdc7f7b6987 in ?? () from /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#21 0x00007fdc7f7b8715 in tensorflow::functor::FillFunctor<Eigen::GpuDevice, float>::operator()(Eigen::GpuDevice const&, Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorMap<Eigen::TensorFixedSize<float const, Eigen::Sizes<>, 1, long>, 16, Eigen::MakePointer>) () from /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

It seems that libcuda.so.1 and libcupti.so.11.1 ship without debug symbols and are proprietary NVIDIA binaries, so is there any way to find out what is happening?
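
Since frame #0 is a futex wait inside pthread_rwlock_wrlock, presumably some other thread is holding that rwlock. One thing I can do even without symbols is dump backtraces of all threads by driving gdb in batch mode (a sketch; the pid is the one from the log above):

import subprocess

pid = 33727  # pid of the hung training process (from the log above)

# Attach gdb non-interactively and print a backtrace of every thread;
# the thread that currently holds the rwlock should appear here too.
result = subprocess.run(
    ["gdb", "-p", str(pid), "-batch",
     "-ex", "set pagination off",
     "-ex", "thread apply all bt"],
    capture_output=True, text=True,
)
print(result.stdout)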

UPD: I tried running strace -f -p my_pid and found a lot of lines like:

 poll([{fd=20, events=POLLIN}, {fd=22, events=POLLIN}, {fd=24, events=POLLIN}, {fd=26, events=POLLIN}, {fd=28, events=POLLIN}, {fd=29, events=POLLI

lsof -p showed that these fds refer to /dev/nvidia0.
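
The same fd-to-path mapping can also be read straight from /proc (a small sketch, same pid as above):

import os

pid = 33727  # pid of the hung process

# Every entry in /proc/<pid>/fd is a symlink to the underlying file;
# fds 20, 22, 24, ... resolve to /dev/nvidia* devices here.
for fd in sorted(os.listdir(f"/proc/{pid}/fd"), key=int):
    try:
        target = os.readlink(f"/proc/{pid}/fd/{fd}")
    except OSError:
        continue  # fd was closed between listdir() and readlink()
    print(fd, "->", target)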

Hi, @annie09032002ka

Thanks for contacting us!
Which driver are you using? We have seen a similar issue before, and upgrading the driver fixed it. Please try with the latest driver.

You can also try upgrading your CUDA version, as 11.1 is quite old; the sketch below shows how to check what your TensorFlow build expects.
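
To see which CUDA/cuDNN versions your TensorFlow wheel was built against, you can inspect its build metadata (a sketch; the exact keys can vary between TF versions):

import tensorflow as tf

# Build metadata of the installed TensorFlow wheel; 'cuda_version'
# and 'cudnn_version' show what to match when upgrading CUDA.
info = tf.sysconfig.get_build_info()
print("CUDA :", info.get("cuda_version"))
print("cuDNN:", info.get("cudnn_version"))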

On one host I have Driver Version: 470.182.03, on another Driver Version: 525.60.13. If I profile a very simple model from the internet, everything is fine, but profiling our model hangs with the logs above. So could you give us hints on how to find out why we are stuck polling those descriptors?

Suggestions like "try simplifying your model" are rather hard for us to act on, but pointers for digging in with gdb or similar tools would be great.

You mentioned two hosts. Do you mean the issue can be reproduced with both driver 470 and driver 525? On which machine are you running the TensorFlow profiler?

Yes, it can be reproduced with both drivers. I have tried running my training on both of them.

Hi, @annie09032002ka

Thanks for trying both. However, it is hard to tell the reason without an internal repro.
Is it possible to provide us with instructions to reproduce the issue?
