Hello,
I am getting started with TensorFlow (TF) on a Quadro P600 under Ubuntu 18.04.
I installed the latest driver, Docker 19.03 and the nvidia-container-toolkit package.
My short-term goal is to be able to run and train a CNN for image detection.
I am trying to run the following tutorial: Google Colab
During the following call:
output_dict = model(input_tensor)
The Python kernel crashes with a segmentation fault.
I tried running TF with both the NVIDIA Docker image (nvcr.io/nvidia/tensorflow:19.11-tf2-py3) and the Docker image provided by TF, and I get the same issue with both. I also tested the code on CPU only, and it works fine.
Can you help?
Many thanks in advance.
Regards,
Gilles
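For anyone who wants to repeat the CPU-only check mentioned above, here is a minimal sketch of one common way to do it: hiding the GPU from CUDA before TensorFlow is imported. This is an assumption about how such a test can be set up, not necessarily how it was done in this post. The import is guarded so the snippet also runs where TensorFlow is not installed.

```python
import os

# Hide all CUDA devices from this process. This must be set BEFORE
# TensorFlow is imported, otherwise the GPU may already be initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

try:
    import tensorflow as tf
except ImportError:
    tf = None  # TensorFlow not installed; nothing more to check

if tf is not None:
    # With the GPU hidden, this list is expected to be empty and every op
    # (including the model call from the tutorial) is placed on the CPU.
    gpus = tf.config.experimental.list_physical_devices("GPU")
    print("Visible GPUs:", gpus)
```

If the model call succeeds with the GPU hidden and crashes with it visible, that points at a GPU kernel rather than at the model or the input data.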
I ran a Python script with equivalent code under gdb.
Here is the backtrace of the segfault:
#0 0x00007fc2808636e9 in tensorflow::NonMaxSuppressionV2GPUOp::Compute(tensorflow::OpKernelContext*) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so
#1 0x00007fc27c1a7b52 in tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2
#2 0x00007fc27c205339 in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2
#3 0x00007fc27c2058ff in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(absl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8ul, std::allocator<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode> > const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2
#4 0x00007fc27c2b6121 in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2
#5 0x00007fc27c2b3818 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2
#6 0x00007fc2761e166f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007fc2e50096db in start_thread (arg=0x7fc0bffff700) at pthread_create.c:463
#8 0x00007fc2e534288f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
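As a complement to the gdb backtrace above, Python's standard faulthandler module can record which Python line was executing when the segfault happened. The sketch below is only an illustration: the log file name and the run_inference placeholder are hypothetical and not part of the original script.

```python
import faulthandler
import os
import tempfile

# Hypothetical log location; any writable path works.
TRACE_PATH = os.path.join(tempfile.gettempdir(), "tf_segfault_trace.log")

# The file must stay open for the lifetime of the process, because
# faulthandler writes to its file descriptor from the signal handler.
trace_file = open(TRACE_PATH, "w")

# On SIGSEGV (and SIGFPE, SIGABRT, SIGBUS, SIGILL), dump the Python
# traceback of all threads to the log file.
faulthandler.enable(file=trace_file, all_threads=True)


def run_inference():
    # Placeholder for the failing call from the tutorial, for example:
    #     output_dict = model(input_tensor)
    return None


run_inference()
```

The native frames still come from gdb; faulthandler only adds the Python-level view, which helps confirm which call in the script triggers the crash.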
Well, I think I figured it out by myself.
The bug is in TF; it is fixed in the nightly release: https://github.com/tensorflow/tensorflow/issues/32261