Detection training (resnet18) working on Tesla V100 GPU, but not on RTX 2080 Ti

Hi,

I am training an object detection network (ResNet-18). On the Tesla V100, everything works well (both your sample notebook and my own classes). I am not using a predefined model.

When I try to run the same code (in your docker) on an RTX 2080 Ti, it fails; see the log below. This error message appears when running your notebook step by step.

Any ideas?

Thanks, Best
Danny.


INFO:tensorflow:Graph was finalized.
2019-08-29 07:46:06,019 [INFO] tensorflow: Graph was finalized.
2019-08-29 07:46:06.019813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2019-08-29 07:46:06.019840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-08-29 07:46:06.019847: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2019-08-29 07:46:06.019854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2019-08-29 07:46:06.019960: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5604 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5)
INFO:tensorflow:Running local_init_op.
2019-08-29 07:46:08,679 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2019-08-29 07:46:09,362 [INFO] tensorflow: Done running local_init_op.
2019-08-29 07:48:01.491310: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x6a64350
2019-08-29 07:48:01.686772: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at matrix_inverse_op.cc:191 : Internal: tensorflow/core/kernels/cuda_solvers.cc:803: cuBlas call failed status = 13
Traceback (most recent call last):
File "/usr/local/bin/tlt-train-g1", line 10, in <module>
sys.exit(main())
File "./common/magnet_train.py", line 24, in main
File "</usr/local/lib/python2.7/dist-packages/decorator.pyc:decorator-gen-106>", line 2, in main
File "./drivenet/common/timer.py", line 46, in wrapped_fn
File "./dashnet/scripts/train.py", line 627, in main
File "./dashnet/scripts/train.py", line 552, in run_experiment
File "./dashnet/scripts/train.py", line 482, in train_dashnet
File "./dashnet/scripts/train.py", line 138, in run_training_loop
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 567, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1134, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1119, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1191, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 971, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: tensorflow/core/kernels/cuda_solvers.cc:803: cuBlas call failed status = 13
[[Node: MatrixInverse_2 = MatrixInverse[T=DT_FLOAT, adjoint=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]]

Caused by op u'MatrixInverse_2', defined at:
File "/usr/local/bin/tlt-train-g1", line 10, in <module>
sys.exit(main())
File "./common/magnet_train.py", line 24, in main
File "</usr/local/lib/python2.7/dist-packages/decorator.pyc:decorator-gen-106>", line 2, in main
File "./drivenet/common/timer.py", line 46, in wrapped_fn
File "./dashnet/scripts/train.py", line 627, in main
File "./dashnet/scripts/train.py", line 552, in run_experiment
File "./dashnet/scripts/train.py", line 457, in train_dashnet
File "./dashnet/scripts/train.py", line 291, in build_training_graph
File "./drivenet/common/dataloader/default_dataloader.py", line 199, in get_dataset_tensors
File "./drivenet/common/dataloader/default_dataloader.py", line 254, in _generate_images_and_ground_truth_labels
File "./drivenet/common/dataloader/default_dataloader.py", line 308, in _apply_augmentations_to_input_tensors
File "./drivenet/common/dataloader/augment.py", line 300, in apply_all_transformations_to_image
File "./drivenet/common/dataloader/augment.py", line 269, in apply_spatial_transformations_to_image
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_linalg_ops.py", line 1049, in matrix_inverse
"MatrixInverse", input=input, adjoint=adjoint, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InternalError (see above for traceback): tensorflow/core/kernels/cuda_solvers.cc:803: cuBlas call failed status = 13
[[Node: MatrixInverse_2 = MatrixInverse[T=DT_FLOAT, adjoint=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]]

Hi dannykario,
The docker you are using was built with CUDA 9, and CUDA 9 does not officially support Turing GPUs. That is why training fails on the RTX 2080 Ti.

We will release a new version of TLT soon. The new TLT docker image will install CUDA 10, which supports Turing.
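The mismatch above can be sanity-checked with a small sketch. The mapping below is an assumption drawn from NVIDIA's published support history (CUDA 9 added Volta / compute capability 7.0, CUDA 10.0 added Turing / 7.5); it is not part of TLT or TensorFlow, just an illustration of why the same docker works on a V100 but not on a 2080 Ti:

```python
# Hypothetical helper: first CUDA toolkit version that can target each
# GPU compute capability (assumed mapping, not an official API).
MIN_CUDA_FOR_CC = {
    (7, 0): 9.0,   # Volta, e.g. Tesla V100
    (7, 5): 10.0,  # Turing, e.g. GeForce RTX 2080 Ti
}

def cuda_supports_gpu(cuda_version, compute_capability):
    """Return True if the given CUDA toolkit can target the GPU."""
    return cuda_version >= MIN_CUDA_FOR_CC[compute_capability]

print(cuda_supports_gpu(9.0, (7, 0)))   # V100 on CUDA 9 docker  -> True
print(cuda_supports_gpu(9.0, (7, 5)))   # 2080 Ti on CUDA 9 docker -> False
print(cuda_supports_gpu(10.0, (7, 5)))  # 2080 Ti on CUDA 10 docker -> True
```

The log line "compute capability: 7.5" in your output confirms the 2080 Ti is a Turing part, so the CUDA 9 docker cannot run kernels on it, which surfaces indirectly as the cuBLAS internal error.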