TensorFlow 1.14 is not working on an RTX 3090 inside a Docker container (Ubuntu 18.04, CUDA 10.0, Python 2)

Description

TensorFlow 1.14 is not working with an RTX 3090 inside a CUDA 10.0 Docker container.
I wrote a reinforcement learning program, and it works in the Docker container I created on a GTX 1080 Ti.
I moved to a new PC with an RTX 3090, but there it doesn't work.
The program takes about 20 minutes to start, and then the errors below are shown.
On the GTX 1080 Ti PC it starts without a long wait.

Note that some CUDA 10.0 samples do work on the RTX 3090 PC inside the Docker container.

The error output is as follows:
2020-12-01 08:12:06.722138: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2020-12-01 08:12:06.722611: I tensorflow/stream_executor/stream.cc:4838] [stream=0x555d1dba6960,impl=0x555d1dd5fe80] did not memzero GPU location; source: 0x7f1872ffbd20
2020-12-01 08:12:06.722627: I tensorflow/stream_executor/stream.cc:315] did not allocate timer: 0x7f1872ffbd30
2020-12-01 08:12:06.722633: I tensorflow/stream_executor/stream.cc:1839] [stream=0x555d1dba6960,impl=0x555d1dd5fe80] did not enqueue 'start timer': 0x7f1872ffbd30
2020-12-01 08:12:06.722642: I tensorflow/stream_executor/stream.cc:1851] [stream=0x555d1dba6960,impl=0x555d1dd5fe80] did not enqueue 'stop timer': 0x7f1872ffbd30
2020-12-01 08:12:06.722651: F tensorflow/stream_executor/gpu/gpu_timer.cc:65] Check failed: start_event_ != nullptr && stop_event_ != nullptr
Aborted (core dumped)

Environment

GPU Type: GTX 1080 Ti (works), RTX 3090 (doesn't work)
Nvidia Driver Version: 418.40.04 (1080 Ti PC), 455.45.01 (3090 PC)
CUDA Version: 10.1 (1080 Ti PC host), 11.1 (3090 PC host), 10.0 (Docker)
CUDNN Version: 7
Operating System + Version: Ubuntu 18.04 (1080 Ti PC), Ubuntu 20.04 (3090 PC), Ubuntu 18.04 (Docker)
Python Version (if applicable): 2.7
TensorFlow Version (if applicable): 1.14
Baremetal or Container (if container, which image + tag): container, built on the base image nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04

Hi @asobod11138,
I suggest you raise this concern on the CUDA forum; they should be able to help you better there.

Thanks!


Thanks! I moved this question to the CUDA forum.

I am wondering if TensorFlow 1.14 might not contain native code (SASS) for the RTX 3000 series of cards (Ampere generation), so it hits just-in-time (JIT) compilation of PTX in the CUDA driver on the host machine.

This would explain the long startup times.

I've previously seen crashes in JIT-compiled code due to driver bugs, and in one case an overly complex kernel ran out of (stack?) memory while being JIT-compiled.
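One way to check this hypothesis is to inspect which GPU architectures a library actually embeds using cuobjdump. The library path below is an assumption for a typical TF 1.14 pip install; adjust it to wherever TensorFlow's GPU code lives on your system:

```shell
# Hypothetical path; adjust to the actual TensorFlow shared library.
TF_LIB=/usr/local/lib/python2.7/dist-packages/tensorflow/libtensorflow_framework.so

# List embedded native code (SASS); if no sm_8x entry appears,
# there is no native Ampere code and the driver must JIT from PTX.
cuobjdump --list-elf "$TF_LIB"

# List embedded PTX; if no PTX is present either,
# the library cannot run on an RTX 3090 at all.
cuobjdump --list-ptx "$TF_LIB"
```

The JIT hypothesis fits if the output shows only older sm_xx targets plus PTX: the driver then recompiles that PTX for sm_86 at load time, which matches the 20-minute startup.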

Is upgrading to TensorFlow 1.15 an option? From what I gather, it has Ampere support.


This behavior isn't surprising. You might wish to use a TF container that has already been built with software versions that support Ampere. Such containers are available on NGC, and you can check the release notes to see which software versions each one contains.
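For reference, pulling and running an NGC TensorFlow container might look like the sketch below. The tag is only an example; consult the NGC release notes for a release built against CUDA 11.x (Ampere support):

```shell
# Example tag only; pick one from the NGC TensorFlow release notes
# whose CUDA version supports Ampere (CUDA 11.1 or newer).
docker pull nvcr.io/nvidia/tensorflow:20.12-tf1-py3

# Run with GPU access (requires the NVIDIA Container Toolkit on the host).
docker run --gpus all -it --rm nvcr.io/nvidia/tensorflow:20.12-tf1-py3
```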


Thank you very much!

It seemed like a good solution, so I tried it, but I couldn't solve my problem because of Python 2.
I am actually using ROS Melodic, so my program needs to run on Python 2.

I tried to find a container with Python 2; the latest one is 20.01-tf1-py2, released on 01/28/2020.

I suspected it wouldn't support Ampere, and indeed the same problem occurred when I tried it.

Is there any container supporting Ampere with TensorFlow 1 and Python 2?


[Edit to add more information]
I said "the same problem", but it is not exactly the same.
The long wait at startup is the same, but the program does run.

There isn't one on NGC that I know of. Python 2 support was dropped a while ago. You may find one elsewhere on the web, or you can try building your own.


Alright. Thank you very much for your kindness.

Hi @Robert_Crovella, thanks for answering the question. Would you mind elaborating on why this behavior isn't surprising?
I am in a very similar situation and use a cache so the JIT doesn't rerun on every execution, which seems to work fine (also TF 1.14, inside Docker with CUDA 10, host driver 460.39 with an RTX 3090):

export CUDA_CACHE_DISABLE=0
export CUDA_CACHE_MAXSIZE=2147483647

The compilation works and the program processes the data. However, the outputs are seemingly random numbers (which isn't the case when running the same model on a 2080 without Docker). Conceptually, should the JIT compilation work, or is running CUDA 10 inside a container in principle not supported on Ampere? (Based on the documentation I've seen so far, I thought it would be.)
Switching to a different base container would mean a lot of work in our case, hence I'm trying to understand the problem better first. More information would be very appreciated.

I wouldn't be able to say anything specific about this case. The reason I find it unsurprising that something went wrong in the originally stated case is that the OP was running a container that was never designed to support an Ampere GPU and was never tested on one before it was released. Bugs are always possible in any software, and in my experience such situations (software running on hardware it has never been tested on) are a more likely place for them.

Theoretically, CUDA has a "forward-compatibility path" that involves JIT compilation of PTX, as you already know. If all the libraries in use contain PTX, and the compilation settings are correct, this should allow a binary to run on a newer GPU. Indeed, the fact that compilation proceeds and the runtime appears to work without any errors thrown by the CUDA runtime suggests to me that it is working in some fashion. The actual problem in your case may lie somewhere else, or this forward-compatibility mechanism may not be working correctly here for some reason (a bug, etc.). Bugs are always possible.
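For context on that path: whether JIT forward compatibility is available at all is decided when the binaries are compiled. A minimal nvcc sketch (architectures chosen only as an example) that embeds both native code for an older GPU and PTX for forward JIT compilation:

```shell
# Embed sm_61 SASS (Pascal, e.g. GTX 1080 Ti) plus compute_61 PTX.
# On an RTX 3090 (sm_86) the driver has no matching SASS, so it
# JIT-compiles the embedded PTX at load time -- the slow first start
# described earlier in this thread.
nvcc -gencode arch=compute_61,code=sm_61 \
     -gencode arch=compute_61,code=compute_61 \
     -o app app.cu
```

If a library is built with only `code=sm_61` (no `code=compute_61`), there is no PTX to JIT and it cannot run on Ampere at all.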