We are getting the error "RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED". Can anyone please help us?

When we run the code directly on Linux, there is no error and it runs properly. But when we build an image and run the container, we get the above error. We are using an “Nvidia Advantech mic 711-0x” device to run the code and the Docker container. In the Dockerfile, the base image is “nvidia/cuda:12.2.2-cudnn8-devel-ubuntu22.04”. On our device, the CUDA version is 12.2 and the cuDNN version is 8.9.0.4.

Hi, is the Linux system separate from the Advantech? What command are you using to create the container from the image? Maybe you need to give the container some permissions.

Daniel Gonzalez
Embedded SW Engineer at RidgeRun
Contact us: support@ridgerun.com
Developers wiki: https://developer.ridgerun.com/
Website: www.ridgerun.com

Hi @daniel.gonzalez3, no, the Linux I mentioned runs on the Advantech itself. We flashed Ubuntu 22.04 onto the Advantech using SDK Manager from an Intel NUC. We run the container with “docker run --rm <image name>” plus the command-line arguments our code requires. To enable the CUDA drivers (for the GPU) in Docker, I changed the daemon.json file to:

    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }

Can you please tell us which permissions should be given?
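One quick sanity check is whether the edited daemon.json still parses as JSON and really registers nvidia as the default runtime. The following is only a minimal sketch: the path `/etc/docker/daemon.json` is the usual default location and is assumed here, and `check_nvidia_runtime` is a hypothetical helper name, not part of Docker.

```python
import json
from pathlib import Path


def check_nvidia_runtime(path="/etc/docker/daemon.json"):
    """Return a list of problems found in a Docker daemon.json (empty if none)."""
    try:
        config = json.loads(Path(path).read_text(encoding="utf-8"))
    except FileNotFoundError:
        return [f"{path} does not exist"]
    except json.JSONDecodeError as exc:
        # Curly quotes pasted from a web page are one common cause of this.
        return [f"{path} is not valid JSON: {exc}"]

    if not isinstance(config, dict):
        return [f"{path} does not contain a JSON object at the top level"]

    problems = []
    if config.get("default-runtime") != "nvidia":
        problems.append('"default-runtime" is not set to "nvidia"')

    runtimes = config.get("runtimes")
    runtime = runtimes.get("nvidia") if isinstance(runtimes, dict) else None
    if not isinstance(runtime, dict):
        problems.append('no "nvidia" entry under "runtimes"')
    elif not runtime.get("path"):
        problems.append('the "nvidia" runtime has no "path"')
    return problems


if __name__ == "__main__":
    issues = check_nvidia_runtime()
    print("daemon.json looks fine" if not issues else "\n".join(issues))
```

Note that changes to daemon.json only take effect after the Docker daemon has been restarted.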

Hi @pranavvoleti, sorry for the late reply, I understand better now. Your code works well on your system without Docker, right? Maybe I can make some suggestions.

Try using a different Docker image as the base. You can look for one that fits your requirements in the NGC catalog.

This might be too complex, but if you installed CUDA manually on the Advantech, you could switch your Docker base image to a simpler one and replicate that installation process in the Dockerfile. Alternatively, there may be some other step you performed on the system that is missing from the Dockerfile.
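To spot what differs between the host and the container, it can help to run the same small diagnostic in both places and compare the output. The sketch below assumes the training code uses PyTorch (the format of the error message suggests it, but this has not been confirmed in the thread); `collect_gpu_env` is a hypothetical helper name.

```python
import platform
import sys


def collect_gpu_env():
    """Gather version info worth comparing between the host and the container."""
    info = {
        "python": sys.version.split()[0],
        "machine": platform.machine(),  # CPU architecture, e.g. aarch64 or x86_64
    }
    try:
        import torch  # assumption: the training code is based on PyTorch
    except ImportError:
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    info["torch_built_with_cuda"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    try:
        info["cudnn_available"] = torch.backends.cudnn.is_available()
        info["cudnn_version"] = torch.backends.cudnn.version()
    except RuntimeError as exc:
        # Raised, for example, when the cuDNN found at runtime does not
        # match the version PyTorch was compiled against.
        info["cudnn_error"] = str(exc)
    return info


if __name__ == "__main__":
    for key, value in collect_gpu_env().items():
        print(f"{key}: {value}")
```

If `cuda_available` is True on the host but False inside the container, the container is most likely not seeing the GPU at all. If the reported CUDA or cuDNN versions differ, the base image and the libraries on the device probably do not match.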

Hi @daniel.gonzalez3, the base image I am using already comes from that catalog, but I still get the error. I tried many other base images and got various other errors. The image I mentioned in my earlier message at least started the training, which then stopped immediately with the above error. If you need any details about our work to help solve this, I can try to provide them.

Hi @pranavvoleti, sorry for the late response. I see. Could you share more information about the kind of environment you are trying to run? Is it a Python/conda environment? Could you also share some sections of the Dockerfile you are using?
