CUDA kernels return NaN on Ubuntu 18.04 but work on 16.04


I hope this is the right forum to post this topic.

I have developed a CUDA application which I want to deploy on cloud platforms like AWS and GCP. My production system is based on Ubuntu 18.04, which is the main Linux distro I want to support; my customer base is pretty small and I don’t want to support a wide spread of distributions.

The application works fine on my production system and on every local Ubuntu 18.04 system I have deployed it to. When I deploy it to an Ubuntu 18.04 Server instance on GCP (or AWS, I tried both), the application starts and otherwise behaves normally, but calculations on the GPU return NaN. Here are my deployment steps:

  • Create a new Ubuntu 18.04 server instance
  • sudo apt update
  • sudo apt upgrade
  • sudo apt install libgl1-mesa-dev libxt-dev libegl1-mesa libxrender-dev libxi-dev libfontconfig1-dev xvfb libhdf5-dev
  • sudo apt install nvidia-driver-390
  • Reboot the system
  • Copy the application (and library dependencies) to the instance and run it
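
For reference, the steps above up to the reboot can be collected into one script. This is only a sketch of my procedure, using apt for the update and upgrade steps. The `run` helper, the `DRY_RUN` variable and the `-y` flags are additions for illustration and not part of what I originally typed; by default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of the deployment steps listed above.
# Default is a dry run that only prints the commands; set DRY_RUN=0 to execute them.
set -e

# Package names copied verbatim from the steps above.
DEPS="libgl1-mesa-dev libxt-dev libegl1-mesa libxrender-dev libxi-dev libfontconfig1-dev xvfb libhdf5-dev"
DRIVER_PKG="nvidia-driver-390"

# run: hypothetical helper. Prints the command, and executes it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

deploy() {
    run sudo apt update
    run sudo apt upgrade -y
    # DEPS is intentionally unquoted so that it splits into separate package names.
    run sudo apt install -y $DEPS
    run sudo apt install -y "$DRIVER_PKG"
    run sudo reboot
}

deploy
```

Running it as-is prints the five commands; running it with `DRY_RUN=0` executes them and ends with the reboot, after which the application still has to be copied over by hand.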

What is odd is that when I follow the same steps on an Ubuntu 16.04 server instance (building the application on 16.04, of course), everything works out of the box. The only differences I can see are the older toolkit version (7.5, the default that ships with 16.04, instead of 9.1, which ships with 18.04) and the older 384 driver.
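
To make the comparison between the two setups concrete, something like the following could be run on both the build machine and the cloud instance. It is a rough sketch: the `report` and `env_report` names are made up for this post, and it assumes `lsb_release`, `nvidia-smi` and `nvcc` are on the PATH where installed (the toolkit is usually only present on the build machine, in which case the script just prints "not found").

```shell
#!/bin/sh
# Sketch: print the version information that differs between the two setups.

# report: hypothetical helper. Prints a label plus the first line of a command's
# output, or "not found" when the tool is not installed.
report() {
    label="$1"; shift
    if command -v "$1" >/dev/null 2>&1; then
        echo "$label: $("$@" 2>&1 | head -n 1)"
    else
        echo "$label: not found"
    fi
}

env_report() {
    report "distro" lsb_release -ds
    report "kernel" uname -r
    # GPU model and installed driver version, one line per GPU (first one shown).
    report "driver" nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
    # nvcc prints a multi-line banner; keep only the line with the release number.
    if command -v nvcc >/dev/null 2>&1; then
        echo "toolkit: $(nvcc --version | grep release)"
    else
        echo "toolkit: not found"
    fi
}

env_report
```

Comparing the four lines from the working 16.04 setup with those from the failing 18.04 instance should at least confirm whether the driver and toolkit versions are really the only differences.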

Is there something that I am missing here? I could settle for building 16.04 versions of my application just for deployment in the cloud, but that is not satisfactory: it does not solve the underlying issue, and GCP uses newer GPUs with compute capability 7.0, which are not supported by the toolkit 7.5 provided by Ubuntu 16.04.