CUDA Memory Management problem introduced in 11.2

I updated my server to Ubuntu 20.04 last week and was forced to update from CUDA 10.2 to 11.2 in the process. Apparently this brought a new bug in the memory management. To sum it up, I use multiple models for CNN-based neural networks. Because of these problems I deactivated cuDNN, so I’m talking ONLY about CUDA 11.2-1. This version seems to allocate random amounts of memory to the layers O_O

For example:

  1. Model: 736x736 neural network, 104 Layers
  2. Model: 320x320 neural network, 73 Layers

Model 1 allocated (yesterday) 3.7 GB of memory, and Model 2 allocated 4.1 GB. With 10.2, Model 2 allocated about 1 GB. I debugged my code for more than 5 hours and cannot find any error on my part.
I tested it again today, and this time I get different memory numbers with exactly the same code. I also get different numbers on an RTX 3090 than on an RTX 2080.
I will post more detailed information later; I wanted to check these new effects first. But to me it looks like 11.2 cannot be trusted for my kind of application :(.

In the meantime, you should be able to install CUDA 10.2 alongside the Ubuntu apt-based install. Just use the runfile method (and skip the driver installation):

https://developer.nvidia.com/cuda-10.2-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal

If you want to switch between the two, just change the /usr/local/cuda symlink so that it points to the toolkit directory you want.
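A minimal sketch of that switch, in case it helps. It assumes /usr/local/cuda is a plain symlink and that the toolkits live in directories such as /usr/local/cuda-10.2 and /usr/local/cuda-11.2 (the usual defaults, but check your system). The function name `switch_cuda` is made up for this example, and on the real path it needs root.

```python
import os
from pathlib import Path


def switch_cuda(target: str, link: str = "/usr/local/cuda") -> str:
    """Repoint the symlink `link` at the toolkit directory `target`.

    Returns the resolved path the link points to afterwards.
    """
    target_path = Path(target)
    if not target_path.is_dir():
        raise FileNotFoundError(f"no such toolkit directory: {target}")

    link_path = Path(link)
    # Build the new symlink under a temporary name, then rename it over the
    # old one, so the link is never missing while the switch happens.
    tmp_path = link_path.with_name(link_path.name + ".tmp")
    if tmp_path.is_symlink() or tmp_path.exists():
        tmp_path.unlink()
    os.symlink(target_path, tmp_path)
    os.replace(tmp_path, link_path)
    return os.path.realpath(link_path)


if __name__ == "__main__":
    # Assumed default install location; adjust to your machine.
    print(switch_cuda("/usr/local/cuda-10.2"))
```

Anything that finds the toolkit through /usr/local/cuda (for example a PATH entry of /usr/local/cuda/bin) then picks up the selected version the next time it is started.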

One point to note is that CUDA 10.2 doesn’t have native support for the 30xx GPUs, so those depend on JIT compilation. I’m not experienced in CUDA DL approaches, so I can’t specifically comment on that.

Actually this seems more like a feature (?) than a bug. What I found is:

  1. CUDA 11.2 seems to use a very aggressive memory allocation algorithm. Basically, when I allocate my structure, usage goes up to 7 GB about halfway through, then falls to 1.7 GB, and then goes up again to 3.6 GB. I’m not sure what this is good for …

  2. Still, if I run the same code on different days, there is a small but noticeable difference in the memory allocation. So if usage is close to the memory limit, this produces the effect of crashing only sometimes.

In combination, these 2 effects got me really confused while debugging. I still think it is a bug, but at least the effect is “only” 200 MB for a structure of about 4 GB in total.
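For anyone who wants to capture both effects as numbers, here is a rough sketch that polls nvidia-smi while the model is being allocated and reports the peak against the final value. The helper names (`read_used_mib`, `sample`, `summarize`) are made up for this example. Keep in mind that `memory.used` is device-wide, so it also counts the CUDA context and any other process on the GPU, and a short spike between two polls can be missed.

```python
import subprocess
import time


def read_used_mib(gpu_index: int = 0) -> int:
    """Return the memory currently in use on one GPU, in MiB, via nvidia-smi."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "-i", str(gpu_index),
            "--query-gpu=memory.used",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return int(out.strip().splitlines()[0])


def sample(duration_s: float = 30.0, interval_s: float = 0.1, gpu_index: int = 0):
    """Poll the GPU for `duration_s` seconds and return the list of readings."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append(read_used_mib(gpu_index))
        time.sleep(interval_s)
    return samples


def summarize(samples_mib):
    """Reduce a trace to its peak, its final value, and the difference."""
    if not samples_mib:
        raise ValueError("no samples recorded")
    peak = max(samples_mib)
    final = samples_mib[-1]
    return {"peak_mib": peak, "final_mib": final, "overshoot_mib": peak - final}


if __name__ == "__main__":
    # Start this in a second terminal just before loading the model.
    print(summarize(sample()))
```

Saving the summary from each run makes the day-to-day difference easy to compare, and the peak (not the final value) is the number to hold against the card’s memory limit.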