Exceptionally slow cudaMalloc() after upgrading to driver version 384.66 on Linux

After upgrading to version 384.66 of the NVIDIA Linux kernel driver (kernel 3.10.0-693.2.2.el7.x86_64 on RHEL 7.4), we observe cudaMalloc() latency that scales roughly linearly with the requested allocation size. For example, we use the program at http://lpaste.net/2039592059479785472 (a minimal sketch of this kind of benchmark is included at the end of this post) to measure how long it takes to allocate memory blocks of various sizes. On a machine with a GTX 780 Ti and driver version 384.66, the last three lines of output are the following:

134217728 bytes; average: 8144us; min: 4120us; max: 8208us
268435456 bytes; average: 16248us; min: 8201us; max: 16346us
536870912 bytes; average: 32461us; min: 16390us; max: 32738us

On an otherwise identical machine running driver version 375.66, we obtain:

134217728 bytes; average: 254us; min: 248us; max: 301us
268435456 bytes; average: 372us; min: 356us; max: 426us
536870912 bytes; average: 600us; min: 587us; max: 645us

While allocation latency still grows mildly with size (which is understandable), allocations run much faster overall. We observe the same slowdown in other (non-contrived) programs, to the point that much of our CUDA code is now bottlenecked by the time it takes to allocate memory.
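
For reference, a minimal sketch of this kind of benchmark follows. It is not the exact program from the paste above (only that link is authoritative); it just times cudaMalloc() for doubling block sizes with std::chrono and reports the average, minimum, and maximum in microseconds:

// Minimal sketch of the benchmark described above; the exact program
// is at the lpaste link, this is only an illustration.
// Compile with: nvcc -O2 malloc_bench.cu -o malloc_bench
#include <cstdio>
#include <climits>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    cudaFree(0);  // force CUDA context creation so it is not timed below
    const int iters = 10;
    for (size_t bytes = (size_t)1 << 20; bytes <= (size_t)512 << 20; bytes <<= 1) {
        long long total = 0, mn = LLONG_MAX, mx = 0;
        for (int i = 0; i < iters; ++i) {
            void *p = nullptr;
            auto t0 = std::chrono::steady_clock::now();
            cudaMalloc(&p, bytes);
            auto t1 = std::chrono::steady_clock::now();
            cudaFree(p);  // freeing is not included in the measurement
            long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
            total += us;
            if (us < mn) mn = us;
            if (us > mx) mx = us;
        }
        printf("%zu bytes; average: %lldus; min: %lldus; max: %lldus\n",
               bytes, total / iters, mn, mx);
    }
    return 0;
}

The cudaFree(0) warm-up matters because the first CUDA runtime call creates the context, which would otherwise be billed to the first cudaMalloc().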

I’ve seen this as well. I believe it is expected behavior.

If you don’t like it, my suggestion would be to file a bug at developer.nvidia.com and request that it be investigated or improved.

Good suggestion. I have filed it as a bug. Surely I can't be the first one who is annoyed by this? Sure, it's always good practice to keep allocation to a minimum, but over 30 ms to allocate 512 MiB seems incredibly slow.
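
In the meantime, the workaround is the usual one: allocate device memory once up front and reuse it. Here is a minimal sketch of a grow-only scratch buffer (the ScratchBuffer name and its grow-only policy are just illustrative, not from any particular library):

// Sketch of a grow-only device scratch buffer, assuming the workload
// can reuse a single allocation across iterations.
#include <cstddef>
#include <cuda_runtime.h>

struct ScratchBuffer {
    void  *ptr   = nullptr;
    size_t bytes = 0;

    // Return a device buffer of at least `request` bytes, re-allocating
    // only when the request exceeds the current capacity.
    void *get(size_t request) {
        if (request > bytes) {
            cudaFree(ptr);  // no-op when ptr is nullptr
            ptr = nullptr;
            bytes = 0;
            if (cudaMalloc(&ptr, request) == cudaSuccess)
                bytes = request;
        }
        return ptr;
    }

    ~ScratchBuffer() { cudaFree(ptr); }
};

With this, the cudaMalloc() cost is paid once at startup (or when the buffer grows), rather than on every iteration.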