I have a Windows 10 x64 system with two GPUs (GeForce GTX 960, GeForce GTX 770). I investigated a significant slowdown of our GPU-accelerated software after a driver update from Forceware 347.XX to the newest driver, Forceware 372.70.
Profiling revealed that in newer drivers (some version > 350.12 and <= 353.49) the ‘cudaMallocPitch’ routine (and I suppose also the ‘cudaMalloc’ routine) got slower by a significant factor, which grows with the size of the allocation. The ‘cudaFree’ routine also got slower for big buffers, but not as much as cudaMallocPitch.
Below are some measured runtime numbers for my GTX 960 (the GTX 770 shows the same behaviour). CUDA Toolkit 7.0 and Visual Studio 2013 (64-bit) are used. The 372.70 Windows 10 x64 driver was taken from the NVIDIA website, whereas for 350.12 I took the Windows 7/8 x64 driver (!) from http://www.guru3d.com/files-details/geforce-350-12-whql-driver-download.html. It can also be installed on my Windows 10 x64 system and seems to work fine (but does not support Pascal-generation cards). All times are in milliseconds (ms), for the ‘Release’ configuration of the Visual Studio project.
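For reference, the timings were obtained with a simple host-side timer around the allocation calls. A minimal sketch of such a measurement (the exact image sizes and the warm-up call are my choices here, not taken from our actual application code):

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Time cudaMallocPitch and cudaFree for one buffer size using a
// host-side high-resolution clock.
static void measureAlloc(size_t widthBytes, size_t height)
{
    void  *d     = nullptr;
    size_t pitch = 0;

    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMallocPitch(&d, &pitch, widthBytes, height);
    auto t1 = std::chrono::high_resolution_clock::now();
    cudaFree(d);
    auto t2 = std::chrono::high_resolution_clock::now();

    double allocMs = std::chrono::duration<double, std::milli>(t1 - t0).count();
    double freeMs  = std::chrono::duration<double, std::milli>(t2 - t1).count();
    printf("%6zu MB: cudaMallocPitch %.2f ms, cudaFree %.2f ms\n",
           widthBytes * height / (1024 * 1024), allocMs, freeMs);
}

int main()
{
    cudaFree(0);  // warm-up: force context creation before timing

    // roughly 1 MB, 20 MB and 400 MB images (4-byte pixels)
    measureAlloc(1024 * 4, 256);
    measureAlloc(2048 * 4, 2560);
    measureAlloc(10240 * 4, 10240);
    return 0;
}
```

Note that the warm-up call matters: without it, the first measurement would include the CUDA context creation overhead mentioned further below.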
– 350.12 driver –
cudaMallocPitch for an image of size 1 MB / 20 MB / 400 MB: 0.6 ms / 0.3 ms / 0.4 ms
cudaFree for an image of size 1 MB / 20 MB / 400 MB: 0.1 ms / 0.4 ms / 1.2 ms
– 372.70 driver –
cudaMallocPitch for an image of size 1 MB / 20 MB / 400 MB: 0.5 ms / 1.5 ms / 9 ms
cudaFree for an image of size 1 MB / 20 MB / 400 MB: 0.2 ms / 0.5 ms / 2 ms
One can see that with the new driver, cudaMallocPitch got slower by a factor of 5 to 20 (!) for images in the range between 20 and 400 MB, whereas with the old driver, cudaMallocPitch always takes a roughly constant amount of time, regardless of the size of the allocated buffer.
I also made experiments with other drivers (downloaded from guru3d): the slowdown seems to occur at least since driver version 353.49 (I took the Windows 10 x64 driver from http://www.guru3d.com/files-details/geforce-353-49-hotfix-driver-download.html). Actually, the slowdown for this driver version, and also for driver version 355.82, is even much worse than for the 372.70 driver, so it seems this issue has already been partially addressed. I couldn’t install driver version 352.86, so I don’t know whether the slowdown is already present in that version.
Unfortunately, this leaves us in a complicated situation, because we can either use an older driver (meant for Windows 7/8 (!)) which does not support Pascal cards, or a newer driver which supports Pascal but where the slower cudaMalloc routine eats up a significant part of the speedup gained from GPU acceleration …
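As a partial workaround on our side, one option is to allocate the large buffers once and reuse them across frames, so the slow allocation cost is paid only once instead of per image. A minimal sketch of such a reuse scheme (the `PitchedBuffer` name and interface are hypothetical, just to illustrate the idea):

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical reusable pitched buffer: cudaMallocPitch is only
// called when a larger image than ever seen before is requested,
// so the per-frame allocation cost disappears in steady state.
struct PitchedBuffer
{
    void  *ptr   = nullptr;
    size_t pitch = 0;
    size_t maxW  = 0;   // largest row size (in bytes) reserved so far
    size_t maxH  = 0;   // largest height reserved so far

    // Returns device memory holding at least widthBytes x height.
    void *acquire(size_t widthBytes, size_t height)
    {
        if (widthBytes > maxW || height > maxH) {
            cudaFree(ptr);  // release the old, too-small buffer
            maxW = (widthBytes > maxW) ? widthBytes : maxW;
            maxH = (height > maxH) ? height : maxH;
            cudaMallocPitch(&ptr, &pitch, maxW, maxH);
        }
        return ptr;
    }

    ~PitchedBuffer() { cudaFree(ptr); }
};
```

This only helps when buffer sizes are stable or bounded; it does not fix the driver-side regression itself.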
Note that this seems to have been reported in another posting as well; see the runtime numbers in thread https://devtalk.nvidia.com/default/topic/831150/cuda-programming-and-performance/titan-x-with-latest-drivers-slower-than-titan-black-with-older-drivers/2
Additional note: the CUDA context creation overhead seems to have decreased significantly between driver versions 350.12 and 372.70 (2.2 seconds on the older driver vs. 250 milliseconds on the newest one). I wonder whether there is some relation between this observation and the slowdown of the allocation routines.
Note: the issue also seems to occur on Windows 7 x64 systems, and I think also on Quadro K6000 cards (but I’m not 100% sure).