cudaMalloc(Pitch) _significantly_ slower on windows with Geforce drivers > 350.12

I have a windows 10 x64 system, with two GPUs (Geforce 960, Geforce 770). I investigated a significant slowdown of our GPU-accelerated software after a driver update from Forceware 347.XX to the newest driver, Forceware 372.70.
Profiling revealed that the issue is that in newer drivers (in some driver version > 350.12 and <= 353.49 the ‘cudaMallocPitch’ (and I suppose also the cudaMalloc routine) got slower by a significant factor, which grows with the size of the allocation. The ‘cudaFree’ routine also got slower for big buffers, but not that much than the cudaMallocPitch function.

In the following some measure runtime numbers for my GTX 960 (GTX 770 shows the same behaviour). Cuda Toolkit 7.0 is used and Visual Studio 2013 64-bit. The 372.70 Windows 10 x64 driver was taken from the NVIDIA website, whereas for 350.12 I took the Windows 7/8 driver x64 (!) from GeForce 350.12 WHQL driver download. It can be installed also my Windows 10 x64 system, and seems to works fine (but does not support Pascal generation cards). All times are in milliseconds (ms), for ‘Release’ configuration of Visual studio project.

– 350.12 driver –
cudaMallocPitch for a image of size 1 MB / 20 MB / 400 MB : 0.6 ms / 0.3 ms / 0.4 ms
cudaFree for a image of size 1 MB / 20 MB / 400 MB: 0.1 ms / 0.4 ms / 1.2 ms
– 372.70 driver –
cudaMallocPitch for a image of size 1 MB / 20 MB / 400 MB : 0.5 ms / 1.5 ms / 9 ms
cudaFree for a image of size 1 MB / 20 MB / 400 MB: 0.2 ms / 0.5 ms / 2 ms

One can see with the new driver that cudaMallocPitch got slower by a factor of 5 - 20 (!) for images in the range between 20 and 400 MB. Whereas for the old driver, the cudaMallocPitch always roughly takes a constant amount of time, regardless of the size of the allocated buffer.

I made also experiments with other drivers (downloaded from guru3d): The slowdown seems to occur at least since driver version 353.49 (I took the windows 10 x64 driver from GeForce 353.49 Hotfix driver download). Actually, the slowdown for this driver version, and also for driver version 355.82, is even much worse than for the 372.70 driver, so it seems that this issue has already been partially adressed. I couldn’t install driver version 352.86, so I don’t know whether the slowdown is already in that version.

Unfortunately, this leaves us in a complicated situation, cause we either can use an older driver (meant for Windows 7/8 (!)) which does not support Pascal cards, or a newer driver which supports Pascal but where the slower cudaMalloc routine eats up a significant part of the speedup due to GPU acceleration …

Note this seems to have been reported also inother posting, see the runtime numbers in thread
Additional note: The cuda context overhead seems to have decreased significantly between driver version 350.12 and 372.70 (2.2 seconds on older driver vs. 250 milliseconds on newest driver) - wondering whether there is some relation between this observation and the slowdown of the allocation routines.

Note: The issue seems to occur also on windows 7 x64 system, and I think also on Quadro K6000 cards (but not 100% sure)

Based on your data, consider filing a bug report with NVIDIA.

I wonder whether this may have something to do with the support for demand-paging added for Pascal: the increase in processing time with increasing allocation size could point to calls to the operating system for mapping pages, and the more pages get mapped the slower the call. That is just a hypothesis, of course.

I will consider a bug report, yep.

The hypothesis could be a possible explanation. If so, It would be nice (RFE) to have a function in the CUDA toolkit to globally disable in CUDA the whole page mapping stuff for the execution of a application (e.g. by calling a certain CUDA API fn. at the begin of the program), even if then some functionality (unified memory, on-demand paging) is not available anymore. I don’t use unified memory, so I would not lose something. I just want to have the old (good) runtime of cudaMallocPitch back …

Update: A bug report was filed to NVIDIA.

I made additionally some tests on other systems regarding the significant performance regression in the cudaMalloc(Pitch) function and the (less significant) performance regression in the cudaFree function.

On a Windows 7 x64 system with Quadro K6000 and with the ForceWare 347.62, the cudaMallocPitch function runtime is OK (equal to the Geforce cards runtime with a driver <= 350.12). After installation of the Forceware driver version 369.26 (ODE), I get the same bad runtime results as for the Geforce cards with driver version 372.20 (even slightly worse, cudaMallocPitch takes ~ 12 ms for a 400 MB image - maybe because the K6000 has more RAM than the 770).

On a Windows 8.1 Pro x64 system with Tesla K40c and with the 348.40 driver (TCC mode), the cudaMallocPitch runtime is OK (equal to Geforce cards with driver <= 350.12). Also the Forceware driver 354.99 for the Tesla K40c works fine, no performance regression with the Tesla. I couldn’t get newer drivers for Tesla K40c from the NVIDIA website.

Conclusion for me is that it seems to be a Geforce & Quadro issue, at least for now Tesla seems not to be affected.

Sounds like you’re hitting the issue with WDDM 2.0 in Windows 10.

“downgrade Win10 drivers to 347 (and build against CUDA 6.5 to ensure driver support). This will throw you back to WDDM 1.x, with more TLB-friendly page sizes.”

For Pascal cards, this can be fixed using Windows 10 Anniversary Update + latest Nvidia drivers
+ Windows 10 update history

As of a month ago, this didn’t fix the issue on non-Pascal cards.

I know that some Quadro GPUs support the TCC driver mode, have you tried that for the K6000?

Putting the Geforce Titan X (either Maxwell or Pascal) in TCC driver mode works wonders for my Windows 7 and Windows 8.1 systems.

I assume that for Windows 10 one can put these in TCC driver mode as well, and I also assume that WDDM 2.0 will not interfere with the TCC driver mode. Can anyone verify this who has a Titan X using Windows 10?

Unfortunately, with the 369.29 (ODE) driver for Quadro K6000, there is no ‘nvidia-smi.exe’ more installed - I searched for it. There is a ‘nvidia-smi.ex_’ - I tried out renaming it and starting it, but it complained that it doesn’t match the OS (Windows 7 x64). So I can’t switch the Quadro K6000 to TCC … I think TCC would give me also for the Quadro K6000 the better runtime of the ‘cudaMalloc(Pitch’) function, additionally one would get the benefit of the significantly faster (compared with WDDM) kernel launch times.

Don’t think that your referenced forum posting describes the same issue. As far as I understand, the referenced posting describe an issue with accessing global memory, whereas the issue described my me refers to the allocation of global memory.

For anyone trying to experiment on his system, here is my simple reproducer program.

// TEST program to measure the runtiem of 'cudaMalloc(Pitch)' 
// See the posting at

#include <stdio.h>
#include <stdlib.h>

#include <iostream>
#include <chrono>

#include <cuda.h>
#include <cuda_runtime_api.h>

#include <windows.h>

double PCFreq = 0.0;
__int64 CounterStart = 0;

void StartCounter()
	if (!QueryPerformanceFrequency(&li))
		std::cout << "QueryPerformanceFrequency failed!\n";

	PCFreq = double(li.QuadPart) / 1000.0;

	CounterStart = li.QuadPart;
double GetCounter()
	return double(li.QuadPart - CounterStart) / PCFreq;

int main(int argc, char *argv[])
	// width and height of image 'k', in byte -> for an image with ~ 1 MB, 20 MB and 400 MB
	int width[3] = { 1000, 6000, 12000 };
	int height[3] = { 1000, 3200, 32000 };

	// warmup Cuda context 
	void* dummy1;
	size_t dummy2;
	cudaMallocPitch(&dummy1, &dummy2, 1000, 1000);

	// now measure the runtime for a cudaMallocPitch of an image with ~ 1 MB, 20 MB and 400 MB
	for (int k = 0; k < 3; ++k) {
		void* bufferPtr = 0;
		size_t pitch = 0;
		cudaMallocPitch(&bufferPtr, &pitch, width[k], height[k]);
		float runtimeMs = GetCounter();		
		std::cout << "\n Allocation of one image with " << (pitch * height[k] / 1e6) << " Megabytes takes " << runtimeMs << " milliseconds \n";

	std::cout << "\n\n Hit a key to continue ...";
	int xxx;
	std::cin >> xxx;
	return 0;

That’s very odd. The *.ex_ suggests that the installer possibly failed before the file was uncompressed and installed as a proper *.exe. I have a Quadro here, but am still on the 368.86 Windows 64 driver.

If your observation is not the result of a corrupted installation, I would think it would be due to an improperly packaged driver, which seems very unlikely but not impossible. nvidia-smi has been shipping for a long time and is an indispensable basic GPU admin tool, I don’t think NVIDIA would ever remove it from the packages on purpose.

@njuffa: I was wrong, I was searching in the wrong place. I found the ‘nvidia-smi’ tool and switched one of the two Quadro K6000 to TCC mode. The other one drives the display and is therefore is stuck on WDDM driver.

– GPU 0 (Quadro K6000 in TCC mode, 369.26 driver) –
cudaMallocPitch for a image of size 1 MB / 20 MB / 400 MB : 0.1 ms / 0.2 ms / 0.4 ms
– GPU 1 (Quadro K6000 in WDDM mode, 369.26 driver) –
cudaMallocPitch for a image of size 1 MB / 20 MB / 400 MB : 0.4 ms / 2 ms / 37 ms

Thats a dramatical difference in runtime for ‘bigger’ allocations.

Update: according to information from nvidia, unfortunately it looks like this bug will not be fixed (for various reasons, one of them is that it seems to be really difficult to fix that issue).

Thanks for the update. That’s too bad. I would hope that Microsoft eventually sees the error of their ways and introduces a graphics driver model that is supportive of high-performance implementations.