Windows 7 vs Windows XP cudaMalloc performance difference

Hello Everyone,

I have a quick question about the performance of the cudaMalloc operation on a Windows 7 machine versus an XP machine.
For an allocation of about 3.5 KB, it takes about 1.4 ms on a Win7 machine and less than 0.1 ms on an XP machine.
Is it a known fact that cudaMalloc is a lot slower on a Windows 7 machine than on an XP machine? If so, what is the reason, and is there any possible workaround?

Any input is greatly appreciated.


Anyone?

I just got this from one of our users:

Any insight would be appreciated.

Thanks, Tom! I have been facing exactly the same problems. cudaThreadSynchronize() is ridiculously slow, as is cudaMalloc().

Is there any workaround for this?



This is expected due to the overhead of having to interact with the Windows display driver scheduler on Vista/Win7. TCC mode doesn’t have this performance impact.

Allocate a large block of device memory once, then manage it as a heap from your CPU program. This avoids paying the cudaMalloc latency and synchronization overhead on every allocation.
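To make the suggestion concrete, here is a rough host-side sketch of such a heap. It only hands out byte offsets inside one big block; the block itself would come from a single cudaMalloc() at startup, and a device pointer would then be the base pointer plus the offset. The class name OffsetHeap, the first-fit policy, and the 256-byte default alignment are my own assumptions for illustration, not anything from the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>

// Hypothetical helper (not part of CUDA): sub-allocates byte offsets
// from one large block that was obtained with a single cudaMalloc().
class OffsetHeap {
public:
    static constexpr std::size_t kInvalid = static_cast<std::size_t>(-1);

    // alignment must be a power of two; 256 is an assumed default.
    explicit OffsetHeap(std::size_t capacity, std::size_t alignment = 256)
        : capacity_(capacity), alignment_(alignment) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        if (capacity > 0) free_[0] = capacity;
    }

    // First-fit search; returns kInvalid if no free range is large enough.
    std::size_t allocate(std::size_t bytes) {
        if (bytes == 0 || bytes > capacity_) return kInvalid;
        // Round the request up to a multiple of the alignment.
        std::size_t size = (bytes + alignment_ - 1) & ~(alignment_ - 1);
        for (auto it = free_.begin(); it != free_.end(); ++it) {
            if (it->second < size) continue;
            std::size_t offset = it->first;
            std::size_t remaining = it->second - size;
            free_.erase(it);
            if (remaining > 0) free_[offset + size] = remaining;
            used_[offset] = size;
            return offset;
        }
        return kInvalid;
    }

    // Returns a range to the free list and merges it with free neighbours.
    void release(std::size_t offset) {
        auto u = used_.find(offset);
        if (u == used_.end()) return;  // unknown offset: ignore
        std::size_t size = u->second;
        used_.erase(u);
        auto next = free_.lower_bound(offset);
        if (next != free_.end() && offset + size == next->first) {
            size += next->second;      // merge with the following range
            next = free_.erase(next);
        }
        if (next != free_.begin()) {
            auto prev = std::prev(next);
            if (prev->first + prev->second == offset) {
                prev->second += size;  // merge with the preceding range
                return;
            }
        }
        free_[offset] = size;
    }

private:
    std::size_t capacity_;
    std::size_t alignment_;
    std::map<std::size_t, std::size_t> free_;  // offset -> size of free range
    std::map<std::size_t, std::size_t> used_;  // offset -> size of live range
};
```

In a real program you would call cudaMalloc() once for the whole capacity, keep the returned base pointer, and replace each per-buffer cudaMalloc()/cudaFree() pair with allocate()/release() on this object. None of these calls touch the driver, so they should not incur the per-call cost described above. This version is not thread-safe.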

Hmm, yeah, I am in the process of implementing that.

I had another question regarding CUDA and the OS. Does the OS have any role to play once the kernel is launched, as in thread scheduling, memory management, etc.? This may sound stupid, but the reason I am asking is that I have a kernel which takes about 700 ms on XP and 1.4 seconds on Win 7, and this is only the kernel execution time. I have a GTX 285 in both machines. This seems to be an issue only when different threads work on memory areas that are far apart.