Extremely slow cudaMalloc

Hi all,

I’m seeing some very strange (and slow) performance from a cudaMalloc call to allocate space for an array of floats. I get up to 10 seconds of waiting when allocating 300k floats. I’ve isolated the call in a separate piece of code that can be tested (below).

Strangely, the test routinely takes 8-10 seconds on a Tesla C2050, but frequently completes almost instantly on a GTX 480. The behavior seems to depend on the particular hardware being used.

There is no X server running on the Tesla system, but there is one on the GTX 480 system. Both systems use CUDA 4.0 with the 270.41.19 driver on CentOS 5.5 (x86_64).

Is it a bug? Is there an undocumented syntax issue?

Any suggestions would be much appreciated.



#include <cstdio>
#include <iostream>

__global__ void kernel(){
    printf("kernel executed\n");
}

int main(){
    int size = 300000;

    float *result = new float[size];
    float *dev_result;

    std::cout << "Allocating memory..." << std::endl;
    cudaMalloc((void**)&dev_result, size * sizeof(float));
    std::cout << "Completed" << std::endl;

    kernel<<<1,1>>>();
    cudaDeviceSynchronize();

    cudaFree(dev_result);
    delete [] result;
    return 0;
}
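In case it helps narrow things down: the delay may not be in cudaMalloc itself but in the lazy creation of the CUDA context on the first runtime call. A minimal sketch to separate the two costs, using the common cudaFree(0) idiom to force context creation up front (timing with std::clock here is an assumption for simplicity; cudaEvent or gettimeofday would also work):

```cuda
#include <cstdio>
#include <ctime>

int main(){
    std::clock_t t0 = std::clock();
    cudaFree(0);                       // forces CUDA context creation
    std::clock_t t1 = std::clock();

    float *dev_result;
    cudaMalloc((void**)&dev_result, 300000 * sizeof(float));
    std::clock_t t2 = std::clock();

    printf("context init: %.3f s\n", double(t1 - t0) / CLOCKS_PER_SEC);
    printf("cudaMalloc:   %.3f s\n", double(t2 - t1) / CLOCKS_PER_SEC);

    cudaFree(dev_result);
    return 0;
}
```

If nearly all of the time lands in the context-init line, the allocation itself is fine and the fix is at the driver level, not in the code.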


Enable persistence mode in the driver ( nvidia-smi -pm 1 ) or keep nvidia-smi running in a loop. Without it, on a headless node the driver unloads when no client is attached, and every run pays the full driver/context initialization cost.
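For reference, the two options sketched as commands (run as root; exact flags can vary by driver version, so check nvidia-smi --help on your system):

```shell
# Option 1: enable persistence mode so the driver stays loaded
# between runs (applies to all GPUs unless -i selects one).
nvidia-smi -pm 1

# Option 2: keep a client attached by polling nvidia-smi in a loop,
# here every 30 seconds in the background.
nvidia-smi -l 30 &
```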

ssh c0-4 "time ./a.out" (node with driver in non-persistent mode)
Allocating memory...
kernel executed

real 0m1.998s
user 0m0.003s
sys 0m1.916s

ssh c0-0 "time ./a.out" (node with driver in persistent mode)
Allocating memory...
kernel executed

real 0m0.140s
user 0m0.005s
sys 0m0.125s

Thanks for the advice! Enabling persistence mode did the trick in the production code.