Extremely slow cudaMalloc

Hi all,

I’m seeing some very strange (and slow) performance from a cudaMalloc call to allocate space for an array of floats. I get up to 10 seconds of waiting when allocating 300k floats. I’ve isolated the call in a separate piece of code that can be tested (below).

Strangely, the test routinely takes 8-10 seconds on a Tesla 2050, but frequently finishes almost instantly on a GTX480. The behavior seems to depend on the particular hardware being used.

There is no X server running on the Tesla system, but there is one on the GTX480 system. Both systems use CUDA 4.0 with the 270.41.19 driver on CentOS 5.5 (x86_64).

Is it a bug? Is there an undocumented syntax issue?

Any suggestions would be much appreciated.

Sasha

#include <cstdio>
#include <iostream>

__global__ void kernel() {
    // Device-side printf (supported on Fermi-class GPUs with CUDA 4.0).
    printf("kernel executed\n");
}

int main() {
    int size = 300000;

    float *result = new float[size];
    float *dev_result;

    std::cout << "Allocating memory..." << std::endl;
    cudaMalloc((void**)&dev_result, size * sizeof(float));
    std::cout << "Completed" << std::endl;

    kernel<<<1,1>>>();

    // cudaFree synchronizes with the device, so the kernel's printf output is flushed here.
    cudaFree(dev_result);
    delete [] result;

    return 0;
}

Enable persistence mode for the driver (nvidia-smi -pm 1, needs root) or keep nvidia-smi running in a loop. Without persistence mode the driver is unloaded whenever no client is attached to the GPU, so every run pays the full driver/GPU initialization cost again.
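For example, something along these lines (the loop interval is arbitrary; it just keeps a client attached so the driver stays loaded):

# enable persistence mode (does not survive a reboot)
nvidia-smi -pm 1

# alternative workaround: keep nvidia-smi running in a loop in the background
while true; do nvidia-smi > /dev/null; sleep 5; done &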

ssh c0-4 "time ./a.out" (node with driver in non-persistent mode)
Allocating memory...
Completed
kernel executed

real 0m1.998s
user 0m0.003s
sys 0m1.916s

ssh c0-0 "time ./a.out" (node with driver in persistent mode)
Allocating memory...
Completed
kernel executed

real 0m0.140s
user 0m0.005s
sys 0m0.125s
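If you want to confirm where the time is going, the slow part is almost certainly the driver/context initialization triggered by the first CUDA call, not the allocation itself. A minimal sketch (the wallTime helper is just illustrative, not from the original program): calling cudaFree(0) first forces context creation, so the subsequent cudaMalloc can be timed on its own.

#include <iostream>
#include <sys/time.h>
#include <cuda_runtime.h>

// Illustrative wall-clock timer, not part of the original test program.
static double wallTime() {
    timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main() {
    int size = 300000;
    float *dev_result;

    double t0 = wallTime();
    cudaFree(0);                       // forces context creation / GPU initialization
    double t1 = wallTime();
    cudaMalloc((void**)&dev_result, size * sizeof(float));
    double t2 = wallTime();

    std::cout << "Context creation: " << (t1 - t0) << " s" << std::endl;
    std::cout << "cudaMalloc:       " << (t2 - t1) << " s" << std::endl;

    cudaFree(dev_result);
    return 0;
}

With persistence mode enabled (or an X server keeping the driver loaded, as on the GTX480 machine), the context-creation time should drop to a fraction of a second, as in the 0.14 s run above.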

Thanks for the advice; persistence mode did the trick in the production code.