Hi all,
I’m seeing some very strange (and slow) behavior from a cudaMalloc call that allocates space for an array of floats: up to 10 seconds of waiting to allocate 300,000 floats (about 1.2 MB). I’ve isolated the call in a small test program (below).
Strangely, the test routinely takes 8-10 seconds on a Tesla C2050, but frequently completes almost instantly on a GTX 480, so the behavior seems to depend on the particular hardware being used.
There is no X server running on the Tesla machine, while the GTX 480 machine does run one. Both systems use CUDA 4.0 with the 270.41.19 driver on CentOS 5.5 (x86_64).
Is this a driver bug, or am I missing something undocumented about how cudaMalloc works?
Any suggestions would be much appreciated.
Sasha
#include <cstdio>
#include <iostream>

__global__ void kernel() {
    printf("kernel executed\n");
}

int main() {
    int size = 300000;
    float *result = new float[size];  // host-side buffer
    float *dev_result;

    std::cout << "Allocating memory..." << std::endl;
    cudaMalloc((void**)&dev_result, size * sizeof(float));
    std::cout << "Completed" << std::endl;

    kernel<<<1,1>>>();
    cudaDeviceSynchronize();  // flush the device-side printf

    cudaFree(dev_result);
    delete[] result;
}