About the overhead of the cudaMalloc() function

How much time does cudaMalloc() take to allocate memory?

Three A arrays of 1023 elements and three B arrays of 1024 elements are allocated as shown below:

int *a1,*a2,*a3;
int *b1,*b2,*b3;
int sizeA = 1023 * sizeof(int);
int sizeB = 1024 * sizeof(int);

cutStartTimer(timer);
CUDA_SAFE_CALL(cudaMalloc( (void**) &a1, sizeA));
CUDA_SAFE_CALL(cudaMalloc( (void**) &a2, sizeA));
CUDA_SAFE_CALL(cudaMalloc( (void**) &a3, sizeA));
cutStopTimer(timer);

printf("alloc time = %f", cutGetTimerValue(timer)); // alloc time = 0.016ms

cutStartTimer(timer);
CUDA_SAFE_CALL(cudaMalloc( (void**) &b1, sizeB));
CUDA_SAFE_CALL(cudaMalloc( (void**) &b2, sizeB));
CUDA_SAFE_CALL(cudaMalloc( (void**) &b3, sizeB));
cutStopTimer(timer);

printf("alloc time = %f", cutGetTimerValue(timer)); // alloc time = 0.825ms

Can anyone tell me what causes the difference between the allocation times for the 1023-element arrays and the 1024-element arrays?

First of all, you’re overwriting sizeA, so all the mallocs use the same size.

If this is just a typo in your post but not in your application, try calling cudaMalloc() in a loop of 10 to 100 iterations and divide the measured time by the number of iterations. This is more accurate, because the noise in a single measurement can lead to weird results…
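A rough sketch of that loop, in case it helps. This is not the original poster's code: it uses std::chrono instead of the cutil timer, and the helper name averageMallocMs and the iteration count are made up for illustration. It also makes one untimed runtime call first, on the assumption that the first CUDA call pays for context creation and would otherwise skew the first measurement.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): returns the average
// time in milliseconds of one cudaMalloc() call for `bytes` bytes,
// or -1.0 if an allocation fails.
static double averageMallocMs(size_t bytes, int iterations)
{
    using clock = std::chrono::steady_clock;
    double totalMs = 0.0;

    for (int i = 0; i < iterations; ++i) {
        void *ptr = nullptr;

        // Time only the allocation itself.
        auto start = clock::now();
        cudaError_t err = cudaMalloc(&ptr, bytes);
        auto stop = clock::now();

        if (err != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return -1.0;
        }
        totalMs += std::chrono::duration<double, std::milli>(stop - start).count();

        // Free outside the timed region so the loop does not exhaust memory.
        cudaFree(ptr);
    }
    return totalMs / iterations;
}

int main()
{
    const int iterations = 100;  // assumed value, within the 10 to 100 range suggested above

    // Untimed warm-up call so one-time runtime initialization
    // is not charged to the first measured allocation.
    cudaFree(0);

    printf("avg alloc time = %f ms (1023 ints)\n",
           averageMallocMs(1023 * sizeof(int), iterations));
    printf("avg alloc time = %f ms (1024 ints)\n",
           averageMallocMs(1024 * sizeof(int), iterations));
    return 0;
}
```

This should build with something like nvcc malloc_timing.cu -o malloc_timing (the file name is arbitrary). Since each block is freed before the next allocation, the allocator may reuse it, so the numbers reflect a warm allocator rather than a cold first allocation.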

Check out this thread.

Here’s the output of the code (modified to make it do what you want it to do :) ) in a Linux environment:

alloc time = 178.873993 (1023)
alloc time = 179.369995 (1024)

N.