Performance of a basic algorithm: why isn't the GPU faster?


I’ve been trying to speed up a piece of code that takes 100000 floats and subtracts 1 from each one.

Comparing the GPU time and the CPU time, they are almost the same (when the time of cudaMemcpy is included).

GPU Time: 1.439671 ms

CPU Time: 1.16542 ms

MemCpy Time: 1.4 ms

GPU Time without Memcpy: 0.0381 ms

Why isn’t it faster when I include the cudaMemcpy (which has to be included in the time)?

Here is the code:

[codebox]
#include <stdio.h>
#include <stdlib.h>
#include <cutil.h>   // CUT_SAFE_CALL and the cut*Timer helpers from the CUDA SDK

// KERNEL: subtracts 1 from each element of a
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] - 1;
}

int main(void)
{
    float *a_h, *a_d;        // Pointers to host & device arrays
    const int num = 100000;  // Number of elements in arrays
    unsigned int hTimer;     // Timer handle
    size_t size = num * sizeof(float);

    a_h = (float *)malloc(size);       // Allocate array on host
    cudaMalloc((void **) &a_d, size);  // Allocate array on device
    CUT_SAFE_CALL( cutCreateTimer(&hTimer) );

    // Initialize host array and copy it to CUDA device
    for (int i = 0; i < num; i++)
    {
        a_h[i] = (float)i;
    }
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

    // Do calculation on device:
    int block_size = 4;
    int n_blocks = num/block_size + (num%block_size == 0 ? 0:1);

    CUT_SAFE_CALL( cutResetTimer(hTimer) );
    CUT_SAFE_CALL( cutStartTimer(hTimer) );
    square_array <<< n_blocks, block_size >>> (a_d, num);
    CUT_SAFE_CALL( cutStopTimer(hTimer) );

    // Retrieve result from device and store it in host array
    cudaMemcpy(a_h, a_d, sizeof(float)*num, cudaMemcpyDeviceToHost);
    printf("GPU time: %f msecs.\n", cutGetTimerValue(hTimer));

    // Print results
    for (int i = 0; i < num; i++)
    {
        printf("%d %f\n", i, a_h[i]);
    }

    CUT_SAFE_CALL( cutResetTimer(hTimer) );
    CUT_SAFE_CALL( cutStartTimer(hTimer) );
    // (CPU computation to be timed goes here)
    CUT_SAFE_CALL( cutStopTimer(hTimer) );
    printf("CPU time: %f msecs.\n", cutGetTimerValue(hTimer));

    // Cleanup
    free(a_h); cudaFree(a_d);
    CUT_SAFE_CALL( cutDeleteTimer(hTimer) );
    return 0;
}
[/codebox]

I’ll be very grateful if someone can help me.


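You need to call cudaThreadSynchronize() after your kernel call if you want to time things accurately. Kernel launches are asynchronous, so without it the timer only measures the launch itself, not the kernel's execution.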
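A minimal sketch of timing a kernel with a synchronization call, under a few assumptions: it targets a recent CUDA runtime, where cudaDeviceSynchronize() is the current name for the deprecated cudaThreadSynchronize(), and it uses std::chrono instead of the cutil timers from the question. The names subtract_one and time_kernel_ms and the block size of 256 are illustrative, not taken from the post.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Same operation as the kernel in the question: subtract 1 from each element.
__global__ void subtract_one(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] = a[idx] - 1.0f;
}

// Returns the wall-clock time of one kernel run in milliseconds.
// a_d must point to n floats already allocated on the device.
double time_kernel_ms(float *a_d, int n)
{
    const int block_size = 256;  // assumed value, not from the post
    const int n_blocks = (n + block_size - 1) / block_size;

    cudaDeviceSynchronize();  // make sure earlier GPU work is finished
    auto start = std::chrono::steady_clock::now();

    subtract_one<<<n_blocks, block_size>>>(a_d, n);  // returns immediately
    cudaDeviceSynchronize();  // wait until the kernel has actually finished

    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Removing the second cudaDeviceSynchronize() should reproduce the very small "GPU time" from the question, because the timer would then stop before the kernel has run to completion.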

Now the GPU takes more time to process!!


Why does it happen?


Kernel call overhead.

The GPU is only faster when it processes quite a large chunk of data at once. Also, you need to make sure that you’re using the GPU’s capabilities fully. This means making sure memory access is coalesced and that you use shared memory for speed-up where


In your particular kernel, I’d say you are not doing enough work per thread to obtain a speed-up. A single addition or subtraction is not enough.

Try adding or subtracting 16 consecutive values per thread (no loops please - everything



I added like 200 values just to test and now it is a lot faster!!!
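The "more work per thread" suggestion can be sketched roughly as follows. The reply's wording about avoiding loops is cut off, so this version is an assumption: it uses a fixed-count loop with #pragma unroll, which lets the compiler unroll it, rather than 16 hand-written statements. The names subtract_one_x16, blocks_needed, and ITEMS_PER_THREAD are hypothetical.

```cuda
#include <cuda_runtime.h>

#define ITEMS_PER_THREAD 16  // the reply suggests 16 values per thread

// Each thread subtracts 1 from ITEMS_PER_THREAD consecutive elements.
__global__ void subtract_one_x16(float *a, int n)
{
    int first = (blockIdx.x * blockDim.x + threadIdx.x) * ITEMS_PER_THREAD;

    // Trip count is a compile-time constant, so the compiler can unroll this.
    #pragma unroll
    for (int k = 0; k < ITEMS_PER_THREAD; ++k) {
        int idx = first + k;
        if (idx < n) a[idx] = a[idx] - 1.0f;  // guard the tail of the array
    }
}

// Host helper: number of blocks needed so that every element is covered.
int blocks_needed(int n, int block_size)
{
    int per_block = block_size * ITEMS_PER_THREAD;
    return (n + per_block - 1) / per_block;
}
```

A launch would then look like `subtract_one_x16<<<blocks_needed(num, 256), 256>>>(a_d, num);`. Note that with consecutive elements per thread, neighbouring threads touch addresses 16 floats apart, which works against the coalescing mentioned earlier in the thread; having each thread step through the array with a stride of the total thread count is a common alternative.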