Performance in basic algorithm: why isn't it faster?

LinkStrife · January 8, 2009, 8:39pm

Hi,

I’ve been trying to speed up a piece of code that takes 100000 floats and subtracts 1 from each one.

Comparing the GPU time and the CPU time, they are almost the same (including the time for cudaMemcpy).

GPU Time: 1.439671 ms

CPU Time: 1.16542 ms

MemCpy Time: 1.4 ms

GPU Time without Memcpy: 0.0381 ms

Why isn’t it faster when I include the cudaMemcpy (which has to be included in the time)?

Here is the code:

[codebox]
#include <stdio.h>
#include <stdlib.h>
#include <cutil.h>  // CUT_SAFE_CALL and the cut*Timer functions (CUDA SDK utility library)

// KERNEL: subtracts 1 from each element (despite the name)
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] - 1;
}

// MAIN
int main(void)
{
    float *a_h, *a_d;        // Pointers to host & device arrays
    const int num = 100000;  // Number of elements in arrays
    unsigned int hTimer;     // Timer handle
    size_t size = num * sizeof(float);

    a_h = (float *)malloc(size);       // Allocate array on host
    cudaMalloc((void **) &a_d, size);  // Allocate array on device
    CUT_SAFE_CALL( cutCreateTimer(&hTimer) );

    // Initialize host array and copy it to CUDA device
    for (int i = 0; i < num; i++)
    {
        a_h[i] = (float)i;
    }
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

    // Do calculation on device:
    int block_size = 4;
    int n_blocks = num/block_size + (num%block_size == 0 ? 0:1);
    CUT_SAFE_CALL( cutResetTimer(hTimer) );
    CUT_SAFE_CALL( cutStartTimer(hTimer) );
    square_array <<< n_blocks, block_size >>> (a_d, num);
    CUT_SAFE_CALL( cutStopTimer(hTimer) );

    // Retrieve result from device and store it in host array
    cudaMemcpy(a_h, a_d, sizeof(float)*num, cudaMemcpyDeviceToHost);
    printf("GPU time: %f msecs.\n", cutGetTimerValue(hTimer));
    getchar();

    // Print results
    for (int i = 0; i < num; i++)
    {
        printf("%d %f\n", i, a_h[i]);
    }
    getchar();

    // Time the CPU version (Filtro1CPU is not shown in the post)
    CUT_SAFE_CALL( cutResetTimer(hTimer) );
    CUT_SAFE_CALL( cutStartTimer(hTimer) );
    Filtro1CPU(a_h, num);
    CUT_SAFE_CALL( cutStopTimer(hTimer) );
    printf("CPU time: %f msecs.\n", cutGetTimerValue(hTimer));
    getchar();

    // Cleanup
    free(a_h); cudaFree(a_d);
    CUT_SAFE_CALL( cutDeleteTimer(hTimer) );
}
[/codebox]

I’d be very grateful if someone could help me.

Cya

tmurray · January 8, 2009, 9:36pm

You need to call cudaThreadSynchronize() after your kernel launch if you want to time things accurately. Kernel launches are asynchronous, so without it the timer stops before the kernel has finished.
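To illustrate the point about synchronizing before stopping the timer, here is a minimal sketch. It is not the original poster's code: the names `subtract_one`, `blocks_for` and `time_kernel_ms` are made up for this example, it uses `std::chrono` in place of the cutil timer, and it calls `cudaDeviceSynchronize()`, the newer name for the now-deprecated `cudaThreadSynchronize()`.

```cuda
#include <cassert>
#include <chrono>
#include <cuda_runtime.h>

// Same operation as the kernel in the first post: subtract 1 from each element.
__global__ void subtract_one(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] = a[idx] - 1.0f;
}

// Hypothetical helper: number of blocks needed to cover n elements
// (ceiling division).
int blocks_for(int n, int block_size)
{
    assert(n >= 0 && block_size > 0);
    return (n + block_size - 1) / block_size;
}

// Hypothetical helper: wall-clock milliseconds for one launch of subtract_one
// on the device array a_d, measured until the kernel has actually finished.
double time_kernel_ms(float *a_d, int n, int block_size)
{
    cudaDeviceSynchronize();  // do not count GPU work queued earlier

    auto start = std::chrono::steady_clock::now();
    // The launch only queues the kernel and returns to the host right away.
    subtract_one<<<blocks_for(n, block_size), block_size>>>(a_d, n);
    cudaDeviceSynchronize();  // block the host until the kernel is done
    auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A call such as `time_kernel_ms(a_d, 100000, 256)` after the host-to-device copy would give the kernel time; the block size of 256 is only an example value. The first launch in a program usually includes one-time startup cost, so timing a second call is likely to be more representative.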

LinkStrife · January 9, 2009, 12:20pm

Now the GPU takes more time to process!!

=(((((

Why does it happen?

Ty

cbuchner1 · January 9, 2009, 2:46pm

Kernel call overhead.

The GPU is only faster when it processes quite a large chunk of data at once. You also need to make sure you’re using the GPU’s capabilities fully. This means making sure memory access is coalesced and using shared memory for speed-up where possible.

In your particular kernel I’d say you are not doing enough work per thread to obtain a speed-up. A single addition or subtraction is not enough.

Try adding or subtracting 16 consecutive values per thread (no loops please - everything unrolled).

Christian
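One possible reading of the "16 values per thread" advice is sketched below. The names `subtract_one_multi`, `blocks_for_multi`, `launch_subtract_one_multi` and `ELEMS_PER_THREAD` are invented for this example. Two deliberate departures from the reply: the loop is written out but marked `#pragma unroll` with a compile-time trip count, so the compiler is asked to unroll it rather than doing it by hand, and each thread's 16 elements are spaced `blockDim.x` apart instead of being consecutive, so that neighbouring threads still read neighbouring addresses (the coalescing the same reply asks for).

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define ELEMS_PER_THREAD 16  // the count suggested in the reply

// Each thread updates ELEMS_PER_THREAD elements of a. Within one iteration k,
// thread t of a block touches element base_of_block + k * blockDim.x + t,
// so a warp accesses a contiguous run of floats.
__global__ void subtract_one_multi(float *a, int n)
{
    int base = blockIdx.x * blockDim.x * ELEMS_PER_THREAD + threadIdx.x;
#pragma unroll
    for (int k = 0; k < ELEMS_PER_THREAD; ++k) {
        int idx = base + k * blockDim.x;
        if (idx < n) a[idx] = a[idx] - 1.0f;  // guard the tail of the array
    }
}

// Hypothetical helper: blocks needed when every thread covers
// ELEMS_PER_THREAD elements (ceiling division).
int blocks_for_multi(int n, int block_size)
{
    assert(n >= 0 && block_size > 0);
    int per_block = block_size * ELEMS_PER_THREAD;
    return (n + per_block - 1) / per_block;
}

// Hypothetical launch wrapper for a device array a_d of n floats.
void launch_subtract_one_multi(float *a_d, int n, int block_size)
{
    subtract_one_multi<<<blocks_for_multi(n, block_size), block_size>>>(a_d, n);
}
```

With 100000 elements and a block size of 256 (an example value), this launches 25 blocks instead of the 25000 four-thread blocks in the first post. Whether that helps on a given card is something to measure rather than assume.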

LinkStrife · January 9, 2009, 5:24pm

I added like 200 values just to test and now it is a lot faster!!!

Thx!!!
