How can I calculate the correct number of blocks/threads for a variable vector size?

How can I calculate the correct number of blocks and threads for my problem, with a variable vector size, for the best performance?

Right now I use the fixed values 256 & 256, but the performance is very slow. Is it necessary to indicate the shared memory size?

How can I reach the best performance for every vector size? I have a G80 GPU, but I don't understand the "CUDA GPU Occupancy Calculator".

When I set both 256 values to 1, I get my best performance, but it is still slower than the CPU execution.

What is my mistake?

Please help me!

Caller code (note: the kernel invoked here should match the kernel defined below):

VectorSubtraction<<<256, 256>>>( vector1_gpu, vector2_gpu, result_gpu, size );

Kernel code:

__global__ void VectorSubtraction(float* vector1, float* vector2, float* result, int size)
{
	// Global thread index and total number of threads in the grid
	const int tid = blockDim.x * blockIdx.x + threadIdx.x;
	const int THREAD_N = blockDim.x * gridDim.x;

	// Grid-stride loop: each thread handles every THREAD_N-th element
	for (int opt = tid; opt < size; opt += THREAD_N)
		result[opt] = vector1[opt] - vector2[opt];
}


You have to try different sizes and choose the one with the best performance. There's no easy way to find the optimal parameters.

For your kernel you know its resource usage (registers and shared memory), and this gives you an upper limit on block size. Generally you should maximize the occupancy of your kernel, but this doesn't guarantee optimal performance. For example, to hide memory latency you may try reducing the block size to allow more blocks per multiprocessor.

By the way, with such a simple kernel you won't get good performance, because you are limited by memory bandwidth.

But it can't be right that I get my best performance with a grid size of 1 and a block size of 1.

There is a fault somewhere. But where…

Please help me!

How do you measure the timings of your kernels? Do you use cudaThreadSynchronize()?


CUT_SAFE_CALL( cutCreateTimer(&hTimer) );

CUT_SAFE_CALL( cutResetTimer(hTimer) );

CUDA_SAFE_CALL( cudaThreadSynchronize() );

CUT_SAFE_CALL( cutStartTimer(hTimer) );


VectorSubtraction<<< DimGrid, DimBlock, SharedMemBytes>>>( vector1_gpu, vector2_gpu, result_gpu, size);


CUDA_SAFE_CALL( cudaThreadSynchronize() );

CUT_SAFE_CALL( cutStopTimer(hTimer) );

printf("GPU time: %f (s).\n", cutGetTimerValue(hTimer)*0.001);

CUT_SAFE_CALL( cutDeleteTimer(hTimer) );

That's strange. Can you post some of the performance data that you've got?
Try running blocks of 256 threads and a grid of 1024 or more blocks.

Thanks for your help. I'll post the results on Monday afternoon.


After a few tests I think my kernel code is at fault… the for loop produces the memory traffic… I'll refactor my kernels, then test and measure again with correct kernels…

Thanks, and bye for now.

I hope to see you soon!