How can I calculate the correct number of blocks/threads for a variable vector size?

How can I calculate the correct number of blocks and threads for my problem, with a variable vector size, for the best performance?

Right now I use the fixed values 256 & 256, but the performance is very slow. Is it necessary to indicate the shared memory size?

How can I reach the best performance for every vector size? I have a G80 GPU, but I don't understand the "CUDA GPU Occupancy Calculator".

When I set both 256 values to 1, I get my best performance, but it is still slower than the CPU execution.

What is my mistake?

Please help me!

Caller code (note: the kernel invoked here should match the kernel defined below):

VectorSubtraction<<<256, 256>>>( vector1_gpu, vector2_gpu, result_gpu, size );

Kernel code:

__global__ void VectorSubtraction(float* vector1, float* vector2, float* result, int size)
{
	// Global thread index and total number of threads in the grid
	const int tid = blockDim.x * blockIdx.x + threadIdx.x;
	const int THREAD_N = blockDim.x * gridDim.x;

	// Grid-stride loop: each thread handles every THREAD_N-th element
	for (int opt = tid; opt < size; opt += THREAD_N)
		result[opt] = vector1[opt] - vector2[opt];
}


You have to try different sizes and choose the one with the best performance. There's no easy way to find the optimal parameters.

For your kernel you know its resource usage (registers and shared memory), and this gives you an upper limit on block size. Generally you should maximize the occupancy of your kernel, but this doesn't guarantee optimal performance. For example, to hide memory latency you may try reducing the block size to allow more blocks per multiprocessor.

By the way, with such a simple kernel you won't get good performance, because you are limited by memory bandwidth.

But it can't be right that I get my best performance with a grid size of 1 and a block size of 1.

There is a fault somewhere. But where…

Please help me!

How do you measure the timings of your kernels? Do you use cudaThreadSynchronize()?


CUT_SAFE_CALL( cutCreateTimer(&hTimer) );

CUT_SAFE_CALL( cutResetTimer(hTimer) );

CUDA_SAFE_CALL( cudaThreadSynchronize() );

CUT_SAFE_CALL( cutStartTimer(hTimer) );


VectorSubtraction<<< DimGrid, DimBlock, SharedMemBytes>>>( vector1_gpu, vector2_gpu, result_gpu, size);


CUDA_SAFE_CALL( cudaThreadSynchronize() );

CUT_SAFE_CALL( cutStopTimer(hTimer) );

printf("GPU time: %f (s).\n", cutGetTimerValue(hTimer)*0.001);

CUT_SAFE_CALL( cutDeleteTimer(hTimer) );

That's strange. Can you post some of the performance data that you've got?
Try running blocks of 256 threads and a grid of 1024 or more blocks.

Thanks for your help. I'll post the results on Monday afternoon.


After a few tests I think my kernel code is at fault… the for loop produces the memory traffic… I'll refactor my kernels, then test and measure again with correct kernels…

Thanks, and bye for now.

I hope to see you soon!