Hello everyone,
I have a general question about OpenCL Performance.
Imagine 1,000 threads run concurrently on a GPU, but I have a data size of 10,000 elements. Each element has to be processed; let's say it has to be doubled.
What would be the better way to do this?
Way 1: Serialization of threads:
Start 10,000 threads.
Then the code is:
output[get_global_id(0)] = 2*input[get_global_id(0)];
Way 2: Looping
Start 1,000 threads.
Then the code is:
for (int i = 0; i < 10; i++) {
    output[get_global_id(0) + i*1000] = 2*input[get_global_id(0) + i*1000];
}
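To make the two variants concrete, here is a minimal host-side sketch in plain C that emulates both indexing schemes sequentially. It is not OpenCL code and says nothing about speed; it only shows that both ways touch every element exactly once and produce the same output. The names `way1_work_item`, `way2_work_item`, `run_way1`, `run_way2`, `N_ELEMENTS` and `N_THREADS` are made up for illustration, `gid` stands in for `get_global_id(0)`, and `float` data is an assumption.

```c
#include <stddef.h>

#define N_ELEMENTS 10000  /* data size from the question */
#define N_THREADS  1000   /* number of concurrently running threads from the question */

/* Way 1: one work-item per element; gid plays the role of get_global_id(0). */
static void way1_work_item(size_t gid, const float *input, float *output)
{
    output[gid] = 2.0f * input[gid];
}

/* Way 2: each of the N_THREADS work-items handles 10 elements,
   spaced N_THREADS apart. */
static void way2_work_item(size_t gid, const float *input, float *output)
{
    for (size_t i = 0; i < N_ELEMENTS / N_THREADS; i++) {
        size_t idx = gid + i * N_THREADS;
        output[idx] = 2.0f * input[idx];
    }
}

/* Sequential stand-in for launching 10,000 work-items (hypothetical helper). */
void run_way1(const float *input, float *output)
{
    for (size_t gid = 0; gid < N_ELEMENTS; gid++)
        way1_work_item(gid, input, output);
}

/* Sequential stand-in for launching 1,000 work-items (hypothetical helper). */
void run_way2(const float *input, float *output)
{
    for (size_t gid = 0; gid < N_THREADS; gid++)
        way2_work_item(gid, input, output);
}
```

Calling `run_way1` and `run_way2` on the same input array fills both output arrays with identical values, so the two ways differ only in how the work is distributed over threads.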
Which of the two ways is the faster one?
I hope I made myself clear, and appreciate your answers.
Hi!
I ran exactly these tests yesterday (AMD, Juniper core). Your kernel 1 is 20% faster than kernel 2, assuming that the stride in the for loop is big enough. In my case the stride for kernel 2 had to be bigger than 8192, possibly because that gives a shorter for loop and more threads. If the loop has fewer than 32 iterations, it also seems to work fine. If, however, the for loop has a step of 1 (stride = 1), it runs much slower (by 10x), even if multiple threads are launched, each working on a different section of the same array. I can't say exactly why. I also tried a stride other than 1 for kernel 1, and it was always slower. My guess is that the for loop in kernel 2 is slower because of a lack of vectorized optimization in loops, and that the difference is not hardware related.
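For readers who want to see the difference between the two loop layouts mentioned above, here is a small sketch in plain C of the index arithmetic only. The function names `interleaved_index` and `chunked_index` and the parameter `per_thread` are hypothetical and not taken from the kernels that were actually measured.

```c
#include <stddef.h>

/* Large stride: on iteration i, work-item gid touches an element that is
   n_threads away from the one it touched on the previous iteration.
   Neighbouring work-items touch neighbouring elements. */
size_t interleaved_index(size_t gid, size_t i, size_t n_threads)
{
    return gid + i * n_threads;
}

/* Stride 1: each work-item walks its own contiguous section of the array.
   Neighbouring work-items are per_thread elements apart. */
size_t chunked_index(size_t gid, size_t i, size_t per_thread)
{
    return gid * per_thread + i;
}
```

With 1,000 threads and 10 elements per thread, work-items 0 and 1 read elements 0 and 1 on their first iteration in the interleaved layout, but elements 0 and 10 in the chunked layout. One commonly cited explanation for the slowdown with stride 1, offered here as an assumption rather than something measured in this thread, is that GPUs can combine memory accesses from adjacent work-items only when those accesses fall on adjacent addresses.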
Regards!
Atmapuri