Serialization vs. looping

Hello everyone,

I have a general question about OpenCL performance.

Imagine that 1,000 threads can run concurrently on a GPU, but my data set has 10,000 elements. Each element has to be processed; let's say it has to be doubled.

What would be the better way to do this?

Way 1: Serialization of threads:

Start 10,000 threads.

Then the code is:

output[get_global_id(0)] = 2*input[get_global_id(0)];

Way 2: Looping

Start 1,000 threads.

Then the code is:


   for (int i = 0; i < 10; i++)   // 10,000 elements / 1,000 threads = 10 iterations
      output[get_global_id(0)+i*1000] = 2*input[get_global_id(0)+i*1000];


Which of the two ways is the faster one?

I hope I made myself clear, and I appreciate your answers.


I ran exactly these tests yesterday (AMD, Juniper core). Your kernel 1 was about 20% faster than kernel 2, provided the stride in the for loop was big enough. In my case the stride for kernel 2 had to be larger than 8192, possibly because that gives a shorter for loop and more threads. If the loop has fewer than 32 iterations, it also seems to work fine. If, however, the for loop has a step of 1 (stride = 1), it runs much slower (about 10x), even when multiple threads are launched, each working on a different section of the same array. I can't say exactly why. I also tried a stride other than 1 for kernel 1, and it was always slower. My guess is that the for loop in kernel 2 is slower because of a lack of vectorization in the loop, and that the difference is not hardware related.
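To make the three access patterns discussed in this thread concrete, here is a minimal plain-C sketch that emulates them on the CPU. It is only an illustration under stated assumptions: all function names (way1_item, way2_strided_item, way2_chunked_item and the run_* drivers) are hypothetical, the gid parameter stands in for get_global_id(0), and the "launch" is just a sequential loop, so the code says nothing about GPU timing. A factor often cited for GPUs is memory coalescing (neighbouring work-items touching neighbouring addresses at the same time); whether that explains the numbers above was not measured here.

```c
#include <assert.h>
#include <stddef.h>

/* Way 1: one work-item per element. gid stands in for get_global_id(0). */
static void way1_item(size_t gid, const float *input, float *output)
{
    output[gid] = 2.0f * input[gid];
}

/* Way 2, strided: work-item gid handles gid, gid + stride, gid + 2*stride, ...
 * With stride equal to the number of work-items, work-items gid and gid + 1
 * touch adjacent elements in every iteration. */
static void way2_strided_item(size_t gid, size_t stride, size_t n,
                              const float *input, float *output)
{
    for (size_t idx = gid; idx < n; idx += stride)
        output[idx] = 2.0f * input[idx];
}

/* Way 2, loop step of 1: work-item gid handles its own contiguous section.
 * In any given iteration, work-items gid and gid + 1 are chunk elements apart. */
static void way2_chunked_item(size_t gid, size_t chunk, size_t n,
                              const float *input, float *output)
{
    size_t begin = gid * chunk;
    size_t end = (begin + chunk < n) ? begin + chunk : n;
    for (size_t idx = begin; idx < end; ++idx)
        output[idx] = 2.0f * input[idx];
}

/* Sequential stand-ins for a kernel launch (assumption: a real GPU would run
 * many of these work-items at the same time). */
void run_way1(size_t n, const float *input, float *output)
{
    for (size_t gid = 0; gid < n; ++gid)
        way1_item(gid, input, output);
}

void run_way2_strided(size_t work_items, size_t n,
                      const float *input, float *output)
{
    assert(work_items > 0);
    for (size_t gid = 0; gid < work_items; ++gid)
        way2_strided_item(gid, work_items, n, input, output);
}

void run_way2_chunked(size_t work_items, size_t n,
                      const float *input, float *output)
{
    assert(work_items > 0);
    size_t chunk = (n + work_items - 1) / work_items; /* ceiling division */
    for (size_t gid = 0; gid < work_items; ++gid)
        way2_chunked_item(gid, chunk, n, input, output);
}
```

With the numbers from the question, run_way1(10000, ...) corresponds to starting 10,000 threads, run_way2_strided(1000, 10000, ...) to the loop with a stride of 1,000, and run_way2_chunked(1000, 10000, ...) to the stride-1 case where each thread walks its own section of 10 elements. All three produce the same output; they differ only in which elements each work-item touches and in what order.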