Two levels of parallelism

I am confused about a problem and would like to know whether the following is possible.

I have an algorithm that should run in parallel. Given an array a, the kernel code executes once for each element of the array; that part is fine for me. But if I have 3 different data sets for a, how can I run all 3 sets in parallel? That gives two levels of parallelism:
1. The elements of a execute the kernel code in parallel.
2. The 3 different versions of a (a run on three different data sets) execute in parallel.

I hope that was clear.

Hello alhowaidi!
Let's say that a has A x B elements, that is, a[A][B]. Instead of creating 3 different arrays, you could put all the items into a single array by expanding it to a[3A][B], so that the three data sets are stacked one after another along the first dimension. (Expanding both dimensions to a[3A][3B] would allocate nine times the original size, which is more than the three data sets need.) What you want to do then is use an offset for your threads so that the first set of threads accesses rows 0 to A-1, the second set accesses rows A to 2A-1, and the third set accesses rows 2A to 3A-1.
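To make the layout concrete, here is a minimal host-side sketch in plain C of stacking several data sets into one contiguous buffer and computing the flat index of an element. The helper names pack_sets and packed_index are hypothetical, and the sketch assumes every data set has the same dimensions and is stored row by row.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `sets` arrays of rows*cols floats into one
 * buffer, back to back, so that data set s starts at s*rows*cols. */
void pack_sets(float *packed, const float *const *src,
               size_t sets, size_t rows, size_t cols)
{
    for (size_t s = 0; s < sets; ++s) {
        memcpy(packed + s * rows * cols, src[s],
               rows * cols * sizeof(float));
    }
}

/* Hypothetical helper: flat index of element (row, col) of data set `set`
 * in the packed buffer. The term set*rows is the row offset of that set. */
size_t packed_index(size_t set, size_t row, size_t col,
                    size_t rows, size_t cols)
{
    return (set * rows + row) * cols + col;
}
```

The packed buffer would then be uploaded as a single device buffer, and each thread only needs to know which data set it belongs to in order to find its rows.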

Are you familiar with how to use work groups or 2D/3D NDRange? If so, the offset could be very easily calculated.
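As a rough sketch of the offset calculation with a 2D NDRange, the plain C code below emulates the index math on the host: the inner loop variable stands in for get_global_id(0) (the row) and the outer one for get_global_id(1) (the data set). The function names work_item and run_batched are hypothetical, and the per-row dot product with a coefficient vector C is only an assumed example of what each work-item might compute.

```c
#include <stddef.h>

/* Emulates one work-item. In an OpenCL kernel, `row` would come from
 * get_global_id(0) and `set` from get_global_id(1). */
void work_item(size_t row, size_t set,
               const float *a, const float *C, float *out,
               size_t rows, size_t cols)
{
    /* Offsets: skip the data sets that come before this one. */
    const float *a_set = a + set * rows * cols;
    const float *c_set = C + set * cols;

    float sum = 0.0f;
    for (size_t j = 0; j < cols; ++j) {
        sum += c_set[j] * a_set[row * cols + j];
    }
    /* One result per row and per data set. */
    out[set * rows + row] = sum;
}

/* Emulates a 2D NDRange with global size (rows, sets). On a device all
 * of these work-items may run in parallel. */
void run_batched(const float *a, const float *C, float *out,
                 size_t sets, size_t rows, size_t cols)
{
    for (size_t set = 0; set < sets; ++set) {
        for (size_t row = 0; row < rows; ++row) {
            work_item(row, set, a, C, out, rows, cols);
        }
    }
}
```

In a real kernel the two loops disappear: the host enqueues the kernel once with a two-dimensional global size of (rows, 3), so both levels of parallelism come from a single launch.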

Please let me know if something is unclear or needs further explanation.

From the sound of what you are trying to do, you'd need a separate GPU for each a that you want to process simultaneously. I have had success with transferring data and running a kernel simultaneously on 1 GPU (2 contexts).

Hello omgi,

I will post some code here. ‘a’ is a 2-dimensional array a[row][col], and C[row] is a one-dimensional array. If I want to execute the kernel on 3 different data sets for a and C, is it just a matter of expanding a[row][col] and C to contain the 3 arrays? And can you please explain in more detail how to calculate the offset?

__kernel void test(const __global float *a,
                   const __global float *C,
                   __global float *Output,
                   const int col)
{
    const int ar = get_local_id(0);
    float sum = 0;
    for (int j = 0; j < col; ++j)
    {
        sum += C[j] * a[ar * col + j];
    }