Two levels of parallelism

I am confused about a problem and would like to know whether the following is possible:

I have an algorithm that should run in parallel.
Suppose we have an array a, and the kernel code is executed for each element of this array. That part is fine for me. But if I have 3 different data sets for a, how can we run the 3 sets in parallel? That would give two levels of parallelism:
1. The elements of a are processed in parallel.
2. The 3 different versions of a (a on three different data sets) are processed in parallel.

Let's say that a has A x B elements, that is, a[A][B]. Instead of creating 3 different arrays, you could put all the items into a single array by expanding it to a[3A][B], with valid indices running from a[0][0] to a[3A-1][B-1]. Then use a suitable offset for your threads so that the first set of threads accesses a[0][0] to a[A-1][B-1], the second set accesses a[A][0] to a[2A-1][B-1], and the third set accesses a[2A][0] to a[3A-1][B-1].

Are you familiar with how to use work groups or 2D/3D NDRange? If so, the offset could be very easily calculated.
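
As a rough illustration of the offset calculation, here is a minimal sketch in plain C, so it can be compiled without an OpenCL runtime. It assumes the three data sets are stacked one after another in a single row-major buffer (3*A rows of B columns) and that the kernel is launched over a 1D range of 3*A work-items. The names `work_item_pos` and `locate` are hypothetical and used only for this example. In a real kernel, `gid` would be the value of `get_global_id(0)`.

```c
#include <stddef.h>

/* Which data set and which row a flat work-item id refers to. */
typedef struct {
    size_t set;     /* index of the data set, 0 .. num_sets-1 */
    size_t row;     /* row inside that data set, 0 .. rows-1 */
    size_t offset;  /* index of the row's first element in the stacked buffer */
} work_item_pos;

/* gid:  what get_global_id(0) would return in a 1D range of num_sets*rows items
   rows: rows per data set (A)
   cols: columns per data set (B) */
static work_item_pos locate(size_t gid, size_t rows, size_t cols)
{
    work_item_pos p;
    p.set = gid / rows;                        /* which of the stacked arrays */
    p.row = gid % rows;                        /* row within that array */
    p.offset = (p.set * rows + p.row) * cols;  /* same value as gid * cols */
    return p;
}
```

With a 2D NDRange the division and modulo are not needed: the row can be taken from dimension 0 and the data set from dimension 1, and only the `offset` line remains.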

From the sound of what you are trying to do, you would need a separate GPU for each a that you want to process simultaneously. I have had success with transferring data and running a kernel simultaneously on 1 GPU (2 contexts).

Here is my code. ‘a’ is a 2-dimensional array a[row][col], and C[row] is a one-dimensional array. If I want to execute the kernel on 3 different data sets for a and C, is it just a matter of expanding a[row][col] and C so that they contain the 3 arrays? And could you please explain in more detail how to calculate the offset?

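__kernel void test(const __global float *a,
                   const __global float *C,
                   __global float *Output,
                   const int col)
{
    // Row of a handled by this work-item. get_local_id(0) is unique only
    // within one work-group; with several work-groups, get_global_id(0)
    // would be needed to give every row its own index.
    const int ar = get_local_id(0);

    float sum = 0;

    // Dot product of row ar of a with C.
    for (int j = 0; j < col; ++j)
    {
        sum += C[j] * a[ar*col + j];
    }

    // Assumed final step (the posted code breaks off after the loop body):
    // store the result for this row.
    Output[ar] = sum;
}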
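
One way the kernel above could be extended to several data sets is sketched below in plain C, so that it can be checked without an OpenCL runtime. The sketch rests on assumptions that are not in the original post: the data sets for a are stored one after another in a single buffer, the same holds for C (col values per data set, matching how the loop indexes it), and Output holds one value per row per data set. The function names `test_batched_item` and `run_batched` and the parameters `rows` and `num_sets` are hypothetical. In an actual kernel, `row` would come from `get_global_id(0)` and `set` from `get_global_id(1)` of a 2D NDRange, and the loop nest in `run_batched` would be replaced by the kernel launch.

```c
#include <stddef.h>

/* Body of a batched version of the kernel, written as plain C.
   row:    would be get_global_id(0), range 0 .. rows-1
   set:    would be get_global_id(1), range 0 .. num_sets-1
   a:      num_sets * rows * col floats, data sets stored back to back
   C:      num_sets * col floats
   Output: num_sets * rows floats */
static void test_batched_item(size_t row, size_t set,
                              const float *a, const float *C, float *Output,
                              size_t rows, size_t col)
{
    const float *a_set = a + set * rows * col;  /* start of this set's matrix */
    const float *C_set = C + set * col;         /* start of this set's vector */
    float sum = 0.0f;

    for (size_t j = 0; j < col; ++j)
        sum += C_set[j] * a_set[row * col + j];

    Output[set * rows + row] = sum;
}

/* Stand-in for the NDRange launch: visits every (row, set) pair once. */
static void run_batched(const float *a, const float *C, float *Output,
                        size_t rows, size_t col, size_t num_sets)
{
    for (size_t set = 0; set < num_sets; ++set)
        for (size_t row = 0; row < rows; ++row)
            test_batched_item(row, set, a, C, Output, rows, col);
}
```

Because every (row, set) pair writes to its own element of Output and only reads from a and C, the work-items are independent, so both levels of parallelism can be covered by a single launch.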