Kernel sincronization problem

I need to perform this code in opencl.

for(i = 0; i < M; i++)
{
for( j = 0; j < N; j++)
{
tau[ f[k][i] ][ f[k][i+1] ] += Q/L[k];
}
}

Where f is a Mx(N+1) matrix and tau is a NxN matrix.

When I set the range to M x N and try to perform the computation, i need to do sincronization or an operation can override the other. I tried to use the atom_add but Nvidia implementation just allowed for int and long. I need float.

What I have to do?

I need to perform this code in opencl.

for(i = 0; i < M; i++)
{
for( j = 0; j < N; j++)
{
tau[ f[k][i] ][ f[k][i+1] ] += Q/L[k];
}
}

Where f is a Mx(N+1) matrix and tau is a NxN matrix.

When I set the range to M x N and try to perform the computation, i need to do sincronization or an operation can override the other. I tried to use the atom_add but Nvidia implementation just allowed for int and long. I need float.

What I have to do?

Why you need the synchronization? Is the matrix tau indexed dependent on previous write? What’s k - variable or constant? Please clarify your code a bit more, from first point of view there seems no need of synchronization.

If there is really a need to have previous values of input matrices make 2 identical inputs - one for read and one for write. Or another options is to end the whole kernel and start the computation once again. There is no barrier for all threads, just for a block of threads.

Why you need the synchronization? Is the matrix tau indexed dependent on previous write? What’s k - variable or constant? Please clarify your code a bit more, from first point of view there seems no need of synchronization.

If there is really a need to have previous values of input matrices make 2 identical inputs - one for read and one for write. Or another options is to end the whole kernel and start the computation once again. There is no barrier for all threads, just for a block of threads.

Sorry, the code was wrong. There is the right one:

for(k = 0; k < M; k++)

{

for( i = 0; i < N; i++)

{

tau[ f[k][i] ][ f[k][i+1] ] += Q/L[k];

}

}

My problem is… if f is:

[0,1,2,3]

[0,1,2,3]

Then, when i try to update tau, it will try to update tau[0][1] twice, then tau[1][2] twice and go. That’s why i need syncronization.

L is a vector -> size M, and Q a constant.

Sorry, the code was wrong. There is the right one:

for(k = 0; k < M; k++)

{

for( i = 0; i < N; i++)

{

tau[ f[k][i] ][ f[k][i+1] ] += Q/L[k];

}

}

My problem is… if f is:

[0,1,2,3]

[0,1,2,3]

Then, when i try to update tau, it will try to update tau[0][1] twice, then tau[1][2] twice and go. That’s why i need syncronization.

L is a vector -> size M, and Q a constant.

ok, I see what you mean. The algorithm as it is now is not suitable for parallel computing. However, you can re-struct it a bit. E.g. just a naive solution (depends how much varies numbers in f and what is its relation to M,N).

run 2Drange kernel
int x = get_global_id(0);
int y = get_global_id(1);

if ( f[k][i] != x || f[k][i+1] != y ) return;
tau[y] += Q/L;

so now you have warranty that each matrix tau’s element is correctly handled, with only one thread.

ok, I see what you mean. The algorithm as it is now is not suitable for parallel computing. However, you can re-struct it a bit. E.g. just a naive solution (depends how much varies numbers in f and what is its relation to M,N).

run 2Drange kernel
int x = get_global_id(0);
int y = get_global_id(1);

if ( f[k][i] != x || f[k][i+1] != y ) return;
tau[y] += Q/L;

so now you have warranty that each matrix tau’s element is correctly handled, with only one thread.

Hm, i get it. But how i define the k and the i? k is the number of the solution (M solutions), and i is the size of each solution (size N+1). You propose to create an NxN kernel using x and y, but i still need this two itens.

Hm, i get it. But how i define the k and the i? k is the number of the solution (M solutions), and i is the size of each solution (size N+1). You propose to create an NxN kernel using x and y, but i still need this two itens.

sorry, my mistake - it’s not the right way to parallelize the problem.

Hm, what about precomputing double number frequencies in f (f[k][i] and f[k][i+1] ) and then run every thread for such double number. Instead of having tau[ f[k][i] ][ f[k][i+1] ] += Q/L[k]; you will be setting += Q/L[k]* occurrences ?

sorry, my mistake - it’s not the right way to parallelize the problem.

Hm, what about precomputing double number frequencies in f (f[k][i] and f[k][i+1] ) and then run every thread for such double number. Instead of having tau[ f[k][i] ][ f[k][i+1] ] += Q/L[k]; you will be setting += Q/L[k]* occurrences ?