Kernel synchronization problem

bractus · November 14, 2010, 8:33pm

I need to run this code in OpenCL.

for (i = 0; i < M; i++)
{
    for (j = 0; j < N; j++)
    {
        tau[ f[k][i] ][ f[k][i+1] ] += Q/L[k];
    }
}

Here f is an M x (N+1) matrix and tau is an N x N matrix.

When I set the range to M x N and run the computation, I need synchronization, otherwise one operation can overwrite another. I tried to use atom_add, but the Nvidia implementation only allows it for int and long. I need float.

What should I do?
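A common workaround when only integer atomics are available is to build a float add out of an integer compare-and-swap loop. The sketch below shows the idea in host-side C11 rather than OpenCL C, so it is an illustration of the technique and not the Nvidia API: the helper names are hypothetical, and on a device the loop would sit on top of the integer compare-exchange built-in (atom_cmpxchg from the 32-bit atomics extensions, or atomic_cmpxchg in OpenCL 1.1), assuming the implementation exposes it.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit unsigned integer and back. */
static uint32_t float_bits(float x) { uint32_t u; memcpy(&u, &x, sizeof u); return u; }
static float bits_float(uint32_t u) { float x; memcpy(&x, &u, sizeof x); return x; }

/* Hypothetical helper: atomically performs "cell += value", where the cell
   stores the bit pattern of a float. The loop retries whenever another
   thread changed the cell between the read and the compare-and-swap. */
static void atomic_add_float(_Atomic uint32_t *cell, float value)
{
    uint32_t old_bits = atomic_load(cell);
    uint32_t new_bits;
    do {
        new_bits = float_bits(bits_float(old_bits) + value);
        /* On failure, old_bits is refreshed with the current contents. */
    } while (!atomic_compare_exchange_weak(cell, &old_bits, new_bits));
}
```

Under heavy contention on a few cells of tau the retries serialize the updates, so this tends to pay off only when collisions are rare.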

karbous · November 14, 2010, 9:06pm

Why do you need synchronization? Does the indexing of tau depend on a previous write? What is k, a variable or a constant? Please clarify your code a bit more; at first glance there seems to be no need for synchronization.

If you really need the previous values of the input matrices, make two identical buffers, one for reading and one for writing. Another option is to end the whole kernel and start the computation again. There is no barrier for all threads, just for a block of threads (a work-group in OpenCL terms).

bractus · November 14, 2010, 11:41pm

Sorry, the code was wrong. Here is the right one:

for (k = 0; k < M; k++)
{
    for (i = 0; i < N; i++)
    {
        tau[ f[k][i] ][ f[k][i+1] ] += Q/L[k];
    }
}

My problem is this: if f is

[0,1,2,3]

then, when I try to update tau, it will try to update tau[0][1] twice, then tau[1][2] twice, and so on. That's why I need synchronization.

L is a vector of size M, and Q is a constant.
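For reference, here is a serial C sketch of the corrected loop. The flat row-major layout, the function name and the parameter names are assumptions made for this sketch, as is the requirement that every entry of f lies in [0, N). It also makes the collision concrete: whenever two rows of f contain the same consecutive pair, both add to the same cell of tau.

```c
#include <stddef.h>

/* Serial reference for the corrected loop (layout is an assumption):
   f:   m rows of (n + 1) indices, each index in [0, n)
   len: m values, the L vector from the post
   tau: n * n accumulators, row-major, updated in place */
static void update_tau_serial(float *tau, const int *f, const float *len,
                              float q, size_t m, size_t n)
{
    for (size_t k = 0; k < m; k++) {
        float delta = q / len[k];            /* Q / L[k], constant per row */
        for (size_t i = 0; i < n; i++) {
            size_t from = (size_t)f[k * (n + 1) + i];
            size_t to   = (size_t)f[k * (n + 1) + i + 1];
            tau[from * n + to] += delta;     /* rows sharing a pair collide here */
        }
    }
}
```

With two identical rows {0, 1, 2, 0}, L = {2, 4} and Q = 1, the cell tau[0][1] receives 0.5 from the first row and 0.25 from the second. If those two additions run in parallel without synchronization, one of them can be lost.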

karbous · November 15, 2010, 8:20am

OK, I see what you mean. The algorithm as it is now is not suitable for parallel computing. However, you can restructure it a bit. For example, here is a naive solution (how well it works depends on how much the numbers in f vary and how they relate to M and N).

Run a 2D-range kernel:

int x = get_global_id(0);
int y = get_global_id(1);

if ( f[k][i] != x || f[k][i+1] != y ) return;
tau[x][y] += Q/L[k];

Now you have a guarantee that each element of tau is handled correctly, by only one thread.

bractus · November 15, 2010, 2:26pm

Hm, I get it. But how do I define k and i? k is the index of the solution (there are M solutions), and i is the position within each solution (each has size N+1). You propose an N x N kernel using x and y, but I still need these two items.
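One possible reading of the N x N idea is that each (x, y) work-item loops over every k and i itself and keeps only the entries that match its own cell. The sketch below emulates that on the host in plain C. The function names and the flat row-major layout are assumptions, and the outer loops stand in for the 2D range that get_global_id(0) and get_global_id(1) would index on a device.

```c
#include <stddef.h>

/* Body of one hypothetical work-item (x, y). It is the only writer of
   tau[x][y], so no atomic operation is needed. */
static void update_tau_cell(float *tau, const int *f, const float *len,
                            float q, size_t m, size_t n, size_t x, size_t y)
{
    float sum = 0.0f;
    for (size_t k = 0; k < m; k++) {
        const int *row = f + k * (n + 1);    /* row k of f, n + 1 entries */
        for (size_t i = 0; i < n; i++) {
            if ((size_t)row[i] == x && (size_t)row[i + 1] == y)
                sum += q / len[k];
        }
    }
    tau[x * n + y] += sum;
}

/* Host-side stand-in for launching an n x n range. */
static void update_tau_gather(float *tau, const int *f, const float *len,
                              float q, size_t m, size_t n)
{
    for (size_t x = 0; x < n; x++)
        for (size_t y = 0; y < n; y++)
            update_tau_cell(tau, f, len, q, m, n, x, y);
}
```

The price is that every work-item reads all of f, so the total work grows from M x N to roughly N x N x M x N comparisons. Because the contributions are summed in a different order than in the serial loop, the floating-point result may differ in the last bits.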

karbous · November 15, 2010, 2:55pm

Sorry, my mistake; that is not the right way to parallelize the problem.

Hm, what about precomputing the frequencies of the number pairs in f (f[k][i] and f[k][i+1]) and then running one thread per such pair? Instead of tau[ f[k][i] ][ f[k][i+1] ] += Q/L[k]; you would be doing += Q/L[k] * occurrences.
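The frequency idea can be sketched in two passes per row k: first count how often each pair occurs, then let every cell of tau add Q/L[k] times its own count. The C below is a host-side illustration under assumed names and a flat row-major layout, with a caller-provided N x N scratch array for the counts. On a device the counting pass could plausibly use the integer atom_add that the first post says is available, while the second pass needs no atomics because each cell has exactly one writer.

```c
#include <stddef.h>
#include <string.h>

/* Pass 1 for one row: count the occurrences of each pair (from, to).
   counts is an n * n scratch array; row has n + 1 entries in [0, n). */
static void count_pairs(int *counts, const int *row, size_t n)
{
    memset(counts, 0, n * n * sizeof counts[0]);
    for (size_t i = 0; i < n; i++)
        counts[(size_t)row[i] * n + (size_t)row[i + 1]] += 1;
}

/* Pass 2 for one row: each cell is written exactly once. */
static void apply_counts(float *tau, const int *counts, float delta, size_t n)
{
    for (size_t c = 0; c < n * n; c++)
        tau[c] += delta * (float)counts[c];
}

/* Run both passes for every row k, with delta = Q / L[k]. */
static void update_tau_counted(float *tau, int *counts, const int *f,
                               const float *len, float q, size_t m, size_t n)
{
    for (size_t k = 0; k < m; k++) {
        count_pairs(counts, f + k * (n + 1), n);
        apply_counts(tau, counts, q / len[k], n);
    }
}
```

Multiplying by the count instead of adding repeatedly can round slightly differently from the serial loop, and handling one row at a time trades parallelism across k for a small scratch buffer.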
