I have a linear solver that is basically a Gauss-Seidel iteration:
I have a 2D matrix in which each element depends on its 4-neighbourhood.

A[i,j] = A[i-1,j] + A[i,j+1] + A[i+1,j] + A[i,j-1]

To compute the new A[i,j], I need the new values of A[i-1,j] and A[i,j+1],
so my iteration is completely sequential.
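
To make the dependency concrete, here is a minimal host-side C++ sketch of one in-place sweep. It is an illustration under stated assumptions, not the poster's actual solver: the function name, the row-major storage, the fixed border cells and the weight `w` are all hypothetical. The sweep order (rows top to bottom, columns right to left) is chosen only because it makes A[i-1,j] and A[i,j+1] the already updated neighbours, as described above.

```cpp
#include <vector>

// One in-place Gauss-Seidel sweep over the interior of a rows x cols grid
// stored row-major in `a`. Border cells are assumed fixed and are not written.
// Rows run top to bottom and columns right to left, so a[i-1][j] and
// a[i][j+1] already hold this sweep's new values when a[i][j] is computed.
// `w` is a hypothetical weight: 1.0f gives the plain sum from the post,
// 0.25f gives the average of the four neighbours.
void gauss_seidel_sweep(std::vector<float>& a, int rows, int cols, float w) {
    for (int i = 1; i < rows - 1; ++i) {
        for (int j = cols - 2; j >= 1; --j) {
            const float up    = a[(i - 1) * cols + j];  // new value
            const float right = a[i * cols + j + 1];    // new value
            const float down  = a[(i + 1) * cols + j];  // old value
            const float left  = a[i * cols + j - 1];    // old value
            a[i * cols + j] = w * (up + right + down + left);
        }
    }
}
```

Because every cell reads two values written earlier in the same sweep, the cells of one sweep cannot simply be handed to independent threads.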

Now, how can I use atomics to increase the performance?

How big is the array?

You could set your kernel dimensions so that you parallelize the element calculation as much as possible
(this works if you use separate input and output arrays/matrices).
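
A rough host-side C++ sketch of the separate input/output idea follows. Note that it swaps in a Jacobi-style update: every output cell reads only old values from the input, so it is no longer strictly Gauss-Seidel and will generally need a different number of iterations. The function name, the row-major layout, the fixed borders and the weight `w` are assumptions for illustration; on the GPU, the loop body would correspond to the work of one thread.

```cpp
#include <vector>

// One Jacobi-style step: reads only from `in`, writes only to `out`.
// Every interior cell is independent of the others, so the two loops
// could be replaced by one GPU thread per (i, j).
// `w` is a hypothetical weight (1.0f = plain sum, 0.25f = average).
void jacobi_step(const std::vector<float>& in, std::vector<float>& out,
                 int rows, int cols, float w) {
    out = in;  // carries the (assumed fixed) border cells over unchanged
    for (int i = 1; i < rows - 1; ++i) {
        for (int j = 1; j < cols - 1; ++j) {
            out[i * cols + j] = w * (in[(i - 1) * cols + j] + in[i * cols + j + 1]
                                   + in[(i + 1) * cols + j] + in[i * cols + j - 1]);
        }
    }
}
```

For repeated steps, the two buffers would be swapped after each call so that the previous output becomes the next input.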

If you iterate, you could use streams and issue work at least 1 iteration ahead.
Theoretically, the 2nd iteration's row n - 1 can already commence once the 1st iteration's row n is complete.
Streams would help to ensure both concurrency and synchronization.

Hi jimmy,
The size is 78*90 (it is actually an image).
I am not getting the stream idea; if I understand correctly, are you talking about pipe-lining it?

I have no idea what ‘pipe-lining it’ means.

The size is somewhat small; nevertheless:

Assume that, with your kernel dimensions, you manage to cover a full row (all columns) and a number of rows.
So, if you divide the matrix into a number of sub-matrices, each a whole number of rows, you can start kernels working on the sub-matrices sequentially.
But this only pays off if you iterate.

Assume you have a matrix of 75 rows, and assume a kernel covers 5 rows.
You can then cover the matrix in parallel with 15 kernels (not that covering the matrix with only 1 kernel would be impossible).
With iterations, you can exploit the fact that a) each iteration consists of row operations, and b) the row n operation of the next iteration can start once the row n + 1 operation of the current iteration is complete.
This way, you expose parallelism across the row operations of several iterations, instead of only across the row operations within a single iteration.
You would need streams to implement both the required concurrency and the synchronization.

Below is an elementary demonstration for a 10-row matrix, with a row multiple of 5, and 2 iterations issued in 2 streams, S1 and S2.
It assumes that the row dependency is simply n, not n - 1 and n + 1 as in your case.

kernel(iteration 1: rows 0 - 4) S1
kernel(iteration 1: rows 5 - 9) S2
kernel(iteration 2: rows 0 - 4) S1
kernel(iteration 2: rows 5 - 9) S2
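
For the real n - 1 / n + 1 dependency, the ordering can be checked on the host before writing any kernels. The C++ sketch below is only a model under stated assumptions: each call to the hypothetical `sweep_rows` stands in for one kernel launch over a row block, the update is done in place with fixed borders, and task (iteration k, block b) is placed at step t = 2*k + b. The factor 2 comes from the rule that block b of the next iteration must wait for block b + 1 of the current one. Tasks that share a step touch non-adjacent blocks, so they could run concurrently.

```cpp
#include <algorithm>
#include <vector>

// In-place update of rows [row_begin, row_end); models one kernel launch.
// `w` is a hypothetical weight (1.0f = plain sum, 0.25f = average).
void sweep_rows(std::vector<float>& a, int cols, int row_begin, int row_end, float w) {
    for (int i = row_begin; i < row_end; ++i)
        for (int j = cols - 2; j >= 1; --j)
            a[i * cols + j] = w * (a[(i - 1) * cols + j] + a[i * cols + j + 1]
                                 + a[(i + 1) * cols + j] + a[i * cols + j - 1]);
}

// Plain schedule: finish iteration k over all blocks before starting k + 1.
void run_sequential(std::vector<float>& a, int rows, int cols,
                    int block_rows, int iterations, float w) {
    for (int k = 0; k < iterations; ++k)
        for (int begin = 1; begin < rows - 1; begin += block_rows)
            sweep_rows(a, cols, begin, std::min(begin + block_rows, rows - 1), w);
}

// Pipelined schedule: task (k, b) runs at step t = 2*k + b.
// Returns the largest number of tasks that share one step, i.e. how many
// launches could overlap at best under this ordering.
int run_pipelined(std::vector<float>& a, int rows, int cols,
                  int block_rows, int iterations, float w) {
    const int blocks = (rows - 2 + block_rows - 1) / block_rows;
    const int steps = 2 * (iterations - 1) + blocks;
    int max_overlap = 0;
    for (int t = 0; t < steps; ++t) {
        int tasks_this_step = 0;
        for (int k = 0; k < iterations; ++k) {
            const int b = t - 2 * k;
            if (b < 0 || b >= blocks) continue;
            const int begin = 1 + b * block_rows;
            sweep_rows(a, cols, begin, std::min(begin + block_rows, rows - 1), w);
            ++tasks_this_step;
        }
        max_overlap = std::max(max_overlap, tasks_this_step);
    }
    return max_overlap;
}
```

Running both schedules on copies of the same grid and comparing the results is a cheap way to confirm the ordering is safe. One thing the model makes visible: with only 2 row blocks, as in the 10-row demonstration, no two tasks ever share a step, so at least 3 blocks are needed before iterations overlap. On the GPU, the step ordering would have to be enforced across streams, for example with `cudaEventRecord` and `cudaStreamWaitEvent`.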

Hi, thanks a lot... I will implement it, and if I have any doubts I will ask you :)