Hi,

I have an N x N square matrix of integers, stored on the device as a 1-D array for convenience.

I’m implementing an algorithm which requires the following to be performed:

There are 2N - 1 anti-diagonals in this square. (Anti-diagonals are the parallel lines running from the top edge to the left edge and from the right edge to the bottom edge.)

I need a for loop with 2N - 1 iterations, with each iteration computing one anti-diagonal, starting from the top left and ending at the bottom right.
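
To make the geometry concrete, here is a small sketch of the index arithmetic. It assumes row-major storage (element (i, j) lives at index i * N + j), which the description above does not actually state, and it numbers the anti-diagonals d = 0 ... 2N - 2 so that every cell on diagonal d satisfies i + j = d. The helper names are hypothetical and only for illustration.

```cuda
#include <vector>

// Hypothetical helpers; names are illustrative only.
// Assumption: row-major layout, element (i, j) at index i * n + j.

// Smallest row index that lies on anti-diagonal d.
__host__ __device__ inline int diag_first_row(int d, int n) {
    return (d < n) ? 0 : d - (n - 1);
}

// Number of cells on anti-diagonal d (grows from 1 to n, then shrinks to 1).
__host__ __device__ inline int diag_length(int d, int n) {
    int last_row = (d < n) ? d : n - 1;
    return last_row - diag_first_row(d, n) + 1;
}

// 1-D indices of the cells on anti-diagonal d, listed from the
// top-right end of the diagonal to the bottom-left end.
std::vector<int> diag_indices(int d, int n) {
    std::vector<int> idx;
    int first = diag_first_row(d, n);
    for (int k = 0; k < diag_length(d, n); ++k) {
        int i = first + k;  // row
        int j = d - i;      // column, since i + j == d
        idx.push_back(i * n + j);
    }
    return idx;
}
```

For N = 3, diagonal d = 1 contains the cells (0, 1) and (1, 0), i.e. the 1-D indices 1 and 3, and the diagonal lengths over all d are 1, 2, 3, 2, 1.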

In each iteration, all the elements of that anti-diagonal must be computed in parallel.

Each anti-diagonal is calculated based on the values of the previous anti-diagonal.

So, how do I index the threads with this requirement in CUDA?
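
One possible indexing scheme, sketched under stated assumptions: launch one kernel per anti-diagonal and give each thread one cell of that diagonal. The sketch assumes row-major storage and uses a placeholder update rule (each cell adds the larger of its upper and left neighbours, which are the adjacent cells on the previous anti-diagonal); the real recurrence would go in its place. The names `diag_step` and `run_wavefront` are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Computes every cell of anti-diagonal d (cells with i + j == d).
// Placeholder recurrence (an assumption, not the real algorithm):
//   m[i][j] += max(m[i-1][j], m[i][j-1]), missing neighbours count as 0.
__global__ void diag_step(int* m, int n, int d) {
    int first_row = max(0, d - (n - 1));
    int len = min(d, n - 1) - first_row + 1;

    int k = blockIdx.x * blockDim.x + threadIdx.x;  // position along the diagonal
    if (k >= len) return;                           // surplus threads do nothing

    int i = first_row + k;  // row
    int j = d - i;          // column
    int up   = (i > 0) ? m[(i - 1) * n + j] : 0;
    int left = (j > 0) ? m[i * n + (j - 1)] : 0;
    m[i * n + j] += max(up, left);
}

// Host driver: one launch per anti-diagonal, top left to bottom right.
// Launches on the same stream run in order, so diagonal d only starts
// after diagonal d - 1 has finished.
void run_wavefront(std::vector<int>& host, int n) {
    int* dev = nullptr;
    size_t bytes = host.size() * sizeof(int);
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    for (int d = 0; d <= 2 * n - 2; ++d) {
        int len = std::min(d, n - 1) - std::max(0, d - (n - 1)) + 1;
        int blocks = (len + threads - 1) / threads;
        diag_step<<<blocks, threads>>>(dev, n, d);
    }

    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
}
```

The reason for one launch per diagonal is that `__syncthreads()` only synchronizes the threads of a single block, so a loop over diagonals inside one kernel is not safe once a diagonal spans more than one block. With the placeholder rule and a matrix filled with ones, cell (i, j) ends up holding i + j + 1, which makes the result easy to check for any N.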

Please let me know if anything is unclear. I implemented this in CUDA and the results are correct for N < 15, but with a larger N, say 1000, I get an error of 8%.

If you are interested, I can post my code, which works for small matrices.

Thanks…