I have an N x N square matrix of integers (stored on the device as a 1-D array for convenience).
I’m implementing an algorithm which requires the following to be performed:
There are 2N - 1 anti-diagonals in this square. (Anti-diagonals are the parallel lines running from the top edge to the left edge and from the right edge to the bottom edge.)
I need a for loop with 2N - 1 iterations, each iteration computing one anti-diagonal, starting from the top left and ending at the bottom right.
In each iteration, all the elements of that anti-diagonal must be computed in parallel.
Each anti-diagonal is calculated based on the values of the previous anti-diagonal.
So, how do I index the threads with this requirement in CUDA?
Please let me know if I’m not clear. I implemented this in CUDA and the results are correct if N < 15, but with an N value of, say, 1000, I’m getting an error value of 8%.
If you are interested, I can post my code, which works for small matrices.
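
To make the indexing question concrete, here is a minimal sketch of one possible mapping. It is not the code mentioned above. It assumes row-major storage (element (i, j) at index i * N + j) and launches one kernel per anti-diagonal d = 0 .. 2N - 2, where thread t handles row i = i_min + t and column j = d - i. The recurrence in cell_value and the names antidiag_kernel, wavefront_gpu, wavefront_cpu and MOD are hypothetical placeholders, since the real update rule is not given here. Launching once per anti-diagonal on the default stream means each launch sees the finished results of the previous one. That matters because __syncthreads() only synchronizes threads within a single block, so a single-kernel version can work for small N and fail once an anti-diagonal spans several blocks. Whether that is the cause of the 8% error cannot be told without the original code.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int MOD = 1000003;  // hypothetical modulus, keeps values within int range

// Placeholder recurrence (an assumption, not the original algorithm):
// each cell depends only on its upper and left neighbours, which both
// lie on the previous anti-diagonal. Out-of-range neighbours count as 0.
__host__ __device__ inline int cell_value(const int* a, int n, int i, int j) {
    int up   = (i > 0) ? a[(i - 1) * n + j] : 0;
    int left = (j > 0) ? a[i * n + (j - 1)] : 0;
    return (up + left + 1) % MOD;
}

// Computes every element of anti-diagonal d (all cells with i + j == d).
__global__ void antidiag_kernel(int* a, int n, int d) {
    int i_min = (d < n) ? 0 : d - (n - 1);  // first valid row on this diagonal
    int i_max = (d < n) ? d : n - 1;        // last valid row on this diagonal
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int i = i_min + t;
    if (i > i_max) return;                  // surplus threads in the last block
    int j = d - i;
    a[i * n + j] = cell_value(a, n, i, j);
}

// Fills an n x n matrix on the GPU, one kernel launch per anti-diagonal.
// Returns true if all CUDA calls succeeded.
bool wavefront_gpu(std::vector<int>& host, int n) {
    int* dev = nullptr;
    size_t bytes = static_cast<size_t>(n) * n * sizeof(int);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
    cudaMemset(dev, 0, bytes);

    const int threads = 256;
    for (int d = 0; d <= 2 * n - 2; ++d) {
        int len = (d < n) ? d + 1 : 2 * n - 1 - d;  // cells on diagonal d
        int blocks = (len + threads - 1) / threads;
        // Launches on the same stream run in order, so diagonal d - 1
        // is complete before diagonal d starts.
        antidiag_kernel<<<blocks, threads>>>(dev, n, d);
    }
    bool ok = (cudaGetLastError() == cudaSuccess);
    // cudaMemcpy waits for the preceding kernels before copying.
    ok = (cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) && ok;
    cudaFree(dev);
    return ok;
}

// Sequential reference: row-major order visits up/left neighbours first.
void wavefront_cpu(std::vector<int>& a, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            a[i * n + j] = cell_value(a.data(), n, i, j);
}
```

Comparing the output of wavefront_gpu against wavefront_cpu for both a small N and N = 1000 is one way to check whether a given mapping stays correct once an anti-diagonal no longer fits in a single block.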