Parallel anti-diagonal 'for' loop

Hi,

I have an N x N square matrix of integers (which is stored on the device as a 1-D array for convenience).

I’m implementing an algorithm which requires the following to be performed:

There are 2N anti-diagonals in this square (anti-diagonals are the parallel lines running from the top edge to the left edge and from the right edge to the bottom edge).

I need a for loop with 2N iterations, with each iteration computing one anti-diagonal, starting from the top left and ending at the bottom right.

In each iteration, all the elements in that anti-diagonal must be computed in parallel.

Each anti-diagonal is calculated based on the values of the previous anti-diagonal.

So, how do I index the threads with this requirement in CUDA?

Please let me know if I’m not being clear. I implemented this in CUDA and the results are correct if N < 15, but if I use an N value of, say, 1000, I get an error of about 8%.

If you are interested, I can post my code, which works for small matrices.

Thanks…

It is not quite clear to me what the desired access pattern is. Is it like in the following diagram (where the numbers denote elements visited in the same iteration):

//   0 1 2 3 4  ...
//   1 2 3 4 5  ...
//   2 3 4 5 6  ...
//   3 4 5 6 7  ...
//   4 5 6 7 8  ...
//   : : : : :

That uses 2N-1 steps though, not 2N as you state, so I am not sure this is the correct pattern. How are threads assigned to array elements? If you use an NxN thread block for example, where each thread handles one matrix element, with the matrix stored in shared memory, you could use the following to achieve the above access pattern (untested code):

int tx = threadIdx.x;
int ty = threadIdx.y;
for (int step = 0; step < 2*N-1; step++) {
    int i = tx;                  // row index of this thread's element
    int j = step - i;            // column index; i + j == step on this anti-diagonal
    if ((tx <= step) && (ty == j)) { /* do work on A(i,j) */ }  // tx <= step keeps j >= 0; ty == j keeps j < N
    __syncthreads();             // finish this diagonal before starting the next
}

Note that with a maximum of 1024 threads per thread block, such a scheme would be able to handle only up to N = 32 (likewise, you would run out of shared memory for much larger matrices, as there are at most 48 KB of shared memory available per thread block).
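As an aside, the valid rows on anti-diagonal `step` of an NxN matrix run from max(0, step-N+1) to min(step, N-1), with j = step - i. Here is a minimal host-side sketch (my own, untested, purely illustrative) that enumerates the elements of each anti-diagonal, which can be handy for checking the index math before moving it into a kernel:

#include <cstdio>
#include <algorithm>

int main(void)
{
    const int N = 5; // small example size
    for (int step = 0; step < 2 * N - 1; step++) {
        int ilo = std::max(0, step - N + 1); // first valid row on this diagonal
        int ihi = std::min(step, N - 1);     // last valid row on this diagonal
        printf("step %d:", step);
        for (int i = ilo; i <= ihi; i++) {
            int j = step - i;                // column follows from i + j == step
            printf(" (%d,%d)", i, j);
        }
        printf("\n");
    }
    return 0;
}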

Can you describe your current approach?

Thanks for the reply.
My approach is much the same as yours, but it is not working as expected.

I need the pattern below:

// 0 1 2 3 4 …
// 1 2 3 4 5 …
// 2 3 4 5 6 …
// 3 4 5 6 7 …
// 4 5 6 7 8 …
// : : : : :

In each iteration, all the elements in that anti-diagonal must be computed in parallel.
Each anti-diagonal is calculated based on the values of the previous anti-diagonal.

That means all the elements of the 3rd anti-diagonal (the entries labeled 2, 2, 2) have to run in parallel, and calculating those elements depends on the previous anti-diagonal (the entries labeled 1, 1).

For my requirement, the matrix size is > 10000.
Can you please share how the kernel code should look, and the kernel calling mechanism for such large data?

Please help me, this is a very urgent requirement.

As far as I understand, you want something like http://heather.cs.ucdavis.edu/~matloff/158/cudasw.pdf. At each iteration, the kernel is launched with a different number of threads.

Perhaps the above code by njuffa could be modified as follows:

// Round-up integer division: number of blocks needed to cover 'a' threads
int iDivUp(const int a, const int b) { return (a % b != 0) ? (a / b + 1) : (a / b); }

#define BLOCKSIZE 32

__global__ void antiparallel(float* d_A, int step, int N) {

	int i = threadIdx.x + blockIdx.x * blockDim.x;  // row index along the anti-diagonal
	int j = step - i;                               // column index; i + j == step

	if (i < N && j >= 0 && j < N) {
		/* do work on d_A[i*N+j] */
	}
}

for (int step = 0; step < 2*N-1; step++) {
	dim3 dimBlock(BLOCKSIZE);
	dim3 dimGrid(iDivUp(step + 1, dimBlock.x));  // diagonal 'step' has at most step+1 elements
	antiparallel<<<dimGrid, dimBlock>>>(d_A, step, N);
}

This code is not tested and is just a sketch of a possible solution (provided that I have not misunderstood your question). Furthermore, I do not know how efficient a solution like that would be, since some of the kernels will be launched with very few threads.

Actually, my requirement is parallel DTW (dynamic time warping) using CUDA.
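For completeness: the classic DTW recurrence, D(i,j) = cost(i,j) + min(D(i-1,j-1), D(i-1,j), D(i,j-1)), maps directly onto the anti-diagonal scheme above, since every element of diagonal step depends only on diagonals step-1 and step-2, which earlier launches on the same stream have already computed. Below is a minimal, untested sketch of how the kernel from the previous post might be specialized for DTW, reusing iDivUp and BLOCKSIZE from above; the kernel name dtw_diagonal, the arrays d_D, d_x, d_y, and the absolute-difference local cost are illustrative assumptions, not code from this thread.

// Untested sketch (illustrative names: dtw_diagonal, d_D, d_x, d_y): one launch
// per anti-diagonal, using the classic DTW recurrence with |x_i - y_j| as cost.
__global__ void dtw_diagonal(float* d_D, const float* d_x, const float* d_y,
                             int step, int N) {

	int k = threadIdx.x + blockIdx.x * blockDim.x;
	int i = max(0, step - N + 1) + k;   // row of the k-th element on this diagonal
	int j = step - i;                   // column; i + j == step

	if (i >= N || j < 0 || j >= N) return;  // guard threads outside the matrix

	float c = fabsf(d_x[i] - d_y[j]);   // local cost |x_i - y_j|
	if (i == 0 && j == 0) {
		d_D[0] = c;                     // top-left corner has no predecessors
	} else {
		float best = 3.402823466e38f;   // FLT_MAX
		if (i > 0 && j > 0) best = fminf(best, d_D[(i - 1) * N + (j - 1)]);  // diagonal step-2
		if (i > 0)          best = fminf(best, d_D[(i - 1) * N + j]);        // diagonal step-1
		if (j > 0)          best = fminf(best, d_D[i * N + (j - 1)]);        // diagonal step-1
		d_D[i * N + j] = c + best;
	}
}

// Host side: diagonal 'step' holds min(step+1, 2N-1-step, N) elements,
// so the grid can be sized exactly instead of launching step+1 threads.
for (int step = 0; step < 2 * N - 1; step++) {
	int len = step + 1;
	if (2 * N - 1 - step < len) len = 2 * N - 1 - step;
	if (N < len) len = N;
	dtw_diagonal<<<iDivUp(len, BLOCKSIZE), BLOCKSIZE>>>(d_D, d_x, d_y, step, N);
}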