# Parallel anti-diagonal 'for' loop?

Hi,

I have an N x N square matrix of integers (stored on the device as a 1-D array for convenience).

I’m implementing an algorithm which requires the following to be performed:

There are 2N − 1 anti-diagonals in this square. (Anti-diagonals are the parallel lines running from the top edge to the left edge and from the right edge to the bottom edge.)

I need a for loop with 2N − 1 iterations, each iteration computing one anti-diagonal, starting from the top left and ending at the bottom right.

In each iteration, all the elements in that anti-diagonal must be computed in parallel.

Each anti-diagonal is calculated based on the values of the previous anti-diagonal.
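Concretely, the mapping I'm after can be checked on the host first (plain C++; `diag_len` and `diag_row` are just my names for the helpers):

```cpp
#include <cassert>

// On anti-diagonal d (0 <= d <= 2N-2) of an N x N matrix, the elements
// are exactly the (i, j) with i + j == d.  Thread t on that diagonal
// maps to row diag_row(d, t, N) and column j = d - i.
int diag_len(int d, int N) {
    int first = (d < N) ? 0 : d - N + 1;   // first row on this diagonal
    int last  = (d < N) ? d : N - 1;       // last row on this diagonal
    return last - first + 1;
}

int diag_row(int d, int t, int N) {
    int first = (d < N) ? 0 : d - N + 1;
    return first + t;                      // column is then j = d - i
}
```

Summing `diag_len` over all 2N − 1 diagonals covers every element of the matrix exactly once.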

So, how do I index the threads with this requirement in CUDA?

Please let me know if I’m not being clear. I implemented this in CUDA and the results are correct if N < 15, but if I use an N value of, say, 1000, I’m getting about 8% error.

If you are interested I can post my code which works for small matrices.

Thanks…

Write code for normal row-wise processing and tilt your graphics card by 45 degrees when it starts executing your kernel!!! lol…

Hi…

Here is my code:

Say the matrix in question is an (N+1) x (N+1) matrix. So there are 2N − 1 anti-diagonals, leaving out the first and the last elements, which are single-element diagonals.

```
int index = blockDim.x * blockIdx.x + threadIdx.x;
int i, loopi, loopj;

// loop for the first N anti-diagonals (top edge to left edge)
for (i = 1; i <= N; i++)
{
    if (index % N == i % N && index <= i * (N + 1))
    {
        loopi = index / (sizeA + 1);
        loopj = index % (sizeA + 1);
        // some calculation using matrix[index - 1] and matrix[index - N - 1]
    }
}

// loop for the second N anti-diagonals (right edge to bottom edge)
for (i = 1; i <= N; i++)
{
    if (index % N == i % N && index > i * (N + 1))
    {
        loopi = index / (sizeA + 1);
        loopj = index % (sizeA + 1);
        // some calculation using matrix[index - 1] and matrix[index - N - 1]
    }
}
```

My problem is that the answers match for the first triangle (the top one), but for the second set of N anti-diagonals there are errors scattered here and there.

I’m unable to understand why only a few elements in the second set are wrong. Is there something wrong with having two for loops this way?

Is it a synchronization issue, or am I indexing the threads incorrectly?

Thanks

I’m thinking that such a requirement cannot be satisfied by a single kernel invocation. We need a separate kernel launch for each anti-diagonal…

But do multiple kernel invocations slow down the execution time?
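To make the dependency structure concrete, here is a host-side model of the sweep I want; the recurrence `M[i][j] = M[i-1][j] + M[i][j-1]` is just a stand-in for my real update. Each pass of the outer `d` loop plays the role of one kernel launch, and the iterations inside it are independent of each other:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Diagonal-by-diagonal sweep: the outer d loop stands in for one
// kernel launch per anti-diagonal; the inner i loop reads only
// diagonal d-1, so its iterations could all run in parallel.
std::vector<int> sweep_by_diagonals(int N) {
    std::vector<int> M(N * N, 0);
    for (int j = 0; j < N; j++) M[j] = 1;        // first-row boundary
    for (int i = 0; i < N; i++) M[i * N] = 1;    // first-column boundary
    for (int d = 2; d <= 2 * N - 2; d++) {       // one "launch" per diagonal
        int first = std::max(1, d - N + 1);      // interior rows on diagonal d
        int last  = std::min(d - 1, N - 1);
        for (int i = first; i <= last; i++) {    // independent iterations
            int j = d - i;
            M[i * N + j] = M[(i - 1) * N + j] + M[i * N + (j - 1)];
        }
    }
    return M;
}
```

Because diagonal d reads only diagonal d − 1, per-diagonal launches on a single stream are automatically in the right order; the cost is only the per-launch overhead.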

I got a 125x speedup on an 8800 GTX (1.35 GHz with 128 cores) against a 2.41 GHz AMD Athlon, with multiple kernel invocations.

My kernel is a financial algorithm with loads and loads of computation.

By the way, the “%” operator is a very costly operation – that’s what I recollect from someone’s (Wumpus?) post. Check out the CUDA manual as well.

So my first suggestion is to minimize its usage by caching the value in shared memory or a local variable!

The compiler may NOT do common-subexpression elimination – it’s possible… Check it out…
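For instance, in your loop the condition compares against `i % N`, but as `i` runs from 1 to N that value just cycles 1, 2, …, N−1, 0, so a wrapping counter gives the same sequence with no `%` in the loop body (a sketch; `cycle_without_modulo` is my name):

```cpp
#include <cassert>
#include <vector>

// Replace the per-iteration i % N with a counter that wraps,
// keeping the single expensive '%' (on index) outside the loop.
std::vector<int> cycle_without_modulo(int N) {
    std::vector<int> out;
    int im = 1;                      // tracks i % N in a register
    for (int i = 1; i <= N; i++) {
        out.push_back(im);           // use 'im' wherever you had i % N
        if (++im == N) im = 0;       // wrap instead of dividing
    }
    return out;
}
```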

Hi,

thanks for the replies…
My multiple-kernel-invocation code is working perfectly now,
but if I run it on large arrays like 1000 x 1000, I’m getting an ‘unspecified launch failure’ at the line where the output array is copied from device to host.

Can you please tell me what could possibly be the problem with larger arrays ?

Cheers…

Do you have a CUT_CHECK_ERROR after your kernel launch? The error means your kernel did not launch or execute correctly, and you only find out when you try to copy the data back. Reasons can be:

• too many threads requested for the number of registers per thread
• too many blocks requested
• overwriting data (out-of-bounds writes)
• hitting the 5-second watchdog limit
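Without the SDK macros, you can check right after the launch instead of discovering the failure at the memcpy. A self-contained sketch (the trivial `my_kernel` is just a stand-in; `cudaThreadSynchronize` is the runtime call of that era):

```cuda
#include <cstdio>

__global__ void my_kernel(int *data, int n) {    // trivial stand-in kernel
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) data[i] = i;
}

int main() {
    int n = 1000, *d_data;
    cudaMalloc(&d_data, n * sizeof(int));
    my_kernel<<<(n + 255) / 256, 256>>>(d_data, n);

    cudaError_t err = cudaGetLastError();        // catches launch errors
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    err = cudaThreadSynchronize();               // catches execution errors
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}
```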

thanks…

It’s not the 5-second limitation.

I want to use only a fraction of the threads in the last block, but in my kernel code I’m not putting any restriction on the thread indices to that effect.

So, is it possible that there is some overwriting of data or something?

I found my problem.
I’m getting that error whenever I have more than one block: my index exceeds the array size because of the excess threads spawned in the last block.

But if I add a restriction so that these threads don’t execute, then my final answer is wrong.

Where am I making a mistake?
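For reference, the guard I’m trying to apply is the usual ceil-division pattern: size the grid so it covers all `total` elements, then start the kernel with `if (index >= total) return;`. The grid-sizing part on the host (`blocks_for` is my name):

```cpp
#include <cassert>

// Enough blocks to cover 'total' elements; the spare threads in the
// last block are then skipped by an 'if (index >= total) return;'
// guard at the top of the kernel.
int blocks_for(int total, int threadsPerBlock) {
    return (total + threadsPerBlock - 1) / threadsPerBlock;
}
```

If adding that guard changes the final answer, the only explanation I can see is that the guarded-out threads were actually doing needed work, i.e. my index math maps some valid elements to indices at or beyond `total`.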

Thanks…