QR Factorization, Non-Synchronous Thread starts

takeshi · March 13, 2009, 6:05pm

Hi there,

So I’m attempting to implement a parallel version of a QR factorization (orthogonal matrix times upper triangular matrix) using Givens rotations, as they are more easily parallelized. For those unfamiliar with this method, it basically involves zeroing the lower triangular portion of a matrix one entry at a time, starting with the bottom left entry and working up the first column. This method can be parallelized because each zeroing operation only affects the row in which we are zeroing and the row above it. Thus, after we have zeroed the first two entries in the first column, a second thread can start on the second column, and so on.
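For readers who want the rotation written out, a single zeroing step can be sketched as below. The symbols $a$, $b$, $\rho$, $c$, $s$ and $G$ are notation introduced here for illustration: $a$ is the entry in the row above and $b$ is the entry being zeroed, both in the same column.

```latex
\[
\rho = \sqrt{a^{2} + b^{2}}, \qquad
c = \frac{a}{\rho}, \qquad
s = \frac{b}{\rho}, \qquad
G = \begin{pmatrix} c & s \\ -s & c \end{pmatrix}, \qquad
G \begin{pmatrix} a \\ b \end{pmatrix}
  = \begin{pmatrix} \rho \\ 0 \end{pmatrix}.
\]
```

Applying $G$ to the full pair of rows changes only those two rows, which is why rotations on disjoint row pairs can run at the same time.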

The issue is that the second thread must wait until the first has finished its first two operations; similarly, the third thread must wait until the second has finished its first two operations. In my feeble CUDA attempt I have tried something along the following lines:

[codebox]
__global__ void col_rotations(float *a, int lx, int ly)
{
    int idx = threadIdx.x;
    int i = 0;
    int j;

    // The idea here is that a thread with ID 0 is let through immediately, thread ID 1 must wait for 0 to __syncthreads() again
    while (i < idx)
    {
        __syncthreads();
        i++;
    }

    // Perform first two Givens rotations on column = idx.

    __syncthreads(); // I was hoping this would let the next thread start.

    // Finish Givens rotations for column.
}
[/codebox]

The idea here is that each thread is held up until the thread in front of it has finished its first two rotations. However, the weakness in this method is very apparent (assuming I understand how __syncthreads() works: all threads waiting at a __syncthreads() are blocked until every thread in the block has reached a __syncthreads()?).

Is there an easier way to set up non-synchronous starts for a problem of this type?

Any help is appreciated, and sorry for the long-winded question, just trying to make it clear!

jack · March 13, 2009, 7:16pm

Check out vvolkov’s implementation:

[url=“http://forums.nvidia.com/index.php?showtopic=89084&hl=”]http://forums.nvidia.com/index.php?showtopic=89084&hl=[/url]

I haven’t had a chance to look over his code yet, but I believe it is going to be incorporated into the next release of CUBLAS, so it’s probably the best implementation around.
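Separately from that implementation, the staggered start described in the question can be sketched without any thread skipping a barrier. `__syncthreads()` is only well defined when every thread in the block reaches it, so in this sketch all threads run the same step loop and a per-step predicate decides whether a thread does work. This is not vvolkov's code. It assumes that `a` is row-major with `ly` rows and `lx` columns, that `ly >= 2`, and that everything fits in one thread block; the host wrapper `givens_qr` is a hypothetical helper added for illustration.

```cuda
#include <algorithm>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Thread `col` zeroes column `col` from the bottom row up to the diagonal.
// Column `col` starts two steps after column `col - 1`, so at any step the
// active threads rotate disjoint row pairs. Assumes a single thread block.
__global__ void col_rotations(float *a, int lx, int ly)
{
    int col = threadIdx.x;
    int steps = ly + (int)blockDim.x - 2;   // step at which the last column is done

    for (int t = 0; t < steps; ++t) {
        int row = (ly - 1) - (t - 2 * col); // row whose entry is zeroed this step
        bool active = (col < lx) && (t >= 2 * col) && (row > col);

        if (active) {
            float x = a[(row - 1) * lx + col];
            float y = a[row * lx + col];
            float rho = hypotf(x, y);
            if (rho > 0.0f) {
                float cs = x / rho, sn = y / rho;
                // Columns left of `col` are already zero in both rows.
                for (int k = col; k < lx; ++k) {
                    float u = a[(row - 1) * lx + k];
                    float v = a[row * lx + k];
                    a[(row - 1) * lx + k] = cs * u + sn * v;
                    a[row * lx + k] = -sn * u + cs * v;
                }
            }
        }
        __syncthreads();                    // reached by every thread, every step
    }
}

// Hypothetical host wrapper: overwrites `a` (ly x lx, row-major) with R.
cudaError_t givens_qr(std::vector<float> &a, int lx, int ly)
{
    float *d_a = nullptr;
    size_t bytes = a.size() * sizeof(float);
    cudaError_t err = cudaMalloc(&d_a, bytes);
    if (err != cudaSuccess) return err;
    err = cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = std::min(lx, ly - 1); // one thread per column to clear
        col_rotations<<<1, threads>>>(d_a, lx, ly);
        err = cudaMemcpy(a.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_a);
    return err;
}
```

Calling `givens_qr(a, lx, ly)` on a host vector should leave the entries below the diagonal at zero up to rounding. The sketch only produces R (Q is not accumulated), and each thread walks its row pair serially, so it illustrates the scheduling rather than a tuned implementation.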
