QR Factorization, Non-Synchronous Thread Starts

Hi there,

So I’m attempting to implement a parallel version of a QR (orthogonal basis * upper triangular) factorization using Givens rotations, as they are more easily parallelized. For those unfamiliar with this method, it basically involves zeroing the lower triangular portion of a matrix one entry at a time, starting with the bottom left and iterating up the first column. This method can be parallelized because each zeroing operation only affects the row in which we are zeroing and the row above. Thus, after we have zeroed the first two entries in the first column, a second thread can start on the second column, and so on.
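To make a single zeroing step concrete, here is a rough sketch of one Givens rotation applied to two adjacent rows. The function name `givens_rotate_rows`, the row-major layout, and the parameter names are assumptions made purely for illustration.

```cuda
#include <cmath>

// Illustrative helper (hypothetical name): rotates rows `top` and `bot` of a
// row-major matrix with `n` columns so that a[bot][col] becomes zero.
// Only these two rows are modified, which is what makes the method parallel.
__host__ __device__ inline void givens_rotate_rows(float *a, int n,
                                                   int top, int bot, int col)
{
    float x = a[top * n + col];          // entry that is kept
    float y = a[bot * n + col];          // entry that is zeroed
    float r = hypotf(x, y);
    if (r == 0.0f) return;               // both entries already zero
    float c = x / r;                     // cosine of the rotation angle
    float s = y / r;                     // sine of the rotation angle

    // Apply [c s; -s c] to the two rows, from `col` to the last column.
    for (int k = col; k < n; ++k) {
        float u = a[top * n + k];
        float v = a[bot * n + k];
        a[top * n + k] =  c * u + s * v;
        a[bot * n + k] = -s * u + c * v;
    }
    a[bot * n + col] = 0.0f;             // remove rounding residue
}
```

Columns to the left of `col` are skipped on the assumption that both rows are already zero there, which holds if the columns are processed left to right.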

Now the issue is that the second thread must wait until the first has finished its first two operations; similarly, the third thread waits until the second has finished its first two operations. In my feeble CUDA attempt I have tried something along the following lines:


__global__ void col_rotations(float *a, int lx, int ly)
{
    int idx = threadIdx.x;
    int i = 0;
    int j;

    // The idea here is that the thread with ID 0 is let through immediately,
    // while thread ID 1 must wait for thread 0 to reach __syncthreads() again.
    while (i < idx)
    {
        __syncthreads();
        i++;
    }

    // Perform the first two Givens rotations on column = idx.

    __syncthreads(); // I was hoping this would let the next thread start.

    // Finish the Givens rotations for this column.
}

The idea here is that each thread is held up until the thread in front of it has finished its first two rotations. However, the weakness in this method is very apparent (assuming I understand how __syncthreads() works: all threads waiting at a __syncthreads() are blocked until every thread in the block has reached a __syncthreads()?).

Is there an easier way to setup non-synchronous starts for a problem of this type?

Any help is appreciated, and sorry for the long-winded question, just trying to make it clear!

Check out vvolkov’s implementation:


I haven’t had a chance to look over his code yet, but I believe it is going to be incorporated into the next release of CUBLAS, so it’s probably the best implementation around.
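As for the staggered start itself, one common pattern is to have every thread run the same step loop and hit the same __syncthreads() on every step, and to let each thread decide from the step counter whether it has any work yet. The following is only a minimal sketch of that idea, not vvolkov’s code: it assumes a row-major m x n matrix, a single block with one thread per column (so n <= 1024), and the names `rotate_rows`, `givens_wavefront` and `givens_qr_host` are made up for illustration. Error checking is left out.

```cuda
#include <cmath>
#include <cuda_runtime.h>

// Rotate rows `top` and `bot` so that a[bot][col] becomes zero.
// Columns left of `col` are assumed to be zero already in both rows.
__device__ void rotate_rows(float *a, int n, int top, int bot, int col)
{
    float x = a[top * n + col];
    float y = a[bot * n + col];
    float r = hypotf(x, y);
    if (r == 0.0f) return;               // nothing to eliminate
    float c = x / r, s = y / r;
    for (int k = col; k < n; ++k) {
        float u = a[top * n + k];
        float v = a[bot * n + k];
        a[top * n + k] =  c * u + s * v;
        a[bot * n + k] = -s * u + c * v;
    }
    a[bot * n + col] = 0.0f;             // remove rounding residue
}

// Thread j owns column j. It is idle until step 2*j, then moves up its
// column one row per step, so neighbouring threads never touch the same
// rows in the same step. The barrier sits outside the `if`, so every
// thread reaches it on every step.
__global__ void givens_wavefront(float *a, int m, int n)
{
    int j = threadIdx.x;
    int steps = m + n - 2;               // enough for the last column to finish
    for (int t = 0; t < steps; ++t) {
        int i = (m - 1) - (t - 2 * j);   // row whose entry a[i][j] is zeroed now
        if (j < n && t >= 2 * j && i > j)
            rotate_rows(a, n, i - 1, i, j);
        __syncthreads();                 // publish this step's rows to the block
    }
}

// Hypothetical host helper: overwrites h_a (m x n, row-major) with R.
void givens_qr_host(float *h_a, int m, int n)
{
    float *d_a = nullptr;
    size_t bytes = sizeof(float) * m * n;
    cudaMalloc(&d_a, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    givens_wavefront<<<1, n>>>(d_a, m, n);
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
}
```

The key difference from waiting inside a `while (i < idx)` loop is that no thread ever calls __syncthreads() a different number of times than the others; the delay is expressed through the `t >= 2 * j` test instead. The sketch only produces R and keeps one thread busy with a whole row pair, so it is meant to show the scheduling rather than to be fast.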