Bug in QR decomposition code help :(

Hi all,

I’m working on a CUDA program that computes the QR decomposition of an m-by-n matrix. Here are the kernel and helper functions:

// Kernel Function for Q matrix calculation

__global__ void subQ(float* A, float* Q, unsigned int* track, int m, int n, float* debug){

	register unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
	register unsigned int j = blockDim.y * blockIdx.y + threadIdx.y;
	register unsigned int k, l;

	if (i < m && j < n){

		// Base case for the first column
		if (j == 0) {
			Q[i * n] = A[i * n] / magn(A, 0, (m-1)*n, n);
		}
		// Algorithm for all subsequent columns
		else {
			// First assign the original value from the input matrix into Q
			Q[i * n + j] = A[i * n + j];

			// Iterate through each previous column and subtract the corresponding values,
			// since Wx = Vx - <Vx,Q1>Q1 - <Vx,Q2>Q2 ... - <Vx,Q(x-1)>Q(x-1)
			for (k = 0; k < j; ++k){

				// Busy-wait on the track matrix until all previous calculations are complete
				for (l = 0; l < m; ++l){
					while (track[l * n + k] != 2)
						;
				}

				if (j == 3 && k == 1)
					debug[i] = Q[i * n + k];

				// Perform the appropriate <Vx,Qy>Qy calculation and subtract it from Wx
				Q[i * n + j] -= (dotProduct(A, Q, j, (m-1)*n+j, k, (m-1)*n+k, n) * Q[i * n + k]);
			}

			// Change the track number to 1 to indicate that this element has finished its Wx calculation
			track[i * n + j] = 1;

			// Busy-wait until every element in the same column has completed its Wx calculation
			for (l = 0; l < m; ++l){
				while (track[l * n + j] != 1)
					;
			}

			// Final calculation for Qx, since Qx = Wx / ||Wx||
			Q[i * n + j] /= magn(Q, j, (m-1)*n+j, n);
		}

		// Change the track number to 2 to indicate that the calculations are complete
		track[i * n + j] = 2;
	}
}

// Helper function for dot product

__device__ float dotProduct(float* A, float* B, unsigned int a_beg, unsigned int a_end, unsigned int b_beg, unsigned int b_end, unsigned int indent){

	register float temp2 = 0;
	register unsigned int a = a_beg;
	register unsigned int b = b_beg;

	while (a <= a_end && b <= b_end){
		temp2 += A[a] * B[b];
		a += indent;
		b += indent;
	}

	return temp2;
}

// Helper function for magnitude calculations

__device__ float magn(float* A, unsigned int beg, unsigned int end, unsigned int indent){

	register float temp3 = 0;

	for (unsigned int b = beg; b <= end; b += indent)
		temp3 += A[b] * A[b];

	return sqrtf(temp3);
}

Since each column requires all previous columns to finish their Q calculations, I use a tracking matrix: each entry changes to 1 once the corresponding element has finished its W calculation, then to 2 once it has finished its Q calculation (Q = W / ||W||).

I tested the code using the example given on Wikipedia, where the input matrix is

12, -51, 4

6, 167, -68

-4, 24, -41

the resulting Q matrix should be

6/7 -69/175 -58/175

3/7 158/175 6/175

-2/7 6/35 -33/35

The first two columns of my result match the correct answer, but the third column does not. After some debugging, I found that the third column is using the values -69, 158, and 30 from the second column instead of the correct ones, i.e. the values before they were divided by the magnitude of the column. That doesn’t make sense to me, because the track matrix had already marked that index as ‘2’, meaning it had finished its calculations.

Any help :(?


You may need synchronization.

Are you doing householder reflections?

Have you thought about data locality in this implementation?

If I’m understanding correctly, you have allocated one thread per element? If you’re doing Householder, I would suggest having maybe total #threads = N_rows and letting these threads iterate from column zero to column N-1, with appropriate synchronization points added. These points might be consecutive kernel calls or __syncthreads() within the blocks.

This “Loop that utilizes the track matrix to wait until all previous calculations are complete” seems like risky business…

I’m using the Gram-Schmidt method, which seems to be the most straightforward one. I’ll read more into Householder reflections to see if that’s easier to program.

After further debugging, the loop over the track matrix seems to be the problem: when I comment out both while loops, I still get the same results :( So it seems the program isn’t waiting for the previous columns to finish their calculations. Does anyone have any idea why? Theoretically, wouldn’t the later threads get stuck in the while loop as long as the track matrix entry does not equal 2?

No. mianlu is correct that you would need synchronization. At the very least, you would need to define the tracking matrix as volatile, so that the compiler does not rearrange or optimize out the memory transactions. However, volatile is an outdated and insufficient mechanism as there are also no hardware guarantees about inter-block communication. So instead of using volatile, it would be better to insert actual memory barriers.

Note that with proper synchronization, you might run into deadlock situations. Even if you don’t, there is no guarantee in CUDA that this will work in the future, as there is no guarantee about the order in which blocks are executed. Within blocks, you can obtain some guarantees using __syncthreads().

All in all, attempts to introduce inter-block synchronization into CUDA should be avoided. It is just not a good match for the way GPUs work. Try to rearrange the code so that dependencies are kept local within the same thread or block.

So, assuming that all the operations are occurring within one block for now, I changed the while loop to

while (*((volatile unsigned int*)&track[…]) != 2)

and I added track[i * n + j] = 0 just after the first “if” statement.

However, now my computer seems to freeze up when I try to run the program, so the threads must be stuck in the while loop. I tried using __syncthreads(), but it didn’t really make any difference (my test matrix is only 3x3, and the threads are in the same warp, so they should be synchronized).

I can’t figure out why the latter two columns would be stuck in the loop, since the track number for the first column should be changed to 2 almost instantaneously. Any suggestions?

P.S. thanks for all the help so far, by the way!

P.P.S. Oh, tera, is it possible if you can explain “insert actual memory barriers” more? Thanks!

That is what I meant with ‘deadlocks’.
In this particular case, the only warp that is executing will first wait infinitely in the while loop, before any work could actually be done.

These kinds of ‘clever’ synchronization schemes are extremely hard (sometimes impossible) to get right. The only advice I can offer you is to give up on the idea of a track matrix. Reorganize the kernel so that dependent work is done within the same thread (or block, with proper __syncthreads() calls in between). If dependencies between blocks cannot be avoided, use multiple kernel invocations where blocks within each invocation are still independent.
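A rough, untested sketch of that reorganization for the single-block case (one thread per row, as mianlu suggested; assumes the kernel is launched with one block of at least m threads and the same row-major layout as the original code):

```cuda
// Sketch only: all per-column dependencies stay inside one block, so
// __syncthreads() replaces the spin-wait on the track matrix entirely.
__global__ void gramSchmidtBlock(const float* A, float* Q, int m, int n) {
    unsigned int i = threadIdx.x;  // row handled by this thread

    for (int j = 0; j < n; ++j) {
        // Wj starts as column j of A; each thread owns one row, so no sync yet
        if (i < m)
            Q[i * n + j] = A[i * n + j];

        // Subtract the projections onto the already-finished columns
        for (int k = 0; k < j; ++k) {
            float dot = 0.0f;  // <Vj,Qk>, recomputed redundantly by every thread
            for (int r = 0; r < m; ++r)
                dot += A[r * n + j] * Q[r * n + k];
            if (i < m)
                Q[i * n + j] -= dot * Q[i * n + k];
        }
        __syncthreads();  // every entry of Wj must be written before ||Wj|| is read

        // Qj = Wj / ||Wj||
        float norm = 0.0f;
        for (int r = 0; r < m; ++r)
            norm += Q[r * n + j] * Q[r * n + j];
        if (i < m)
            Q[i * n + j] /= sqrtf(norm);
        __syncthreads();  // column j must be final before column j+1 reads it
    }
}
```

Note that every thread in the block reaches both __syncthreads() calls (the i < m guard is only on the memory writes), which is required for __syncthreads() to be safe. A shared-memory reduction for the dot products would be the obvious next optimization, and multi-block matrices would need one kernel launch per column instead.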

For the matter of memory barriers, read Appendix B.5 “Memory Fence Functions” of the CUDA Programming Guide.
EDIT: These won’t solve the problem with deadlocks, though.