Loop confusion

I am doing some operations on a matrix and, depending on whether we are dealing with the lower half or the upper half, I apply a different formula. I am trying to traverse the matrix starting at the last column and moving toward the first. Unfortunately, my debug output shows that no matter what I do, I am moving from left to right across the matrix.

The code below just shows an example:


int rowctr = 30;
int row = rowctr * blockDim.x + threadIdx.x;

for (rowctr = 30; rowctr >= 0; rowctr--)
{
    row = rowctr * blockDim.x + threadIdx.x;
    if (blockIdx.x == rowctr && threadIdx.x < rowctr)
        printf("%d\t%d\t%d\n", rowctr, threadIdx.x, row);
}


This prints out the correct matrix index for each column of the matrix - however, instead of starting at column 30, it is starting at column 1. I've tried using __syncthreads() just to see if it was a bug in the device emulator, but nothing seems to print column 30 first.

Any thoughts?

FYI - I’m calling my kernel by:


solver<<<n, n>>>( /* device pointers */ ); // n = 30


I know this isn’t the most efficient use of CUDA, but I have data dependency issues, so I need to go through the matrix column by column.

Thanks for any help!

What you are seeing is the order the device emulator chooses to run your blocks in.

Why do you start different blocks and then do work in only one of them? Just loop over the work in one block only, and you will be able to determine the order of execution yourself.

I have a data dependency issue that requires me to operate on the matrix in a column by column fashion. I can’t just apply the equation to the entire matrix in one swoop. Essentially, I take a column of the matrix, multiply it by a vector, and use the results for the next column.

I’m not sure I know what you mean by just loop over the work in one block only…

Thanks for your reply. I appreciate it.

The GPU can execute blocks in any order.

You already have the loop over rowctr. Instead of scheduling 31 blocks and doing work only when rowctr == blockIdx.x, you could just as well schedule one block and remove that if-condition.
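A minimal sketch of that suggestion, with hypothetical kernel and pointer names (solver, d_mat, d_vec) and a placeholder for your actual formula: one block of n threads, with the column loop inside the kernel and __syncthreads() between columns so every thread sees the previous column's results before the next iteration.

```cuda
#include <cstdio>

// Hypothetical single-block version: one block of n threads,
// the loop over columns runs inside the kernel.
__global__ void solver(float *mat, float *vec, int n)
{
    for (int rowctr = n; rowctr >= 0; rowctr--)
    {
        int row = rowctr * blockDim.x + threadIdx.x;
        if (threadIdx.x < rowctr)
        {
            // apply your per-column formula here, e.g.
            // mat[row] *= vec[threadIdx.x];
            printf("%d\t%d\t%d\n", rowctr, threadIdx.x, row);
        }
        // barrier: the column's results must be visible
        // before any thread starts on the next column
        __syncthreads();
    }
}

// launched with a single block instead of n blocks:
// solver<<<1, n>>>(d_mat, d_vec, n);
```

This only works while one column's work fits in a single block; a block can hold at most 1024 threads on current hardware, and __syncthreads() only synchronizes threads within one block.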

This might require splitting your kernel into several. But that split cannot be avoided, as the order in which blocks are executed is undefined (as Lev already mentioned). The only inter-block synchronization is through starting new kernels.
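The kernel-per-column split might look like the following sketch (solve_column and the pointer names are placeholders, and the per-column formula is left as a comment): the host loop takes over the role of the rowctr loop, and each kernel launch acts as a barrier for the previous column.

```cuda
// Hypothetical kernel that processes a single column.
__global__ void solve_column(float *mat, float *vec, int col)
{
    int row = col * blockDim.x + threadIdx.x;
    if (threadIdx.x < col)
    {
        // per-column formula goes here, e.g.
        // mat[row] *= vec[threadIdx.x];
    }
}

// Host side: the loop over columns moves out of the kernel.
// Launches on the same stream execute in order, so column col+1
// is finished before work on column col begins.
void solve(float *d_mat, float *d_vec, int n)
{
    for (int col = n; col >= 0; col--)
        solve_column<<<1, n>>>(d_mat, d_vec, col);
}
```

The kernel-launch overhead is paid once per column, but that is the price of the data dependency: there is no defined inter-block execution order within one launch.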