I am doing some operations on a matrix, applying a different formula depending on whether I am in the lower or upper half. I want to traverse the matrix starting at the last column and moving back toward the first. Unfortunately, my debug output shows that no matter what I do, the traversal runs left to right across the matrix.
The code below just shows an example:
int rowctr = 30;
int row = rowctr * blockDim.x + threadIdx.x;
if(blockIdx.x == rowctr && threadIdx.x < rowctr)
printf("%d\t%d\t%d\n", rowctr, threadIdx.x, row);
This prints the correct matrix index for each column of the matrix - however, instead of starting at column 30, it starts at column 1. I've tried adding __syncthreads() just to see if it was a bug in the device emulator, but nothing gets column 30 to print first.
FYI - I'm launching my kernel with:
solver<<<n, n>>>(/* device pointers */); // n = 30
I know this isn’t the most efficient use of CUDA, but I have data dependency issues, so I need to go through the matrix column by column.
I have a data dependency issue that requires me to operate on the matrix column by column; I can't just apply the equation to the whole matrix in one sweep. Essentially, I take a column of the matrix, multiply it by a vector, and use the result for the next column.
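In plain host code the dependency looks roughly like this (A, v, and the elementwise per-column formula are placeholders for illustration, not my actual code):

```c
/* Hedged sketch: A, v, and the per-column formula (elementwise
 * column * vector) are invented names, not the real code.  The point is
 * the serial dependency: processing column c produces the vector that
 * column c-1 reads, so the columns cannot be done concurrently. */
#define N 3

void walk_columns(const float A[N][N], float v[N])
{
    float next[N];
    for (int c = N - 1; c >= 0; --c) {   /* right to left */
        for (int r = 0; r < N; ++r)
            next[r] = A[r][c] * v[r];    /* use column c ... */
        for (int r = 0; r < N; ++r)
            v[r] = next[r];              /* ... to feed column c-1 */
    }
}
```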
I’m not sure I know what you mean by just loop over the work in one block only…
You already have the loop over rowctr. Instead of scheduling 31 blocks and doing work only when rowctr == blockIdx.x, you might as well schedule a single block and drop that if-condition.
This might require splitting your kernel into several. But that split cannot be avoided, as the order in which blocks are executed is undefined (as Lev already mentioned). The only inter-block synchronization is through starting new kernels.
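Roughly what I mean, as a minimal sketch (A, v, dA, dv, and the per-column formula are assumed names, invented just to show the pattern): one block of n threads walks the columns right to left, and __syncthreads() enforces the column-to-column dependency that multiple blocks cannot, since their execution order is undefined.

```cuda
// Hedged sketch of the single-block approach; the formula is made up.
// Each thread reads entries of v that other threads are about to
// overwrite, which is exactly why the barriers are needed.
__global__ void solver(const float *A, float *v, int n)
{
    int r = threadIdx.x;                    // one thread per row
    for (int c = n - 1; c >= 0; --c) {      // right to left, in a defined order
        float result = 0.0f;
        for (int k = 0; k < n; ++k)         // reads entries of v owned
            result += A[r * n + c] * v[k];  // by other threads...
        __syncthreads();                    // ...so wait until all reads finish
        v[r] = result;                      // publish the value for column c-1
        __syncthreads();                    // make the new v visible to everyone
    }
}

// Launched with a single block, so __syncthreads() spans all n threads:
//   solver<<<1, n>>>(dA, dv, n);
```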