Block Divergence

Hey,

I am relatively new to CUDA. I need help making blocks diverge so that they do different things: in the kernel, I want one block to execute one particular set of instructions and the others to do something else.

Currently, I have 2 blocks. The number of threads per block is read from a file, so it varies between instances of the problem I am working on, but within a given run every block has the same number of threads.

So, the most obvious way for me was to use an if statement. This is what I do in my kernel:

__global__ void abc(args)
{
    if (blockIdx.x == 0)
    {
        i = threadIdx.x;
        ......
        ......
        .....
    }
    else if (blockIdx.x == 1)
    {
        i = threadIdx.x;
        .....
        ....
        ....
    }
}

Actually, I want the i in each branch to start from zero and increase, because ‘i’ will later be used to index different arrays. Once the kernel has checked blockIdx.x, will setting i = threadIdx.x give me i = 0, 1, 2, 3, …? And are the i’s in the if branch and in the else-if branch independent of each other, or does the else-if branch have to wait until the i’s from the if branch have finished? For this problem I have threadsPerBlock=16 and blocksPerGrid=2, and the kernel is called as abc<<<blocksPerGrid, threadsPerBlock>>>(args);

Is what I’m doing totally invalid? Any help is more than welcome.

threadIdx.x is the index of a thread within its block, so in every block the threads are numbered from 0 to (blockDim.x-1).
Your approach of branching on blockIdx.x to make blocks do different things is also perfectly fine: apart from the cost of that additional if statement, you won’t suffer any other penalty. You may also freely use __syncthreads() in each of the branches, because the whole block executes one branch or the other.
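A minimal sketch of that pattern, in case it helps. The names branchPerBlock, runBranchPerBlock, arrayA and arrayB are made up for illustration, and the work in each branch is a placeholder for the real calculation:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: block 0 fills arrayA, block 1 fills arrayB.
__global__ void branchPerBlock(int *arrayA, int *arrayB)
{
    // i is a per-thread local variable; it runs 0 .. blockDim.x-1 in every block.
    int i = threadIdx.x;

    if (blockIdx.x == 0) {
        arrayA[i] = i;          // placeholder for block 0's work
        __syncthreads();        // fine: every thread of block 0 reaches this
    } else if (blockIdx.x == 1) {
        arrayB[i] = 100 + i;    // placeholder for block 1's work
        __syncthreads();        // fine: every thread of block 1 reaches this
    }
}

// Hypothetical host helper: launches 2 blocks of 16 threads, as in the question,
// and copies both arrays back. hostA and hostB must hold 16 ints each.
bool runBranchPerBlock(int *hostA, int *hostB)
{
    const int threadsPerBlock = 16, blocksPerGrid = 2;
    const size_t bytes = threadsPerBlock * sizeof(int);
    int *devA = nullptr, *devB = nullptr;
    if (cudaMalloc(&devA, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&devB, bytes) != cudaSuccess) { cudaFree(devA); return false; }

    branchPerBlock<<<blocksPerGrid, threadsPerBlock>>>(devA, devB);

    bool ok = cudaMemcpy(hostA, devA, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
           && cudaMemcpy(hostB, devB, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(devA);
    cudaFree(devB);
    return ok;
}
```

Calling runBranchPerBlock(a, b) with two 16-element int arrays should leave a[k] == k and b[k] == 100 + k, which shows that i restarts from 0 in each block.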

Unfortunately, every block in a launch has the same number of threads. If a given problem needs fewer threads, you can set up a bool variable ‘threadWorking’ and use fewer threads, branching the excess ones out of the work. Keep in mind, though, that __syncthreads() has to be reached by all threads of the block.
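Roughly, the ‘threadWorking’ idea could look like the sketch below. The kernel name maskedWork, the helper runMaskedWork, the parameter needed and the 256-element shared buffer are all assumptions for illustration; the point is only that the barrier sits outside the if so the idle threads reach it too:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: only the first `needed` threads of each block do work.
// Assumes 1 <= needed <= blockDim.x <= 256.
__global__ void maskedWork(int *data, int needed)
{
    __shared__ int buf[256];

    int i = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    bool threadWorking = (i < needed);   // excess threads are branched out

    if (threadWorking) buf[i] = data[base + i];

    // Outside the if: working and idle threads all hit the barrier.
    __syncthreads();

    if (threadWorking) {
        int next = (i + 1) % needed;     // a neighbour's value, hence the barrier
        data[base + i] = buf[i] + buf[next];
    }
}

// Hypothetical host helper: runs one block of `threads` threads over `host` in place.
bool runMaskedWork(int *host, int threads, int needed)
{
    const size_t bytes = threads * sizeof(int);
    int *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

    maskedWork<<<1, threads>>>(dev, needed);

    bool ok = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);
    return ok;
}
```

Putting __syncthreads() inside the if (threadWorking) body instead would mean some threads of the block never reach it, which is exactly the case to avoid.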

What is i in your code? Is it a local or global variable?

i in my code is a local variable. Once it has been set to threadIdx.x, I use it to index various arrays of structures which are passed to the kernel. But I’m not able to get anything working. Typically, I use ‘i’ to index more than one array of structures, and I want each thread to go to exactly its own position. For example, I want thread 0 of block 0 to access the 0th element of one array, do some calculation, compare the result to the 0th element of another array, and do some other calculation. Similarly, I want thread 1 of block 0 to do the same with the 1st element of each array.

And I want to mimic this behaviour for the second block as well. The blocks work on different arrays, but for their calculations they both read and update one common array. I know race conditions are bound to occur because of this, but my program does not have stringent accuracy constraints.
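If those races ever do turn out to matter, one option (not something the code above is known to use) is to make the update of the common array atomic. A minimal sketch, assuming the common array holds ints and each thread simply adds a value to element i; updateCommon, runUpdateCommon, arrayA, arrayB and common are hypothetical names:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: block 0 reads arrayA, block 1 reads arrayB,
// and thread i of both blocks updates the same element common[i].
__global__ void updateCommon(const int *arrayA, const int *arrayB, int *common)
{
    int i = threadIdx.x;
    int contribution = (blockIdx.x == 0) ? arrayA[i] : arrayB[i];

    // A plain `common[i] += contribution;` can lose one of the two updates.
    // atomicAdd makes the read-modify-write indivisible.
    atomicAdd(&common[i], contribution);
}

// Hypothetical host helper: 2 blocks of `threads` threads; hostCommon is updated in place.
bool runUpdateCommon(const int *hostA, const int *hostB, int *hostCommon, int threads)
{
    const size_t bytes = threads * sizeof(int);
    int *devA = nullptr, *devB = nullptr, *devCommon = nullptr;
    bool ok = cudaMalloc(&devA, bytes) == cudaSuccess
           && cudaMalloc(&devB, bytes) == cudaSuccess
           && cudaMalloc(&devCommon, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(devA, hostA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(devB, hostB, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(devCommon, hostCommon, bytes, cudaMemcpyHostToDevice);
        updateCommon<<<2, threads>>>(devA, devB, devCommon);
        ok = cudaMemcpy(hostCommon, devCommon, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    // cudaFree on a null pointer performs no operation, so this is safe on failure too.
    cudaFree(devA);
    cudaFree(devB);
    cudaFree(devCommon);
    return ok;
}
```

Atomics serialize conflicting updates, so they cost some speed; whether that trade is worth it depends on how much inaccuracy the program can really tolerate.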

I have no idea what I am doing wrong here…

In theory, what you describe should work. There must be a bug somewhere…
If the code is not too long, maybe you could post it here?

The code is approx. 400 lines long. Should I post it?

Maybe try to reduce its size first and isolate the part that does not seem to work.
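One possible starting point for such a reduction is sketched below: a kernel that keeps only the indexing, plus an explicit error check after the launch. The names minimalRepro, kernelSucceeded and runMinimalRepro are hypothetical, and the kernel body is a stand-in, not the original 400-line code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stripped-down kernel: only the indexing, no real computation.
__global__ void minimalRepro(int *out)
{
    int i = threadIdx.x;
    out[blockIdx.x * blockDim.x + i] = i;
}

// Hypothetical helper: true if the most recent kernel launched and ran without error.
bool kernelSucceeded(const char *label)
{
    cudaError_t launchErr = cudaGetLastError();       // e.g. invalid launch configuration
    cudaError_t runErr = cudaDeviceSynchronize();     // errors raised while the kernel ran
    if (launchErr != cudaSuccess)
        fprintf(stderr, "%s: launch failed: %s\n", label, cudaGetErrorString(launchErr));
    if (runErr != cudaSuccess)
        fprintf(stderr, "%s: execution failed: %s\n", label, cudaGetErrorString(runErr));
    return launchErr == cudaSuccess && runErr == cudaSuccess;
}

// Hypothetical host helper: host must hold blocks * threads ints.
bool runMinimalRepro(int *host, int blocks, int threads)
{
    const size_t bytes = (size_t)blocks * threads * sizeof(int);
    int *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    minimalRepro<<<blocks, threads>>>(dev);
    bool ok = kernelSucceeded("minimalRepro")
           && cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);
    return ok;
}
```

Once this small version behaves as expected, pieces of the real calculation can be added back one at a time until the failure reappears.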