divergent codepath

Hello, when I run the following kernel, the CUDA profiler reports that there’s a lot of divergence in the threads, and so my instruction throughput is only 3% of the theoretical max.

__global__ void calcExG(unsigned char *input, float *exG, float *exGInc)


	int i=blockIdx.x*CUDAGRNUMTHREAD+threadIdx.x;



where “input” is an array copied from the host to the GPU, and holds boolean values (0s and 1s only). The values are fairly random, and there is no way I can organize them to be less random. So my question is, why is this code divergent? I assume the divergence happens at


because of the random values in “input”. Why would that code path diverge, and is there a way to make it non-divergent? Thanks.

There should be no divergence in the code you show. Moreover, memory coalescing should be perfect on compute capability >= 1.2 devices if CUDAGRNUMTHREAD == blockDim.x. (The unsigned char reads will not coalesce on compute capability 1.1 and earlier due to memory controller limitations.) The random contents of input don’t matter because you are reading them in thread order and just performing a multiplication.
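To make the distinction concrete, here is a minimal sketch, not the original kernel (its body isn't shown in the thread). The names scaleNoBranch, scaleWithBranch, runScale, and the scale parameter are all hypothetical. The first kernel uses input[i] purely as data, so every thread in a warp executes the same instructions. The second branches on input[i], which is the kind of code where random 0/1 values could split a warp.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical illustration: input[i] is used only as a multiplicand,
// so control flow does not depend on its value.
__global__ void scaleNoBranch(const unsigned char *input, float *out, float scale, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // bounds check: can only split the last warp
        out[i] = input[i] * scale;
}

// Hypothetical illustration: control flow depends on input[i], so threads
// of one warp may take different paths when the values are mixed.
__global__ void scaleWithBranch(const unsigned char *input, float *out, float scale, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (input[i])
            out[i] = scale;
        else
            out[i] = 0.0f;
    }
}

// Hypothetical host helper: copies the input to the device, runs one of
// the two kernels, and returns the result. Error checking is omitted.
std::vector<float> runScale(const std::vector<unsigned char> &in, float scale, bool branchy)
{
    int n = static_cast<int>(in.size());
    std::vector<float> out(n, 0.0f);
    if (n == 0) return out;

    unsigned char *dIn = nullptr;
    float *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(unsigned char));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, in.data(), n * sizeof(unsigned char), cudaMemcpyHostToDevice);

    int threads = 256;                          // assumed block size
    int blocks = (n + threads - 1) / threads;   // round up to cover all n elements
    if (branchy)
        scaleWithBranch<<<blocks, threads>>>(dIn, dOut, scale, n);
    else
        scaleNoBranch<<<blocks, threads>>>(dIn, dOut, scale, n);

    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

Both kernels produce the same output for 0/1 input; only the control flow differs. Note that a compiler may turn a branch as short as the one in scaleWithBranch into predicated instructions, in which case it might not show up as divergence in the profiler either.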

Which profiler counter are you referring to exactly? “divergent_branch”?

Yeah, divergent_branch was the counter, and CUDAGRNUMTHREAD should equal blockDim.x, since my kernel call is the following:


Speaking of which, I should be using blockDim.x anyway.

I’m going to double-check the profiler results; maybe the fault lies in another kernel. Thanks for the help.