__device__ functions create divergent branches


My kernel code uses a device function to calculate Gaussian coefficients. Profiling the kernel, I can see that it produces an instruction replay overhead of about 20% due to divergent branches. When I remove the device function (by manually inlining it), the kernel does not have any divergent branches. Why is that?

How are device functions actually called? Why could that lead to any branches?


It would help if we could see the code for the device function. Two potential reasons (off the top of my head, there may be others):

(1) the device function contains branches (if-then-else, loops, etc.)
(2) the device function calls a math function that contains branches


Hm, I see that you state that manually inlining the function removed the divergence. That would suggest that there may be a branch upstream that cannot be removed as long as there is a function call, but after inlining the compiler is able to if-convert that branch, e.g. by using predication. Without having compilable code one can only speculate. You could disassemble the binary with cuobjdump to see where branches are being used in the generated code.

What happens if you use forceinline on the device function, instead of manually inlining it?
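That is, something along these lines (a sketch; __getGaussCoeffs stands in for your actual device function):

```cuda
// __forceinline__ instructs the compiler to inline the device function
// at each call site, mimicking the manual inlining described above.
__device__ __forceinline__ float __getGaussCoeffs( int x, float deviation )
{
	return expf(-0.5f * x * x / deviation / deviation);
}
```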

I apologize for my late reply.

The device function is incredibly simple:

__device__ float __getGaussCoeffs( int x, float deviation )
{
	return expf(-0.5f * x * x / deviation / deviation);
}

The Kernel code contains a for loop whose number of iterations is constant, and an if conditional whose condition is

y >= 0 && y < height && x - kernel_radius >= 0 && x + kernel_radius < width

where x and y are the coordinates of the thread in relation to the grid.

I’ve not been able to profile the kernel with forceinline yet, but when executing the code, forceinline does not affect the runtime.

Looking at the disassembly of my PTX code, how can I detect a divergent branch?

Does expf result in a branch? If so, can I take it as a rule of thumb to avoid expf (or branching intrinsics in general) within device functions?



PS: In case it is of interest, I’m using CUDA 4.1 on cards with compute capability 1.0, 1.3 and 2.0

Do you see the same behavior with CUDA 4.0?

Yes, I did.

The findings from the original post were based on 4.0.

Performance side note. I would suggest replacing the dual division

return expf(-0.5f * x * x / deviation / deviation);

with a single division by the square of "deviation", unless the square overflows the single-precision range:

return expf(-0.5f * x * x / (deviation * deviation));

The compiler cannot make that transformation automatically, as it is not an identity transformation (e.g. consider x = 1e19, deviation = 1e25).
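A small host-side sketch of why the two forms differ for the extreme values mentioned above (x = 1e19, deviation = 1e25):

```c
/* two chained divisions, as written in the original code */
float two_divs(float x, float dev) { return x * x / dev / dev; }

/* single division by the squared deviation */
float one_div(float x, float dev) { return x * x / (dev * dev); }
```

For these inputs, two_divs returns about 1e-12 (x * x = 1e38 still fits in single precision), while in one_div the product dev * dev overflows to infinity and the result is 0.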

Without seeing the complete code, and without knowing how that code is being compiled, I can only speculate where possibly divergent branches exist in the code. In particular, one would have to study the machine code by looking at the disassembly produced by cuobjdump. A non-uniform branch can diverge; whether it will in fact diverge depends on the data encountered by the branch. In the code snippet shown, the comparisons that involve kernel_radius look like data-dependent branches that are potentially divergent.
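As a sketch of that workflow (exact flag spellings may vary between CUDA versions), dump the machine code of the compiled binary and look for branch instructions:

```
cuobjdump -sass my_app
```

In the SASS listing, BRA instructions are branches, and prefixes such as @P0 / @!P0 indicate predicated execution; branches whose condition can differ between the threads of a warp are the potentially divergent ones.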

If I remember correctly, expf() is a branchless code sequence at the machine code level across all architectures. For sm_2x and higher, the default (IEEE-754 compliant) single-precision division is a called subroutine that contains a branch to separate out special cases. That data-dependent branch will be divergent if both regular and special cases are encountered within the same warp. Special cases involve operands at the very ends of the exponent range, i.e. close to underflow and close to overflow, zero, infinity, NaN.
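As a side note: if profiling confirms that the division subroutine is the source of the divergence, one possible workaround (a sketch only, with an important accuracy caveat) is the fast division intrinsic __fdividef(), which compiles to a short approximate code sequence without the special-case branch:

```cuda
__device__ float __getGaussCoeffsApprox( int x, float deviation )
{
	// __fdividef() trades the special-case handling of IEEE-754
	// compliant division for a short approximate code sequence; it
	// delivers wrong results for divisors of very large magnitude.
	return expf(__fdividef(-0.5f * x * x, deviation * deviation));
}
```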

With the SIMT architecture of the GPU, possible divergence is a fact of life, and the compiler typically handles it appropriately via a combination of predication, select-type instructions, uniform branches, and regular branches. Trying to avoid divergence does not necessarily improve performance. For example, avoiding a divergent if-then-else by computing both paths and selecting the result at the end with a ternary operator means that all threads incur the cost of executing both sides, so no performance is gained.
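A host-side sketch of that trade-off (path_a and path_b are hypothetical stand-ins for two non-trivial computations):

```c
/* hypothetical stand-ins for the two sides of an if-then-else */
float path_a(float x) { return 1.0f / (1.0f + x); }
float path_b(float x) { return 1.0f / (1.0f - x); }

/* branchy form: each thread executes only one side */
float branchy(float x) { return (x > 0.0f) ? path_a(x) : path_b(x); }

/* "branch-free" form: both sides are computed unconditionally and one
   result is selected at the end; every thread pays for both paths */
float selecty(float x) {
	float a = path_a(x);
	float b = path_b(x);
	return (x > 0.0f) ? a : b;
}
```

Both forms return the same result; the difference lies purely in how much work each thread performs.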

Well, here is my complete kernel code:

__global__ void filter(const int width, const int height, unsigned char * image,
                       const size_t image_pitch, unsigned char * result, size_t result_pitch)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;

	float l_sum1 = 0.0f,
	      l_sum2 = 0.0f,
	      l_f_blur;

	if (y >= 0 && y < height && x - KERNEL_RADIUS >= 0 && x + KERNEL_RADIUS < width)
	{
		for (int dx = x - KERNEL_RADIUS; dx <= x + KERNEL_RADIUS; dx++)
		{
			l_f_blur = __getGaussCoeffs(x - dx, SPATIAL_DEVIATION)
			         * __getGaussCoeffs(image[INDEX(x,y,image_pitch)] - image[INDEX(dx,y,image_pitch)], INTENSITY_DEVIATION);
			l_sum1 += image[INDEX(dx,y,image_pitch)] * l_f_blur;
			l_sum2 += l_f_blur;
		}
		result[ INDEX(x,y,result_pitch) ] = ((int)fabs( l_sum1 / l_sum2));
	}
}

SPATIAL_DEVIATION, INTENSITY_DEVIATION and KERNEL_RADIUS are macros that contain compile-time constant values; INDEX is a macro that maps a block-relative thread index to a grid-relative index.
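Since x and y are already grid-relative here, one common form for such a macro on pitched, one-byte-per-pixel images would be (hypothetical; the actual definition is not shown in this thread):

```c
/* hypothetical definition: row-major indexing into pitched memory,
   one byte per pixel (unsigned char image) */
#define INDEX(x, y, pitch) ((y) * (pitch) + (x))
```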

nvcc arguments are: -gencode arch=compute_20,code="sm_20,compute_20" --ptxas-options=-v