Branch / Divergent branches


I would like to know why this code give me 4 branches and 1 divergent branch, when I analyse it with Visual Profiler…?

#include <stdio.h>

#include <stdlib.h>

__global__ void teste(float *global_mem) {

	if ( threadIdx.x < 20 ) {

		float my_reg = global_mem[threadIdx.x];

		global_mem[threadIdx.x] = my_reg + 1;


	else {

		float my_reg = global_mem[20 - threadIdx.x/20];

		global_mem[20 - threadIdx.x/20] = my_reg + 100;



int main(void) {

	float *device_g_mem;

	float *host_g_mem;

	int tam_mem = sizeof(float) * 20;

	host_g_mem = (float *) malloc(tam_mem);

	cudaMalloc((void **)&device_g_mem, tam_mem);

	for (int i=0; i<20; i++) host_g_mem[i] = i;

	cudaMemcpy(device_g_mem, host_g_mem, tam_mem, cudaMemcpyHostToDevice);

	teste<<<1, 64>>>(device_g_mem);

	return 0;


I’ll assume these are normalized results you’re talking about (4 branches, 1 divergent).

Thread divergence on 1.x devices is determined on a half-warp level (because all threads of a half-warp execute the same instructions - as to why this is the case, read up on the programming manual (to do with instruction scheduling frequency / clock frequency of processors / etc)) - as such the profiler (when it normalizes the counters) reports 1 of these for every half-warp that hits one.

On 1.x devices a warp is 32 threads, thus a half-wrap is 16 threads.

You’re executing 64 threads in this kernel, so 2 warps (4 half-warps).

Your branch is your if/else statement (if(threadIdx.x < 20) { … } else { … }), this is considered a branch no matter what (even if all threads take the same path, divergent branches are where threads take different paths) - and according to your code all threads will end up evaluating this branch… this is where your ‘4 branches’ in the profiler comes from.

The 1 divergent branch is from half-warp 2 (threads 16-32), where threads 16-19 will enter the ‘if(…)’ statement while threads 20-31 will enter your ‘else’ statement.

As I said before, this is a divergent branch because threads of a half-warp have to execute the same code, thus for a half-warp to execute any branch - the instructions have to be serialized - s.t. each set of threads in the divergent branch takes turns executing their part of the branch - and then resume together outside the branch once each set of threads has had it’s turn… (if that makes sense).

Thanks for your didatic post, Smokey!

… by the way, do you know some text about cuda visual profiler if possible with some examples?

Your best bet to understand the visual profiler is to read the Help -> “CUDA Visual Profiler Help” (F1) in the visual profiler

(cudaprof/doc/cudaprof.html under your CUDA Toolkit directory).

And reading the CUDA Programming Guide / these forums (many people have asked similar questions, nVidia and other people in the know have responded with some useful bits of info that aren’t documented anywhere else).