Hi,
I have a toy kernel that contains only one branch, and I want to use it to understand how branching affects performance.

The kernel is as follows:
__global__ void S2_3(…)
{
    unsigned int rid = blockIdx.x * blockDim.x + threadIdx.x;
    if (rid < 470400) {
        …
    }
}
The grid size is 3676 and the block size is 128 (that is, 3676*128 = 470528 threads are launched).
Since 470400 is a multiple of 128 (and therefore of the warp size, 32), no warp should straddle the boundary, so there should not be any divergent branches at all.
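To double-check that reasoning, here is a small host-side C++ sketch (not the kernel itself). It assumes a warp is 32 consecutive threads of a block, and that a warp diverges on `rid < bound` only when some of its threads pass the test and others fail it. The names `kWarpSize` and `divergent_warps` are made up for illustration.

```cpp
#include <cstdint>

// Assumption: a warp is 32 consecutive threads within a block.
constexpr std::uint64_t kWarpSize = 32;

// Hypothetical helper: counts the warps in which `rid < bound` is true for
// some threads and false for others, for a 1D launch of grid_size blocks
// with block_size threads each.
std::uint64_t divergent_warps(std::uint64_t bound,
                              std::uint64_t block_size,
                              std::uint64_t grid_size) {
    std::uint64_t count = 0;
    const std::uint64_t warps_per_block =
        (block_size + kWarpSize - 1) / kWarpSize;
    for (std::uint64_t b = 0; b < grid_size; ++b) {
        for (std::uint64_t w = 0; w < warps_per_block; ++w) {
            // Global index of the first thread in this warp.
            const std::uint64_t first = b * block_size + w * kWarpSize;
            // The last warp of a block may be partially filled.
            std::uint64_t lanes = block_size - w * kWarpSize;
            if (lanes > kWarpSize) lanes = kWarpSize;
            const std::uint64_t last = first + lanes - 1;
            // The warp diverges only if the bound falls strictly inside it.
            if (first < bound && last >= bound) ++count;
        }
    }
    return count;
}
```

For the launch above, `divergent_warps(470400, 128, 3676)` returns 0, whereas a bound of 470401 would give 1.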
However, when I profile it with cudaprof, the result is weird. When the sm_cta_launched and cta_launched counters are turned off,
the profiler reports the expected divergent branch count (zero). But when I turn on sm_cta_launched and cta_launched, things become strange:
sm_cta_launched is 611, cta_launched is 1848, and the divergent branch count is 611. Where does the non-zero divergent branch count come from?

The device is a GT 220, which has compute capability 1.2 and 6 multiprocessors.

I read the cudaprof documentation, which says that cta_launched is the number of thread blocks launched on a Texture Processing Cluster (TPC). But under compute capability 1.2 there are two multiprocessors per TPC, so I would expect cta_launched to be about 611*2 = 1222, which does not match the 1848 reported by the profiler. I think I am misunderstanding something here. Can anyone give me some help? Thank you.
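For reference, here is the arithmetic behind that comparison as a host-side C++ sketch. The only inputs are the numbers quoted above. `kSmPerTpcAssumed` encodes my assumption of two multiprocessors per TPC, and the helper names are invented for illustration.

```cpp
// Numbers taken from the launch configuration and profiler run above.
constexpr int kGridSize = 3676;       // thread blocks launched
constexpr int kNumSm = 6;             // multiprocessors on the GT 220
constexpr int kSmCtaLaunched = 611;   // profiler: sm_cta_launched
constexpr int kCtaLaunched = 1848;    // profiler: cta_launched
constexpr int kSmPerTpcAssumed = 2;   // ASSUMPTION: multiprocessors per TPC

// cta_launched I would expect if one TPC held sm_per_tpc multiprocessors,
// each launching sm_cta blocks.
int expected_cta_per_tpc(int sm_cta, int sm_per_tpc) {
    return sm_cta * sm_per_tpc;
}

// Ratio the profiler actually reports between the two counters.
double observed_ratio(int cta, int sm_cta) {
    return static_cast<double>(cta) / sm_cta;
}

// Blocks per multiprocessor if the grid were spread perfectly evenly.
double blocks_per_sm_if_even(int grid, int num_sm) {
    return static_cast<double>(grid) / num_sm;
}
```

With these inputs, `expected_cta_per_tpc(kSmCtaLaunched, kSmPerTpcAssumed)` is 1222. An even spread over 6 multiprocessors would be about 612.7 blocks each, which is close to the reported 611. The observed ratio `kCtaLaunched / kSmCtaLaunched` comes out near 3 rather than 2, which is the mismatch I am asking about.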