Clarification on Warp Divergence, and what is cta launch?

chrismc · February 11, 2009, 11:15am

Using cudaprof has helped me optimise my code, mostly by identifying uncoalesced reads and writes so I can turn them into coalesced ones. But there are a couple of counters reported by cudaprof that I would like help with.

Warp divergence: as I understand it from reading the programming manual, if a kernel contains a data-dependent condition, e.g. an if-else statement, then a warp executing it runs the two branches serially. Is that correct, and how can it be avoided? For example, I am working with particles, and I have real and virtual particles which are set up and treated differently. In some kernels I have if (threadid < NREAL) { } else { /* virtual */ }, so I guess I could use two separate kernels for these, but can I execute them concurrently? In another kernel, though, a real particle calculates a viscous term conditional on the value of a function of its velocity and position, and I can’t see how to get around that one.
What is a cta launch?

MxAddict · February 11, 2009, 12:48pm

Divergence only matters on a per-warp basis,
so if you do something like this

int tid = threadIdx.x;

if (tid < 32)
{
    do something;
}
else
{
    do something else;
}

there is no divergence (assuming a one-dimensional block and a warp size of 32),
but if you do something like this

if (tid < 10)
{
    do something;
}
else
{
    do something else;
}

there is divergence.
To sum up, branches that fall on warp boundaries are divergence-free; others are not :)
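
The rule above can be checked with a small sketch. It is not from the original posts: the names `warpSplits`, `markSplitWarps` and `countSplitWarps` are hypothetical, the block is assumed to be one-dimensional with a size that is a multiple of 32, and `__ballot_sync` requires CUDA 9 or newer (it did not exist when this thread was written). The host helper predicts which warps split on `tid < threshold`; the kernel measures the same thing on the device.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;  // assumed warp size

// Predicts whether the branch `tid < threshold` splits a given warp of a
// 1D block, i.e. some lanes take the `if` path and others the `else` path.
__host__ __device__ inline bool warpSplits(int warpIndex, int threshold)
{
    int first = warpIndex * kWarpSize;   // first thread index in the warp
    int last  = first + kWarpSize - 1;   // last thread index in the warp
    return first < threshold && threshold <= last;
}

// Measures the same thing on the device: each warp votes on the condition,
// and lane 0 records whether the vote was mixed.
__global__ void markSplitWarps(int threshold, int *splitFlags)
{
    int tid  = threadIdx.x;
    int warp = tid / warpSize;

    // Full mask: assumes blockDim.x is a multiple of 32.
    unsigned ballot = __ballot_sync(0xffffffffu, tid < threshold);

    if (tid % warpSize == 0)
        splitFlags[warp] = (ballot != 0u && ballot != 0xffffffffu) ? 1 : 0;
}

// Hypothetical host wrapper: launches one block of `warpsPerBlock` warps
// (at most 32) and returns the number of split warps, or -1 on a CUDA error.
int countSplitWarps(int threshold, int warpsPerBlock)
{
    int *dFlags = nullptr;
    size_t bytes = warpsPerBlock * sizeof(int);
    if (cudaMalloc(&dFlags, bytes) != cudaSuccess) return -1;

    markSplitWarps<<<1, warpsPerBlock * kWarpSize>>>(threshold, dFlags);

    std::vector<int> hFlags(warpsPerBlock, 0);
    cudaError_t err = cudaMemcpy(hFlags.data(), dFlags, bytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(dFlags);
    if (err != cudaSuccess) return -1;

    int count = 0;
    for (int f : hFlags) count += f;
    return count;
}
```

With two warps per block, `countSplitWarps(32, 2)` should report no split warps and `countSplitWarps(10, 2)` should report one, matching the two cases above.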

Ailleur · February 11, 2009, 12:50pm

A warp only diverges if threads within that same warp take different branches. So make NREAL a multiple of the warp size and you won’t (shouldn’t) have any problem.
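
A minimal sketch of that idea, under stated assumptions: the real particles are padded up to the next multiple of the warp size so that the virtual particles start on a warp boundary. The names `padToWarp`, `updateParticles` and `launchUpdate`, the array layout, and the placeholder updates are all hypothetical and not from the original posts.

```cuda
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;  // assumed warp size

// Rounds n up to the next multiple of the warp size.
__host__ __device__ inline int padToWarp(int n)
{
    return ((n + kWarpSize - 1) / kWarpSize) * kWarpSize;
}

// Assumed layout of `state`:
//   [0, nReal)                            real particles
//   [nReal, nRealPadded)                  unused padding slots
//   [nRealPadded, nRealPadded + nVirtual) virtual particles
__global__ void updateParticles(float *state, int nReal, int nRealPadded,
                                int nVirtual)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < nRealPadded) {
        // Whole warps take this path. Padding lanes in the last real warp
        // simply sit idle; they do not run a second code path.
        if (i < nReal)
            state[i] += 1.0f;   // placeholder for the real-particle update
    } else if (i < nRealPadded + nVirtual) {
        state[i] = 0.0f;        // placeholder for the virtual-particle update
    }
}

// Hypothetical launch helper. `dState` must hold at least
// padToWarp(nReal) + nVirtual floats in device memory.
void launchUpdate(float *dState, int nReal, int nVirtual)
{
    int nRealPadded = padToWarp(nReal);
    int total = nRealPadded + nVirtual;
    if (total == 0) return;     // nothing to launch

    int threads = 128;          // a multiple of the warp size
    int blocks  = (total + threads - 1) / threads;
    updateParticles<<<blocks, threads>>>(dState, nReal, nRealPadded, nVirtual);
}
```

The cost is a few wasted slots (at most 31), and the block size has to stay a multiple of the warp size so that global indices keep lining up with warp boundaries.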

cta launch is the number of blocks launched on a given multiprocessor (the one being profiled). A CTA, or cooperative thread array, is simply a thread block.

–someone beat me to it!

Quoc_Vinh · February 18, 2009, 5:36am

Hi Ailleur.

This is how I understand your explanation; I am sorry if I am wrong.

In my program I have gridDim.x = 13 and gridDim.y = 25, and I am using a GeForce 8800 GT, which has 14 multiprocessors.

By your explanation, cta_launch should be (13 * 25) / 14 ≈ 23. Is that right?

I used the CUDA profiler to get cta_launch, and the number it reports is 46 (= 23 * 2). I don’t have any explanation for that. If you or anybody else knows, please tell me.
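
The arithmetic can be written out as a small sketch. The helper names `blocksPerUnit` and `multiprocessorCount` are hypothetical. The thread does not confirm why the counter reads 46. One possible reading, offered only as an assumption: if the profiler counts blocks per cluster of two multiprocessors rather than per multiprocessor, a 14-multiprocessor GPU would have 7 such units, and 325 / 7 is about 46.

```cuda
#include <cuda_runtime.h>

// Average number of blocks per execution unit when a grid of
// gridX * gridY blocks is spread evenly over `units` units.
__host__ __device__ inline double blocksPerUnit(int gridX, int gridY, int units)
{
    return static_cast<double>(gridX) * gridY / units;
}

// Queries the multiprocessor count of device 0; returns -1 on failure.
int multiprocessorCount()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;
    return prop.multiProcessorCount;
}
```

Here `blocksPerUnit(13, 25, 14)` is roughly 23.2 and `blocksPerUnit(13, 25, 7)` is roughly 46.4. The counter is an integer for one particular unit, so the real value can differ slightly from the average.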

Thanks. :)