cudaprof has helped me optimise my code, mostly by identifying uncoalesced reads and writes so that I can turn them into coalesced ones. But there are a couple of the counters cudaprof provides that I would like help with.
Warp divergence - as I understand it from reading the programming manual, if a kernel contains a data-dependent condition (e.g. an if-else statement) and the threads of a warp take different branches, then the warp executes the branches serially. Is that correct, and how can this be overcome? For example, I am working with particles, and I have real and virtual particles which are set up and treated differently. In some kernels I have an if (threadid < NREAL) { /* real */ } else { /* virtual */ }, so I guess I could have two different kernels for these, but can I execute them concurrently? In another kernel, though, a real particle calculates a viscous term conditional on the value of a function of its velocity and position, and I can't see how I can get around that one.
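Divergence only costs you when threads within the same warp take different branches. If every thread in a warp takes the same branch, nothing is serialised. So make NREAL a multiple of the warp size, so that each warp holds only real or only virtual particles, and you won't (shouldn't) have any problem with that if-else.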
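To make the suggestion above concrete, here is a rough host-side sketch in plain C++ (it also compiles inside a .cu file). It is only an illustration under some assumptions: a 1D launch whose block size is a multiple of the warp size, a warp size of 32, and a global thread index used directly as the particle index. The names kWarpSize, padToWarp and divergentWarps are made up for this example and are not part of CUDA or cudaprof.

```cpp
#include <algorithm>
#include <cassert>

// Assumed warp size (32 on NVIDIA GPUs so far); device code can read the
// built-in warpSize instead of hard-coding it.
constexpr int kWarpSize = 32;

// Hypothetical helper: round n up to the next multiple of the warp size.
constexpr int padToWarp(int n) {
    return ((n + kWarpSize - 1) / kWarpSize) * kWarpSize;
}

// Hypothetical helper: count the warps that contain threads on both sides of
// an `if (tid < nReal)` test, for totalThreads threads numbered from 0.
int divergentWarps(int nReal, int totalThreads) {
    assert(nReal >= 0 && nReal <= totalThreads);
    int count = 0;
    for (int first = 0; first < totalThreads; first += kWarpSize) {
        int last = std::min(first + kWarpSize, totalThreads) - 1;
        // A warp diverges only when the real/virtual boundary falls
        // strictly inside its range of thread indices.
        if (first < nReal && nReal <= last) {
            ++count;
        }
    }
    return count;
}
```

With 1000 real particles and 2048 threads, divergentWarps(1000, 2048) reports one mixed warp (threads 992 to 1023). If the real section is padded to padToWarp(1000) = 1024 slots and the virtual particles start at index 1024, no warp straddles the boundary. The padding slots would need to be skipped or filled with harmless dummy particles, which is a design choice this sketch does not cover.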
cta launch is the number of thread blocks (CTA = cooperative thread array, i.e. a block) launched on a given multiprocessor (the one being profiled).