Unexplained Output from Compute Visual Profiler

I am implementing a radix2 fft kernel on a CC 2.1 GPU, no shared memory no texture yet.It is based on the famous Microsoft paper.
Profiler is reporting non-zero values for “branch” and “divergent branch”. Block size is 512x1x1, N = 2^20 for the following kernel. I cannot understand how any thread will diverge in this code:

global void cudaR2FFT0(uint32_t N, uint32_t Ns, float2* in, float2* out)
{
uint32_t gTid = (blockIdx.x * blockDim.x) + threadIdx.x;
float2 v0, v1, w, tmp;
uint32_t idx = gTid;
tmp = in[idx];
v0.x = tmp.x;
v0.y = tmp.y;
float angle = (-2.0f * (float)M_PI * (gTid % Ns)) / (Ns * 2);
idx += (N >> 1);
tmp = in[idx];

w.x = cos(angle);
w.y = sin(angle);
v1.x = (tmp.x * w.x) - (tmp.y * w.y);
v1.y = (tmp.x * w.y) + (tmp.y * w.x);

tmp = v0;
v0 = tmp + v1;
v1 = tmp - v1;

idx  = (((gTid / Ns) * Ns * 2) + (gTid % Ns));
out[idx] = v0;
idx += Ns;
out[idx] = v1;

}

The branches could come from the division and modulo operators used in the code. The GPU does not have hardware support for these operations, so they are implemented as software subroutines. By the way, if your intention is optimize this code for performance, I would suggest

(1) using sincospi() instead of sin( …* M_PI) and cos (… * M_PI).
(2) avoiding division and modulo operations

I think you will also find these suggestions in the Best Practices Guide. Note that M_PI is likely defined as a double-precision constant in your host’s math.h header file, and the use of a double-precision constant causes the entire expression to be evaluated in double precision according to C/C++ type promotion rules.

Thanks,
I was using sincos() routine but that was causing 20 extra registers for the kernel.
The modulo and division operators have been converted to bit manipulation but still branches and divergent branches are non-zero.
The kernel is launched 20 times, only in the very first one divergent branch is 0 but branch is not. All the subsequent launches report non zero (but decreasing) number of branches and divergent branches.
Had not used the profiler for a long time. Now I find out that CC 2.0 and later machines do not have the warp serialize and coalesced global memory access counters. How am I going to tell if there are shared memory conflicts(not for this example obviously) and non-coalesced global memory accesses?

I am a bit surprised. The register usage of separate sin() and cos() calls vs a single sincos() call should be approximately equal, in my experience. In any event, sincospi() requires fewer registers than sincos() and is also faster, which leads to the above suggestion.

The profiler can only provide information that can be extracted from given hardware with event counters. The hardware support for profiling is often improved in new GPU generations. I see no reason to be concerned about branches in this code. There are branches in math library functions such as the trig functions, and operators like division and modulo.

I started using -Xptxas -v as you suggested the other day on a different thread.
I am familiar with sincosf(), sincos() and __sincosf() (the fastmath flavor), what is sincospi() (is it cuda 5.0)?
I used bit tricks for mod and div, took out the trig functions altogether, got wrong results obviously but the profiler is still reporting branches and divergent branches.

sincospi: See http://docs.nvidia.com/cuda/cuda-math-api/index.html

branches: You might want to have a look at the generated machine code by disassembling with cuobjdump --dump-sass. If you see BRA instructions, those are branches. I assume CAL (call) instructions are also counted as branches but I don’t know for sure.

Is there a particular reason you want to implement a radix-2 FFT kernel? Small radices (2,3,5) and combinations thereof is where CUFFT provides the best performance, and I would expect is hard to beat. Is there some functionality missing from CUFFT that you need? If so, consider filing an enhancement request.

Thanks,

Debugging assembly is a bit more than what I can invest timewise. Whatever the generated instructions are, the threads in a warp are simd/simt synchronous as long as there are no data dependent conditionals. Still no reason for the divergent branching that I can see.
The r2 FFT is not an end in itself, it is merely a warmup/segue into my real project.
Last I used the profiler, I was on CC 1.3, now I am on CC2.1(mostly) and CC3.0.
Disappointed to see that the ability to catch smem bank conflicts and non-coalesced load/stores are gone.

I am not suggesting to debug at the assembly level, I am just suggesting doing a quick sanity check. Your question was why there are non-zero branch statistics, my working hypothesis is that the machine code actually contains branches. If there are no branches anywhere in the machine code but the profiler reports non-zero branch stats, that could point to an issue with the profiler.

I don’t think the metrics you are looking for are necessarily gone, but possibly the metrics are now named differently. Off the top of my head, the shared memory statistics are called something like shared_memory_{load|store}_efficicency these days (which I find more useful than a raw count of shared memory bank conflicts). I have only just recently started using the nvprof profiler more intensely, so I am not an expert on interpreting profiler output.

[later:]

Looking at the output of nvprof --query-events and nvprof --query-metrics applied to an sm_21 device, the shared memory bank conflicts stats are there, as an event counter l1_shared_bank_conflict. Due to the different way global memory transactions are handled on sm_2x and sm_3x (compared to sm_1x) a traditional “uncoalesced global memory access” event counter does not make sense. nvprof provides the metrics gld_efficiency and gst_efficiency provided which should serve that purpose on newer GPU architectures (such as sm_21).

If you are on the Windows platform Nsight Visual Studio Edition CUDA profiler supports collection of source correlated counters that will accurately show you inst_executed, thread_inst_executed, not_predicated_off_thread_inst_exectued, branch_executed, branch_taken, divergent_branch_executed, and many memory statistics per SASS instructions. These counters values are rolled up to PTX and high level source code.

Fermi and Kepler architectures support counters for assessing the efficiency of your memory accesses to L1 and bank conflicts to shared memory. The nvprof metrics are:

shared_replay_overhead: Average number of replays due to shared memory conflicts for each instruction executed

global_replay_overhead: Average number of replays due to local memory cache misses for each instruction executed

global_cache_replay_overhead: Average number of replays due to global memory cache misses for each instruction executed

local_load_transactions_per_request: Average number of local memory load transactions performed for each local memory load

local_store_transactions_per_request: Average number of local memory store transactions performed for each local memory store

shared_load_transactions_per_request: Average number of shared memory load transactions performed for each shared memory load

shared_store_transactions_per_request: Average number of shared memory store transactions performed for each shared memory store

gld_transactions_per_request: Average number of global memory load transactions performed for each global memory load

gst_transactions_per_request: Average number of global memory store transactions performed for each global memory store

local_load_transactions: Number of local memory load transactions

local_store_transactions: Number of local memory store transactions

shared_load_transactions: Number of shared memory load transactions

shared_store_transactions: Number of shared memory store transactions

gld_transactions: Number of global memory load transactions

gst_transactions: Number of global memory store transactions

If you run the Visual Profiler memory analysis and any of the transactions per request values are high the analysis will provide you a link to the source line responsible for the memory operation.