Visual profiler results

I’m trying to tune my kernels a bit with the Visual Profiler. Besides the fact that it doesn’t count
texture reads as part of the memory throughput, I find the results a bit confusing.
How do I figure out whether the values I see are too high or too low? What does gld coalesced == 187744 mean?
Is 493730 for branch and 9851 for divergent branch high or low?
What does an instruction throughput of 0.9 vs 0.2 mean?

I think it would be great if NVIDIA could publish some rules of thumb or some sort of article
to help make sense of those values, for example: divide instruction throughput by the branch count,
and a value above X means…

I tried to Google for more information, but all I found was the readme for the profiler.



As far as I know, gld coalesced is the number of coalesced read transactions from device memory, and branch is the total number of branches executed by your code. A branch comes from a conditional statement (an “if”, for instance). When, inside a single warp (a set of 32 consecutive threads), some threads do one piece of work on their part of the data and the other threads do different work on theirs, you get a divergent branch. Divergent branches make the GPU do additional work, because the two paths are executed one after the other, so they should be avoided (CUDA is based, for now, on a SIMD architecture, not a MIMD one).

So the branch count and the gld coalesced count are not very meaningful by themselves, as they grow with the amount of data you feed your program. The ratio of divergent branches to total branches tells you how heterogeneous your processing is: the lower it is, the more efficient your code is (keep in mind it’s not always possible to avoid divergent branching completely). In your case, you have a ratio of about 2%, which is quite good.
To know whether your memory transactions are good, use the gld 32b/64b/128b counters. You are efficient when you have more 128-byte transactions, less efficient when most transactions are 32-byte sized.

Instruction throughput estimates the arithmetic efficiency of your program, but I don’t know exactly how it is calculated.

OK, that I understand. What I meant was: given that I have 187744 gld coalesced, how good is that? Let’s say my kernel takes X ms; how do I figure out how much of it was due to memory operations (coalesced or not), how much was because of divergent branches, warp serialization, or instructions?

As far as I remember, the GB/s throughput calculated by the new profiler version is simply the number of bytes read from gmem divided by the kernel time. However, if my kernel is compute intensive, the GB/s throughput won’t tell me anything. And if that is the case, how do I know what hurts my performance the most: instructions, warp serialization, branches? The numbers just don’t add up or make sense all together.

edit: My kernel has an additional problem: most of the gmem reads are done via texture fetches, which don’t show up in the GB/s statistics for some reason. Hence it’s a real problem understanding where my kernel works hard.

Ok that makes sense :)

Intel’s VTune has some rules of thumb that give an indication of what the bottleneck in the code might be, such as an L1 cache miss rate above 75% being bad (or whatever)… I can’t say it’s 100% bulletproof advice, but it’s a help.

I find something like that missing from the profiler and its documentation.