I’m trying to tune my kernels a bit with the visual profiler. Besides the fact that it doesn’t count
texture reads as part of the memory throughput, I find the results a bit confusing to interpret.
How do I figure out whether the values I see are too high or too low? What does gld coalesced == 187744 mean?
Is 493730 for branch and 9851 for divergent branch high or low?
What does an instruction throughput of 0.9 vs 0.2 mean?
I think it would be great if nVidia could publish some rules of thumb or some sort of article
to help understand those values, for example: divide instruction throughput by the branch count,
and a value above X means…
I tried to google for some information but all I got was the readme for the profiler.
As far as I know, gld coalesced is the number of coalesced read transactions from device (global) memory, and branch is the total number of branches executed by your code. A branch arises from a conditional statement (an “if”, for instance): the code does one kind of work on part of the data and another kind on the rest. When the threads of a single warp (a set of 32 consecutive threads) take different paths at such a statement, you have a divergent branch. The warp then has to execute both paths one after the other, so divergent branches make the GPU do additional work and should be avoided (CUDA has so far been based on a SIMD-style architecture, which NVIDIA calls SIMT, not a MIMD one).
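To make the warp idea concrete, here is a small Python sketch that models a single "if" executed by every thread and counts how many warps diverge on it. It is only a toy model under my own assumptions (one branch counted per warp, a hypothetical `count_branches` helper); it is not how the profiler itself collects its counters.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def count_branches(num_threads, predicate):
    """Return (branches, divergent_branches) for one `if` run by every thread.

    Toy model (an assumption, not the profiler's method): each warp counts
    the branch once, and the branch is divergent when the threads of that
    warp do not all agree on the predicate.
    """
    branches = 0
    divergent = 0
    for start in range(0, num_threads, WARP_SIZE):
        end = min(start + WARP_SIZE, num_threads)
        outcomes = {bool(predicate(tid)) for tid in range(start, end)}
        branches += 1
        if len(outcomes) > 1:
            divergent += 1
    return branches, divergent


# Condition alternates between neighbouring threads: every warp diverges.
per_thread = count_branches(128, lambda tid: tid % 2 == 0)

# Condition is constant inside each warp: no warp diverges.
per_warp = count_branches(128, lambda tid: (tid // WARP_SIZE) % 2 == 0)

print(per_thread, per_warp)
```

Both predicates send half of the threads down each path, yet only the first one diverges. That is why arranging the data so that the condition is uniform within a warp usually helps.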
So the branch count and gld coalesced are not very meaningful by themselves, as they grow with the amount of data you feed to your program. The ratio of divergent branches to total branches tells you how heterogeneous your processing is: the lower it is, the more efficient your code (keep in mind that it is not always possible to avoid divergent branching completely unless you use one-thread blocks). In your case the ratio is 9851 / 493730, or about 2%, which is quite good.
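The ratio is easy to compute from the two counters. A minimal Python sketch using the numbers from the question (the `divergence_ratio` helper is just a name I made up for illustration):

```python
# Counter values quoted in the question.
branch = 493730
divergent_branch = 9851


def divergence_ratio(divergent, total):
    """Fraction of branches that diverged; 0.0 if no branch was counted."""
    return divergent / total if total else 0.0


ratio = divergence_ratio(divergent_branch, branch)
print(f"divergent / branch = {ratio:.2%}")
```

Because it is a ratio, it stays comparable when you change the input size, which the raw counters do not.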
To check whether your memory transactions are efficient, use the gld 32/64/128b indicators. You are efficient when most of your transactions are 128b wide, and less efficient when many of them are 32b.
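As a rough illustration of why wide transactions go with good access patterns, the Python sketch below counts how many 128-byte segments one warp touches when each thread reads a 4-byte word at a given stride. The fixed 128-byte segment, the whole-warp granularity and the `segments_touched` helper are simplifying assumptions of mine; the real coalescing rules depend on the compute capability of the card.

```python
WARP_SIZE = 32
SEGMENT_BYTES = 128  # assumed segment size for this toy model
WORD_BYTES = 4       # each thread reads one 4-byte word


def segments_touched(stride_words, base_byte=0):
    """Count the distinct 128-byte segments read by one warp.

    Thread `tid` reads the word at byte address
    base_byte + tid * stride_words * WORD_BYTES.
    """
    addresses = [base_byte + tid * stride_words * WORD_BYTES
                 for tid in range(WARP_SIZE)]
    return len({addr // SEGMENT_BYTES for addr in addresses})


contiguous = segments_touched(1)   # neighbouring threads read neighbouring words
strided = segments_touched(2)      # every other word
scattered = segments_touched(32)   # each thread lands in its own segment

print(contiguous, strided, scattered)
```

In this model the contiguous pattern needs a single fully used segment, while the scattered one pulls in 32 segments to use 4 bytes of each.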
Instruction throughput estimates the arithmetic efficiency of your program, but I don’t know how it is calculated.