Visual profiler and compute capability 1.3

I'm trying to use the Visual Profiler to assess the amount of uncoalesced global memory access. However, I can’t figure out how to use the counters available (gld/gst request and gld/gst_32/64/128b) for compute capability 1.2 and higher. I’ve been looking around the forum but I can’t find any relevant answers.

Seriously, doesn’t anyone have an answer to my question?

The strict definition of memory coalescing that used to apply to compute 1.0/1.1 hardware doesn’t really apply to newer hardware. There used to be only two possibilities - either loads and stores were coalesced or they weren’t. Now the hardware has some extra modes which relax things considerably. The profile counters reflect this.

Now you get two sets of counters: one for requests and one for transactions. Fully coalesced access should have a 1:1 correspondence between the number of requests (at the half-warp level) and the number of memory controller transactions required to service those requests. The worst-case ratio should be 16, i.e. every request was fully serialized and every half-warp required 16 transactions per request.
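As a rough sketch of that ratio, here is how you might compute it from the profiler output. This assumes the gld_32b/64b/128b counters each count transactions of that size, so their sum is the total number of load transactions; the function name and the counter values are made up for illustration and are not from a real profiler run.

```python
def transactions_per_request(requests, t32, t64, t128):
    """Return memory transactions issued per half-warp request.

    Assumes t32/t64/t128 are transaction counts of each size
    (e.g. gld_32b, gld_64b, gld_128b) and requests is the matching
    request counter (e.g. gld request).
    """
    if requests <= 0:
        raise ValueError("requests must be positive")
    return (t32 + t64 + t128) / requests


# Hypothetical counter values for illustration only.
ratio = transactions_per_request(requests=1000, t32=0, t64=1000, t128=0)
print(ratio)  # 1.0 means one transaction per request
```

A ratio near 1 would suggest well-coalesced access, and a ratio approaching 16 would suggest the access is close to fully serialized. The same calculation should apply to the gst counters for stores.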

So does that mean that gld request should match sum(gld_32/64/128b)?

No, quite the opposite. The actual number of load and store transactions executed should be at least as large as the number of requests. The model is that one request produces at least 1 and as many as 16 transactions, depending on how the request fits into the coalescing rules. This is illustrated in some detail in Appendix G of the CUDA 3.0 programming guide.
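To make the 1-to-16 range concrete, here is a simplified model of how many transactions one half-warp request might generate. It is a sketch under stated assumptions, not the exact hardware behaviour: it assumes 4-byte words and 128-byte segments, counts one transaction per distinct segment touched, and ignores the step where the hardware shrinks a transaction to 64 or 32 bytes. The helper names are hypothetical.

```python
SEGMENT_BYTES = 128  # assumed segment size for 4-byte words
HALF_WARP = 16       # threads per half-warp


def half_warp_addresses(base, stride_words, word_bytes=4):
    """Byte address read by each thread of a half-warp for a strided access."""
    return [base + t * stride_words * word_bytes for t in range(HALF_WARP)]


def count_transactions(byte_addresses):
    """Simplified model: one transaction per distinct segment touched."""
    return len({addr // SEGMENT_BYTES for addr in byte_addresses})


# Unit stride, aligned: all 16 threads fall in one segment.
print(count_transactions(half_warp_addresses(base=0, stride_words=1)))
# Unit stride, but starting 96 bytes into a segment: spills into a second one.
print(count_transactions(half_warp_addresses(base=96, stride_words=1)))
# Stride of 32 words (128 bytes): every thread hits its own segment.
print(count_transactions(half_warp_addresses(base=0, stride_words=32)))
```

In this model the three cases give 1, 2 and 16 transactions for a single request, which is the range of ratios you would expect to see between the request and transaction counters.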