coalesce counter meaning

I have two kernels which take the same amount of input data, but read it in different ways (x and y transform). Currently they’re pretty unoptimized. I don’t see anything with gld_uncoalesced, but I see values for “gld_32b gld_64b gld_128b”, which are similar to what I would expect. However, they don’t add up to the amount of input data. Are there more counters I’m missing?

Unfortunately the visual profiler isn’t starting for me (if it even has the answers); it says “Unable to load the ‘cuda’ library.”

Where are you seeing these? I’m not sure how much info that will give you regarding coalesced reads/writes…

If the visual profiler does not work, I’d suggest trying text-based profiling: http://www.ddj.com/cpp/209601096?pgno=2

That’s what I’m using, thanks.

method          gputime   cputime   occupancy   gld_incoherent   gld_32b   gld_64b   gld_128b
large_frame_x   3734.08   3757      0.75        0                4998      9690      3060
large_frame_y   6784.19   6803      0.75        0                200880    240       480

I have a stupid question.

what do “gld_32b, gld_64b, gld_128b” mean?

I usually use the cuda profiler or cuda visual profiler Version 1.0, so I never get those parameters.

I don’t really know, but /usr/local/cuda/doc/CUDA_Profiler_2.2.txt says gld_32b is the “Number of 32 byte global memory load transactions.” I think operations of different sizes can be coalesced, e.g. a few bytes into a word, etc. I guess the 32-byte counters are counted regardless of the original request type though, because my kernel is requesting non-sequential 32-bit words, and gld_incoherent is zero for everything (maybe it’s deprecated in 2.2?).

You have to be aware of how the G200 hardware works. There are no more “coherent” and “incoherent” loads, so those counters will always be 0 on compute 1.3 hardware. The hardware will take a memory read and split it into as many 128, 64, and 32-byte memory transactions as it needs to meet the reads you requested (see the programming guide for more information).

The profiler counters are telling you how many reads of each type occurred. I haven’t played with it yet, but it would only make sense if it worked this way. I.e., if you perform fully coalesced loads/stores of 32-bit values, you should see only large transactions counted in the profiler (a half-warp of 16 threads × 4 bytes fits in one 64-byte transaction; 64-bit values need 128 bytes). If your memory reads/writes are not fully coalesced, you should see the hardware performing 1) more transactions overall and 2) more of the small 32-byte transactions to meet your memory access pattern.