coalesce counter meaning

gatoatigrado · April 14, 2009, 8:42pm

I have two kernels which take the same amount of input data, but read it in different ways (x and y transform). Currently it’s pretty unoptimized. I don’t see anything with gld_uncoalesced, but I see values for “gld_32b gld_64b gld_128b”, which are similar to what I would expect. However, they don’t sum to the same amount of input data. Are there more counters I’m missing?

Unfortunately the visual profiler isn’t starting for me (if it has the answers); says “Unable to load the ‘cuda’ library.”

jph4599 · April 14, 2009, 9:24pm

Where are you seeing these? I’m not sure how much info that will give you regarding coalesced reads/writes…

If the visual profiler does not work, I’d suggest trying text-based profiling: http://www.ddj.com/cpp/209601096?pgno=2

gatoatigrado · April 15, 2009, 2:54am

That’s what I’m using, thanks.

method         gputime  cputime  occupancy  gld_incoherent  gld_32b  gld_64b  gld_128b
large_frame_x  3734.08  3757     0.75       0               4998     9690     3060
large_frame_y  6784.19  6803     0.75       0               200880   240      480

Quoc_Vinh · April 15, 2009, 3:03am

I have a stupid question.

What do “gld_32b, gld_64b, gld_128b” mean?

I usually use the CUDA profiler or CUDA Visual Profiler version 1.0, so I never get those parameters.

gatoatigrado · April 15, 2009, 4:40am

I don’t really know, but /usr/local/cuda/doc/CUDA_Profiler_2.2.txt describes gld_32b as “Number of 32 byte global memory load transactions.” I think operations of different sizes can be coalesced, e.g. a few bytes into a word. I guess the 32-byte counter applies regardless of the original request size, though, because my kernel requests non-sequential 32-bit words, and gld_incoherent is zero for everything (maybe it’s deprecated in 2.2?).
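If each counter really is a count of transactions of that size, the traffic implied by the profiler output posted earlier in this thread can be estimated by weighting each count by its transaction size. The sketch below does only that arithmetic; the dictionary layout and the `bytes_moved` helper are illustrative names, not part of any CUDA tool.

```python
# Counter values copied from the profiler output posted in this thread.
# Keys are transaction sizes in bytes, values are transaction counts.
counters = {
    "large_frame_x": {32: 4998, 64: 9690, 128: 3060},
    "large_frame_y": {32: 200880, 64: 240, 128: 480},
}

def bytes_moved(counts):
    """Bytes transferred = sum of (transaction size * number of transactions)."""
    return sum(size * n for size, n in counts.items())

totals = {name: bytes_moved(c) for name, c in counters.items()}
for name, total in totals.items():
    print(name, total)
```

For these numbers the sketch gives 1,171,776 bytes for large_frame_x and 6,504,960 bytes for large_frame_y. That would be one plausible reason the two kernels do not "sum to the same amount of input data": the counters measure bytes transferred, not bytes requested, so a scattered pattern that uses only 4 bytes out of each 32-byte transaction inflates the total. The profiler notes from this period also seem to say that counters are sampled from only part of the chip, so neither total should be expected to equal the full input size.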

MisterAnderson42 · April 15, 2009, 1:06pm

You have to be aware of how the G200 hardware works. There are no more “coherent” and “incoherent” loads, so those counters will always be 0 on compute 1.3 hardware. The hardware will take a memory read and split it into as many 128, 64, and 32-byte memory transactions as it needs to meet the reads you requested (see the programming guide for more information).

The profiler counters are telling you how many reads of each type occurred. I haven’t played with it yet, but it would only make sense if it worked this way. I.e., if you perform fully coalesced loads/stores of 32-bit values, you should see only 128-byte loads counted in the profiler. If your memory reads/writes are not fully coalesced, you should see the hardware performing 1) more transactions overall and 2) some 64- and 32-byte transactions to meet your memory access pattern.
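The splitting described above can be modelled on the host. The sketch below is a simplified model of the half-warp protocol as I understand it from the programming guide for compute 1.2/1.3 devices: pick the segment containing the lowest active thread's address, serve every thread that falls in it, shrink the transaction while only one half of it is used, and repeat. It assumes naturally aligned accesses, and `half_warp_transactions` is a hypothetical helper, not a CUDA API.

```python
from collections import Counter

def half_warp_transactions(addresses, word_size=4):
    """Count transactions by size for one half-warp of byte addresses."""
    # Starting segment size depends on the access width (simplified model).
    segment = {1: 32, 2: 64}.get(word_size, 128)
    pending = list(addresses)          # thread order is preserved
    counts = Counter()
    while pending:
        # Segment containing the lowest-numbered thread still waiting.
        base = (pending[0] // segment) * segment
        served = [a for a in pending if base <= a < base + segment]
        size = segment
        # Shrink while only the lower or upper half of the segment is used.
        while size > 32:
            half = size // 2
            if all(a + word_size <= base + half for a in served):
                size = half
            elif all(a >= base + half for a in served):
                base, size = base + half, half
            else:
                break
        counts[size] += 1
        pending = [a for a in pending if a not in served]
    return counts

coalesced = half_warp_transactions([4 * i for i in range(16)])    # consecutive words
strided = half_warp_transactions([128 * i for i in range(16)])    # one word per segment
print(dict(coalesced), dict(strided))
```

Under this model, sixteen consecutive 32-bit words span 64 bytes and are served by a single 64-byte transaction, while a 128-byte stride produces sixteen 32-byte transactions. If the model is right, that would be consistent with large_frame_y being dominated by gld_32b, and it suggests that fully coalesced 32-bit loads would appear mostly under gld_64b, with gld_128b corresponding to wider elements or to half-warps that straddle both halves of a segment.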
