How to Verify Coalescing in new Devices

BHC · July 19, 2010, 3:35pm

I have a Fermi Tesla C2050 on CentOS 5. The CUDA profile log version is 2.0.

I am trying to determine if there are any non-coalesced memory accesses occurring in my application. However, when I place the following lines in my config.txt file:

gld_incoherent
gld_coherent

I get:

NV_Warning: Ignoring the invalid profiler config option: gld_incoherent
NV_Warning: Ignoring the invalid profiler config option: gld_coherent

A quick check of the CUDA profiler documentation shows that these options are only supported on devices of compute capability 1.x. I don’t see any option that will work for 2.x devices. How can I get this information?

ONeill · July 20, 2010, 4:25pm

Just look at “gld load” and “gld store”. These show the number of global memory transactions. If your code has a read/write ratio of 1:1 and all accesses are coalesced, you will see a 1:1 ratio between these counters, too. If you see, for example, more gld stores than gld loads, then your write access isn’t (fully) coalesced.
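To make the ratio argument concrete, here is a rough sketch in Python of how transaction counts could diverge. It is a toy model and not how the profiler actually counts: it assumes every warp of 32 threads issues one request, that each distinct 128-byte segment touched costs one transaction, and that words are 4 bytes. The names `warp_transactions` and `count_transactions` are made up for illustration.

```python
SEGMENT_BYTES = 128  # assumed transaction granularity
WARP_SIZE = 32       # threads per warp
WORD_BYTES = 4       # assumed element size (e.g. float)

def warp_transactions(addresses):
    """Count the distinct 128-byte segments touched by one warp's addresses."""
    return len({addr // SEGMENT_BYTES for addr in addresses})

def count_transactions(num_warps, stride_words):
    """Total transactions when thread t accesses word index t * stride_words."""
    total = 0
    for warp in range(num_warps):
        first_thread = warp * WARP_SIZE
        addrs = [(first_thread + lane) * stride_words * WORD_BYTES
                 for lane in range(WARP_SIZE)]
        total += warp_transactions(addrs)
    return total

# Hypothetical kernel: each thread reads one word and writes one word.
loads = count_transactions(num_warps=8, stride_words=1)   # contiguous reads
stores = count_transactions(num_warps=8, stride_words=2)  # stride-2 writes
print(loads, stores)
```

In this model the contiguous reads need one transaction per warp while the stride-2 writes need two, so the store count ends up at twice the load count even though the code does one read and one write per thread. If reads and writes both used stride 2, the two counts would be inflated equally and the ratio would still be 1:1.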

BHC · July 24, 2010, 5:56pm

Thank you for the reply. I see your logic. A coalesced block of memory transactions should only increase the count by 1, so a fully coalesced application should preserve the read/write ratio. But unless I’m misunderstanding, this isn’t exactly a five-star workaround. Not only does the programmer have to determine the actual read/write access ratio (which may not be a trivial task), but the test will also give misleading results if none of the transactions are coalesced, since the load and store counts could then be inflated equally and the ratio would still look right.

It doesn’t seem logical for NVIDIA to remove such a useful feature unless it is no longer needed. Does the L1 cache negate the concept of coalescing? Since an entire cache line is loaded whenever a word is accessed, it seems that the emphasis has shifted to locality of access rather than coalesced access. This would explain why a non-coalesced read/write counter would be replaced by an L1 cache miss counter. Did I just answer my own question?

ONeill · July 26, 2010, 8:38am

No, there is still coalescing on Fermi cards, but all that remains necessary to achieve it is alignment to 128-byte memory segments. I’m not exactly sure what the L1 cache miss counter is about or when you can see it happen, but in the case of a cache miss you have to access memory at the throughput of global memory instead of the much faster L1.
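As a back-of-the-envelope illustration of the 128-byte alignment point, the following Python sketch counts how many 128-byte segments a single warp touches. It assumes 32 threads reading consecutive 4-byte words starting at `base_addr`; the function name `segments_touched` and its parameters are hypothetical, and the real hardware behavior may differ in details.

```python
def segments_touched(base_addr, word_bytes=4, warp_size=32, segment_bytes=128):
    """Number of 128-byte segments covered by one warp reading
    warp_size consecutive words starting at base_addr (in bytes)."""
    first = base_addr // segment_bytes
    last = (base_addr + warp_size * word_bytes - 1) // segment_bytes
    return last - first + 1

aligned = segments_touched(0)     # warp starts on a segment boundary
shifted = segments_touched(4)     # warp starts one word past the boundary
print(aligned, shifted)
```

Under these assumptions the aligned warp fits in one segment, while shifting the start by a single word makes the same 128 bytes of data straddle two segments.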

forresti · December 8, 2012, 8:48pm

Since gld_incoherent and gld_coherent no longer work, is there a new list of keywords that can go in a config.txt file in Visual Profiler 2.0 and later?
