I have a Fermi Tesla C2050 on CentOS 5. The Cuda Profile Log Version is 2.0.
I am trying to determine if there are any non-coalesced memory accesses occurring in my application. However, when I place the gld_incoherent and gld_coherent options in my config.txt file, I get the following warnings:
NV_Warning: Ignoring the invalid profiler config option: gld_incoherent
NV_Warning: Ignoring the invalid profiler config option: gld_coherent
A quick check of the CUDA profiler documentation explains that these options are only supported on devices of compute capability 1.x. I don’t see any option that will work for 2.x devices. How can I get this information?
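In case it helps others hitting the same warnings: on 2.x devices the command-line profiler still accepts per-counter options, just different ones. A config.txt along these lines should work (counter names are as documented for Fermi in the Compute Command Line Profiler docs, but availability can vary by toolkit version, so check yours):

```
gld_request
gst_request
l1_global_load_hit
l1_global_load_miss
```

The request counters give you loads/stores as issued by the kernel, and the L1 hit/miss counters tell you how many of those actually went out to device memory.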
Just look at the global load and global store counters. These show the number of global memory transactions. If your kernel has a 1:1 read/write ratio and the accesses are coalesced, you will see a 1:1 ratio between these counters, too. If you see, e.g., more store transactions than load transactions, then your write accesses aren’t (fully) coalesced.
Thank you for the reply. I see your logic. A fully coalesced warp access should increase the transaction count by only one, so a fully coalesced application should preserve the read/write ratio. But unless I’m misunderstanding, this isn’t exactly a five-star workaround. Not only does the programmer have to determine his actual read/write access ratio (which may not be a trivial task), but the test produces ambiguous results as soon as coalescing fails for any of the transactions.
It doesn’t seem logical for NVIDIA to remove such a useful feature unless it is no longer needed. Does the L1 cache negate the concept of coalescing? Since an entire cache line is loaded whenever a word is accessed, it seems that the emphasis has shifted to locality of access rather than coalesced access. This would explain why a non-coalesced read/write counter would be replaced by an L1 cache miss counter. Did I just answer my own question?
No, there is still coalescing on Fermi cards, but all that remains necessary to achieve it is alignment to 128-byte memory segments. I’m not exactly sure what the L1 cache miss counter measures or when you would see it fire, but in the case of a cache miss you have to access memory at global-memory throughput instead of the much faster L1.