CUDA profiler and GTX 280 available performance counters

Hi, I just got a GTX 280 and started playing with the Visual Profiler.

However, I am not able to turn on all performance counters: gld_uncoalesced and gst_uncoalesced are greyed out and cannot be selected.

As I don’t have any other card here to compare against quickly, it’d be great if somebody could tell me whether these counters are not supported by the GTX 280 or whether I’m doing something incorrectly. If these counters cannot be turned on, is there some alternative way to find out the number of uncoalesced accesses?

Thanks in advance for a quick reply!

Those counters are not supported internally on GTX 280 w/ CUDA 2.0. I haven’t checked to see if that has changed in the 2.1 beta.

GT200 has a different kind of memory subsystem, so uncoalesced accesses in the old sense no longer exist on this hardware. It is explained in the programming guide, but in general, on GT200 hardware performance is determined by the number of memory transactions. Accesses that are coalesced (in pre-GT200 terms) require the smallest number of transactions, but semi-random accesses can now require the same number of transactions on GT200, whereas before they would all have been uncoalesced.

Hi, many thanks for such a quick reply! I’ve just checked the programming guide 2.0, but I’m even more confused now by the numbers I get from the profiler. The programming guide says:

"The global memory access by all threads of a half-warp is coalesced into a single memory transaction as soon as the words accessed by all threads lie in the same segment of size equal to:

a) 32 bytes if all threads access 8-bit words,

b) 64 bytes if all threads access 16-bit words,

c) 128 bytes if all threads access 32-bit or 64-bit words.

Coalescing is achieved for any pattern of addresses requested by the half-warp, including patterns where multiple threads access the same address. "

If there are 16 threads in a half-warp and each thread accesses one integer element, that would give us 64 B per memory transaction.

On the older hardware, the mapping between threads and addresses was quite strict; now there is a relaxation: any mapping pattern is allowed, as long as the accesses fall inside a memory segment that is 2x larger than the amount of data to be transferred. Did I get that right?

So, what would be the minimal number of memory transactions for loading the elements of an int array? For example, if I have an array of 64 MB of ints, and each int is loaded by one thread, would the minimum number of memory transactions correspond to the number of half-warps?

How can I figure out the number of memory transactions and see whether there is any need to optimize? Is there any support for this in the Visual Profiler?

You know, that statement is not correct in any precise sense. Uncoalesced and coalesced accesses still exist on GT200; there is just an additional class of partially coalesced accesses.

RoofTopG, I think you understand it. The minimum number of transactions is 1 per half-warp (per instruction). There is no tool to check whether your code is doing what you expect, certainly not in the Visual Profiler. But it is possible to write a C++ class with an overloaded operator that would do the checking. (It would be great if this were a feature of emulation mode, along with checking for bank conflicts, out-of-bounds accesses, and the rest of these critical memory issues.)

Or, it might be worth it to find an older card.

p.s. what do you and your other Gs do on your rooftops?

Okay, let me be more precise ;)

Before GT200, if you did not comply with the coalescing rules you would get 16 memory transactions, no matter what. Now you get the minimum number of transactions possible. If your access pattern does not allow any bundling of reads at all, you will still get 16 memory transactions.

So, nobody has answered why the uncoalesced memory load and store counters are unavailable in the new profiler. I find it very disappointing, because memory access is the first thing that needs to be optimized. Is it a restriction of the hardware?

I have a Tesla C1060.

Because such a counter would be meaningless. If you’re using 2 of the 16 slots for a particular transaction, is that a coalesced or an uncoalesced access? What if you use 8? What if you use 15?

Use the transaction counters in the 2.2 profiler; that’s what you actually care about.

Well, it would still make sense to know about [un]coalesced memory transactions if somebody was accessing global memory in a way that scattered the threads of a half-warp across different segments. Am I right?

Can you elaborate on how we can interpret the 32b/64b/128b memory transaction counters in terms of [un]coalesced memory accesses?

For example, I have two kernels that access global memory with int2’s. The one I think accesses global memory in a coalesced fashion has 3468 gld 64b and 3991104 gld 128b transactions.

The other one (which again uses int2’s) has 63857664 gld 32b and 3468 gld 64b transactions. Does this one have so many gld 32b transactions because the accesses are uncoalesced, so they get split from 128b transactions into 32b ones?

However, although I think the first one is the coalesced kernel, when I use the “Global Memory Throughput” counter of the profiler I get 4.16 GB/s for the first kernel and 16.75 GB/s for the second one…

Are your kernels also doing computation? I think the bandwidth figure is calculated simply by dividing the total memory traffic by the kernel runtime. If you do a lot of computation, this will of course reduce the reported bandwidth. 4 GB/s is quite low. I have kernels that, doing only 128b transactions, saturate C1060 bandwidth and achieve > 90 GB/s on the GTX 280.

My kernels do computation, and they take approximately the same time. My interpretation of the fact that the coalesced kernel reports much lower bandwidth than the uncoalesced one is this: the Programming Guide says “Unused words in a memory transaction are still read, so they waste bandwidth. To reduce waste, hardware will automatically issue the smallest memory transaction that contains the requested words.” So for my uncoalesced kernel there are more transactions and each one wastes a lot of bandwidth, but the overall bandwidth reported is high. For the coalesced kernel less bandwidth is wasted, and what is reported is mostly useful bandwidth.

I think my kernel is not bandwidth-bound but compute-bound.