CUDA profiler & T10P

Does the T10P early engineering sample support profiling through CUDA_PROFILE?

It seems that the profiler itself works: the gpu-/cpu-time measurements are reasonable, and the reported occupancies also make sense (considering the increased number of registers). However, I’m suspicious about some figures, for example those for incoherent loads & stores. I have a kernel for which almost all global loads & stores are incoherent on a C870. Running the same code on the T10P, the profiler reports zero incoherent loads and stores. In fact, all my kernels seem to have all their global memory accesses coalesced, which would be great, but I don’t believe in this kind of magic.

Has anybody had similar experiences?

Forgot to mention: I’m running driver version 177.10 and toolkit 2.0.1640.

Alex

That’s interesting. I tried to make all my reads coalesced, but I was still surprised when I got the same result as you: zero incoherent memory accesses for both reads and writes.

So I guess it was too good to be true.

On the new hardware the concept of “uncoalesced” is quite a bit different. Instead of all accesses for a warp being either coalesced or not, the T10P GPU just generates one or more memory transactions. For example, if the addresses within a warp are sequential but misaligned, we have to generate two memory transactions (since each transaction must be aligned). If they are aligned but have a stride of two 32-bit words (rather than one), then this is again two transactions. A stride of four would be four transactions, etc.

As a result, the hardware doesn’t report a count of uncoalesced loads like on G80. Therefore, the current profiler, which was built around the signals for the old hardware, just reports zero for this.
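
To make the transaction counting described above concrete, here is a minimal sketch in Python. It is only a simplified model, not the actual hardware protocol: it assumes that accesses are grouped into 16 threads, that each thread reads one 32-bit word, and that a transaction covers one aligned 64-byte segment. These sizes and the names `count_transactions` and `strided_addresses` are hypothetical, chosen so the results line up with the examples in this post.

```python
def count_transactions(addresses, segment_bytes=64):
    """Count the distinct aligned segments touched by a group of byte addresses.

    Assumption: one memory transaction per aligned segment of `segment_bytes`.
    """
    return len({addr // segment_bytes for addr in addresses})


def strided_addresses(base, stride_words, threads=16, word_bytes=4):
    """Byte addresses for `threads` threads reading words at a fixed stride."""
    return [base + t * stride_words * word_bytes for t in range(threads)]


aligned = count_transactions(strided_addresses(0, 1))     # sequential, aligned -> 1
misaligned = count_transactions(strided_addresses(4, 1))  # sequential, shifted by one word -> 2
stride2 = count_transactions(strided_addresses(0, 2))     # stride of two words -> 2
stride4 = count_transactions(strided_addresses(0, 4))     # stride of four words -> 4
```

Under these assumptions there is no binary coalesced/uncoalesced outcome, only a transaction count that grows as the access pattern spreads over more segments.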

I believe our profiler team is working on better support for the new GPU.

Thanks,
Mark

Hmm, okay.

I think I’m slowly starting to get the picture.

So, for the new hardware the ratio of

memory instructions / memory transactions

would be the metric of interest, rather than the number of (un)coalesced accesses.
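
As a rough illustration of that ratio, here is a small Python sketch. The counter values are made up for the example, and `access_efficiency` is a hypothetical helper, not something the profiler provides.

```python
def access_efficiency(memory_instructions, memory_transactions):
    """Ratio of memory instructions to memory transactions.

    A value of 1.0 means every instruction was served by a single transaction;
    smaller values mean more transactions per instruction.
    """
    if memory_transactions == 0:
        raise ValueError("no memory transactions recorded")
    return memory_instructions / memory_transactions


# Hypothetical counts: 1000 instructions served by 1000 vs. 4000 transactions.
ideal = access_efficiency(1000, 1000)    # 1.0
strided = access_efficiency(1000, 4000)  # 0.25
```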

Thanks (also for your answers on the other posts)

Alex