Does the T10P early engineering sample support profiling through CUDA_PROFILE?
The profiler itself seems to work: the GPU/CPU time measurements are reasonable, and the reported occupancies also make sense (considering the increased number of registers). However, I’m suspicious about some of the other figures, for example those for incoherent loads and stores. I have a kernel for which almost all global loads and stores are incoherent on a C870. Running the same code on the T10P, the profiler reports zero incoherent loads and stores. In fact, all my kernels seem to have all of their global memory accesses coalesced, which would be great, but I don’t believe in this kind of magic.
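To make the question more concrete, here is a rough host-side sketch of the two coalescing rule sets as I understand them from the programming guide: the strict rule for compute capability 1.0/1.1 devices such as the C870, and the relaxed, segment-based rule for 1.2/1.3 devices. I am assuming the T10P falls into the second group. The function names are hypothetical, the model only covers 4-byte accesses by a full 16-thread half-warp, and it ignores the transaction-size reduction on newer devices.

```python
HALF_WARP = 16  # threads whose accesses are considered together
WORD = 4        # bytes per access (assumption: 32-bit loads/stores only)


def transactions_cc10(addresses):
    """Strict rule (assumed for compute capability 1.0/1.1).

    One transaction only if thread k accesses word k of a segment aligned
    to HALF_WARP * WORD bytes; otherwise one transaction per thread.
    """
    base = addresses[0]
    aligned = base % (HALF_WARP * WORD) == 0
    in_order = all(a == base + k * WORD for k, a in enumerate(addresses))
    return 1 if (aligned and in_order) else len(addresses)


def transactions_cc12(addresses, segment=128):
    """Relaxed rule (assumed for compute capability 1.2/1.3).

    One transaction per distinct aligned segment touched by the half-warp.
    """
    return len({a // segment for a in addresses})


# Byte addresses accessed by threads 0..15 for a few illustrative patterns.
patterns = {
    "aligned":    [k * WORD for k in range(HALF_WARP)],
    "offset":     [WORD + k * WORD for k in range(HALF_WARP)],
    "stride_2":   [2 * k * WORD for k in range(HALF_WARP)],
    "stride_32":  [32 * k * WORD for k in range(HALF_WARP)],
}

for name, addrs in patterns.items():
    print(name, transactions_cc10(addrs), transactions_cc12(addrs))
```

Under this model, offset or small-stride patterns that count as uncoalesced under the strict rule collapse into a single transaction under the relaxed one, while widely scattered accesses still need many transactions. Whether the profiler's incoherent counters are meant to reflect any of this on the T10P is exactly what I'm unsure about.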
Has anybody had a similar experience?
Forgot to mention: I’m running driver version 177.10 and toolkit 2.0.1640.