A few profiler-related questions
I found this in the profiler 2.2 documentation
The threadCountPerBlock is 32,8,8, i.e. 32 x 8 x 8 == 2048 threads…
Did something change with the GTX 280? I thought 512 was the limit on threads per block…
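To make the arithmetic behind this question explicit, here is a minimal host-only sketch. It assumes the compute capability 1.x limits (512 threads per block, block dimensions up to 512 x 512 x 64), which is what I understand applies to the GTX 280 and Tesla C1060. The names `BlockLimits`, `threadsPerBlock` and `blockFits` are hypothetical helpers I made up for illustration, not part of any CUDA API.

```cpp
// Assumed limits for compute capability 1.x devices (hypothetical struct).
struct BlockLimits {
    unsigned maxThreadsPerBlock;  // total threads allowed in one block
    unsigned maxDimX;             // largest allowed blockDim.x
    unsigned maxDimY;             // largest allowed blockDim.y
    unsigned maxDimZ;             // largest allowed blockDim.z
};

// Assumption: 512 threads per block, dimensions up to 512 x 512 x 64.
static const BlockLimits kCompute1x = {512, 512, 512, 64};

// Total number of threads in a block of shape (x, y, z).
inline unsigned threadsPerBlock(unsigned x, unsigned y, unsigned z) {
    return x * y * z;
}

// True if a block of shape (x, y, z) respects both the per-dimension
// limits and the total thread limit.
inline bool blockFits(unsigned x, unsigned y, unsigned z,
                      const BlockLimits& lim = kCompute1x) {
    return x <= lim.maxDimX && y <= lim.maxDimY && z <= lim.maxDimZ &&
           threadsPerBlock(x, y, z) <= lim.maxThreadsPerBlock;
}
```

Under these assumed limits, `blockFits(32, 8, 8)` is false because 2048 > 512, while a shape such as (32, 4, 4) just fits. On a real device the limits can be read from `cudaGetDeviceProperties` (`maxThreadsPerBlock`, `maxThreadsDim`) instead of being hard-coded.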
Also the profiler mentions that
If the “incoherent” counter is always zero, how do we find out how good our kernels are? Without a ratio between coherent and incoherent accesses, it is difficult to extract any meaningful information.
It is the same on my Tesla as well (I only checked minimally, though). So, is this known issue applicable to the Tesla C1060 as well?
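To show what I mean by a ratio, here is a minimal sketch of the metric I would like to compute from the two load counters. The function name `coherentFraction` and the convention of returning -1.0 when there are no loads are my own assumptions for illustration; the inputs are just the raw counter values as reported by the profiler.

```cpp
#include <cstdint>

// Fraction of global loads that were coherent, in the range [0, 1].
// Returns -1.0 when both counters are zero, since the ratio is undefined.
inline double coherentFraction(std::uint64_t coherent,
                               std::uint64_t incoherent) {
    const std::uint64_t total = coherent + incoherent;
    if (total == 0) {
        return -1.0;  // no loads recorded, nothing to measure
    }
    return static_cast<double>(coherent) / static_cast<double>(total);
}
```

If the incoherent counter is always reported as zero, this fraction is 1.0 for every kernel that performs any load, so it cannot distinguish a well-coalesced kernel from a badly coalesced one.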
Also, I find that the “instructions” counter increases as the number of FOR loop iterations increases… And there is a notable increase in the “gld_coherent” values as well, indicating that instructions are stored in global memory with a small cache in the MP. This has long been discussed in the forums. At least now we have some credible proof.
The kernel I checked did NOT have a single global memory load, and yet I was seeing “gld_coherent” increase along with “instructions”.