The threadCountPerBlock is (32, 8, 8) == 2048 threads…
Did something change with the GTX 280? I thought 512 threads was the limit per block…
QUESTION 2
Also the profiler mentions that
If the “incoherent” counter is always zero, how do we find out how good our kernels are? Without a ratio between coherent and incoherent loads, it is difficult to extract meaningful information.
It is the same on my Tesla as well (though I checked only minimally). So, is this known issue applicable to the Tesla C1060 too?
SOME FINDINGS
Also, I find that the “instructions” counter increases with the number of FOR-loop iterations. There is a notable increase in the “gld_coherent” values as well, indicating that instructions are stored in global memory with a small cache on the MP. This has long been discussed in the forums. At least now we have some credible proof.
The kernel I checked did NOT have a single global memory load, yet I was seeing “gld_coherent” increase along with “instructions”.
Hi,
Question 1: where do you see it in the Visual Profiler? I doubt it has changed.
Question 2: As far as I know, nVidia removed the incoherent stats because you now have the global memory read/write/overall throughput statistics. I think tmurray had a reasonable answer as to why (those counters were hard to interpret as well). I think the new method is not that great either. Take a look here: [url=“http://forums.nvidia.com/index.php?showtopic=99433”]http://forums.nvidia.com/index.php?showtopic=99433[/url]
As for the findings: did you have lmem access in the kernel? That might also explain it.
I found this in the profiler documentation, “cuda_profiler_2_2.txt” in the “doc” directory; it is quoted there as an example.
The profiler documentation snippet I posted above gives the reason: due to the smarter coalescing hardware, the profiler can no longer detect incoherence. I am just wondering whether Tesla also falls into that category.
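For anyone who wants to reproduce these counter readings without the Visual Profiler, here is a minimal command-line profiler setup sketched under CUDA 2.2-era conventions (the counter names are the ones discussed in this thread; which ones are available varies by GPU, and `my_app` is a placeholder for your own binary):

```shell
# Enable the command-line profiler and point it at a config file.
export CUDA_PROFILE=1
export CUDA_PROFILE_CONFIG=profiler.cfg
export CUDA_PROFILE_LOG=cuda_profile.log

# The config file lists one counter per line; only a handful of
# counters can be collected in a single run.
cat > profiler.cfg <<'EOF'
instructions
gld_coherent
gld_incoherent
EOF

# Then run the instrumented binary; results land in cuda_profile.log:
# ./my_app
```

Running the same kernel with and without the FOR loop and diffing the log should show whether “gld_coherent” really tracks “instructions” the way described above.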