CUDA profiler documentation: a few questions and some interesting facts

A few profiler-related questions

QUESTION 1

I found this in the 2.2 profiler documentation:

The threadCountPerBlock is 32,8,8 == 2048 threads…

Did something change with the GTX 280? I thought 512 threads was the per-block limit…

QUESTION 2

Also, the profiler documentation mentions that

If the “incoherent” counter is always zero, how do we tell how good our kernels are? Without a ratio between coherent and incoherent accesses, it is difficult to extract meaningful information.

It is the same on my Tesla as well (though I only checked minimally). So, is this known issue applicable to the Tesla C1060 as well?

SOME FINDING

Also, I find that the “instructions” counter increases with the number of iterations of FOR loops. And there is a notable increase in the “gld_coherent” values as well, indicating that instructions are stored in global memory with a small cache in the MP. This has long been discussed in the forums. At least now we have some credible proof.

The kernel I checked did NOT have a single global memory load, yet I was seeing “gld_coherent” increase along with “instructions”.

Hi,
Question 1: where do you see it in the Visual Profiler? I doubt it has changed.
Question 2: As far as I know, since you now have the global memory read/write/overall throughput statistics,
nVidia removed the incoherent stats. I think tmurray had a reasonable answer as to why (those were hard to interpret
as well). I think the new method is not that great either. Take a look here: http://forums.nvidia.com/index.php?showtopic=99433

As for the findings: did you have local memory (lmem) access in the kernel? That might also explain it.

eyal

Furthermore, I’m now trying to do some tuning on the application.

If I run certain input data I get for my main kernel the following statistics:

66.3 GB/s read + 8.9 GB/s write == 75.2 GB/s overall → I guess this is very good for a C1060.

Now I run the same kernel on a different data and get:

20.3 GB/s read + 26.1 GB/s write == 46.4 GB/s overall

Why is it so different? what should I conclude about my kernel?

It’s either a bug in my tests, since the data here shouldn’t have played any role in the bandwidth I get,

or the numbers I get are problematic…

Any suggestions?

edit: Sarnath, sorry for hijacking your post :) … this is just another attempt to show/understand that the GB/s figures are

a bit confusing, in my opinion.

thanks

eyal

I found it in the profiler documentation in the “doc” directory, in “cuda_profiler_2_2.txt”.

That is quoted there as an example.

The profiler documentation snippet I posted above gives the reason: due to the smart coalescing hardware, they are no longer able to detect incoherency. I am just wondering whether Tesla also falls into that category.

No.

Where is that document?

I could only find documentation for the 1.2 profiler.

CUDA Visual Profiler v1.2 Readme (even though it says 2.2 in the text)

In either case, I couldn’t find any reference to GTX hardware in the profiler documentation.

FWIW, I can only execute the “command line” profiler.

Hello Space_Monkey,

That document is in the “doc” folder of your CUDA installation.