CUDA profiler documentation: a few questions and some interesting facts

A few profiler-related questions

QUESTION 1

I found this in the 2.2 profiler documentation:

The threadCountPerBlock is 32,8,8 == 2048 threads…

Did something change with the GTX 280? I thought 512 threads was the per-block limit…

QUESTION 2

Also, the profiler documentation mentions that

If the “incoherent” counter is always zero, how do we tell how good our kernels are? Without a ratio between coherent and incoherent accesses, it is difficult to extract meaningful information.

It is the same on my Tesla as well (though I only checked minimally). So, is this known issue applicable to the Tesla C1060 as well?

SOME FINDING

Also, I find that the “instructions” counter increases with the number of iterations of FOR loops. And there is a notable increase in the “gld_coherent” values as well, indicating that instructions are stored in global memory with a small cache in the MP. This has long been discussed in the forums. At least now we have some credible proof.

The kernel I checked did NOT have a single global memory load, yet I was seeing “gld_coherent” increase along with “instructions”.

Hi,
Question 1: where do you see it in the Visual Profiler? I doubt it has changed.
Question 2: As far as I know, since you now have the global memory read/write/overall throughput statistics,
nVidia removed the incoherent stats. I think tmurray had a reasonable answer as to why (those were hard to interpret
as well). I think the new method is not that great either. Take a look here: http://forums.nvidia.com/index.php?showtopic=99433

As for the findings: did you have local memory (lmem) access in the kernel? That might also explain it.

eyal

Furthermore, I’m now trying to do some tuning on the application.

If I run certain input data I get for my main kernel the following statistics:

66.3 GB/s read + 8.9 GB/s write == 75.2 GB/s overall → I guess this is very good for a C1060.

Now I run the same kernel on a different data and get:

20.3 GB/s read + 26.1 GB/s write == 46.4 GB/s overall

Why is it so different? what should I conclude about my kernel?

It’s either a bug in my tests, since the data here shouldn’t have played any role in the bandwidth I get,

or the numbers I get are problematic…

Any suggestions?

edit: Sarnath, sorry for hijacking your post :) … this is just another attempt to show/understand that the GB/s figures are

a bit confusing, in my opinion.

thanks

eyal

I found it in the profiler documentation in the “doc” directory, in “cuda_profiler_2_2.txt”.

That is quoted there as an example.

The profiler documentation snippet I posted above gives the reason: due to the smart coalescing hardware, they are no longer able to detect incoherency. I am just wondering whether Tesla also falls into that category.

No.

Where is that document?

I could only find documentation for the 1.2 profiler.

CUDA Visual Profiler v1.2 Readme (even though it says 2.2 in the text)

In either case, I couldn’t find any reference to GTX hardware in the profiler documentation.

FWIW, I can only execute the “command line” profiler.

Hello Space_Monkey,

That document is in the “doc” folder of your CUDA installation.