How to collect GPU statistics?

Hello. I need to collect some GPU statistics, such as memory usage and processor usage (in percent, for example), i.e. statistics showing how fully my program uses the GPU.

How can I do that?
Can NVIDIA PerfHUD or something else help me?

Thanks.

AFAIK, the NVIDIA occupancy calculator, the visual profiler and your own calculations are the current tools for that task.

The simplest way to get performance info is to set the CUDA_PROFILE environment variable to 1 (which switches on the driver's built-in profiling output).

When you then run your program, a file containing statistical data is generated automatically. It is named cuda_profile.log and its contents look like this:

method=[ _Z15cudaClearScreenP6Render ] gputime=[ 2651.808 ] cputime=[ 2710.502 ] occupancy=[ 1.000 ]
method=[ _Z10cudaRenderP6Render ] gputime=[ 21652.098 ] cputime=[ 21708.760 ] occupancy=[ 0.167 ]
method=[ _Z13cudaDrawLinesP6Render ] gputime=[ 4307.584 ] cputime=[ 4441.886 ] occupancy=[ 0.667 ]
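If you want to see which kernel takes how much of the time, the log is easy to post-process. Here is a minimal sketch in Python, assuming every record uses the `name=[ value ]` layout shown above; the function names `parse_line` and `summarize` are my own and not part of any NVIDIA tool. As far as I know, gputime is reported in microseconds.

```python
import re
from collections import defaultdict

# Matches one "name=[ value ]" field, e.g. gputime=[ 2651.808 ]
FIELD = re.compile(r"(\w+)=\[\s*(.*?)\s*\]")

def parse_line(line):
    """Return the fields of one log line as a dict, or None if it has no method field."""
    fields = dict(FIELD.findall(line))
    if "method" not in fields:
        return None
    for key in ("gputime", "cputime", "occupancy"):
        if key in fields:
            fields[key] = float(fields[key])
    return fields

def summarize(lines):
    """Sum gputime per method; return (method, total, share) tuples, largest total first."""
    totals = defaultdict(float)
    for line in lines:
        record = parse_line(line)
        if record is not None and "gputime" in record:
            totals[record["method"]] += record["gputime"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(method, total, total / grand_total) for method, total in ranked]

# The three records from the log excerpt above.
sample = [
    "method=[ _Z15cudaClearScreenP6Render ] gputime=[ 2651.808 ] cputime=[ 2710.502 ] occupancy=[ 1.000 ]",
    "method=[ _Z10cudaRenderP6Render ] gputime=[ 21652.098 ] cputime=[ 21708.760 ] occupancy=[ 0.167 ]",
    "method=[ _Z13cudaDrawLinesP6Render ] gputime=[ 4307.584 ] cputime=[ 4441.886 ] occupancy=[ 0.667 ]",
]

for method, total, share in summarize(sample):
    print(f"{method:40s} {total:12.3f} {share:6.1%}")
```

To use it on a real run, pass `open("cuda_profile.log")` to `summarize` instead of `sample`. Lines without a method field (such as a header) are skipped.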

However, it does not tell you how many registers and how much memory are used, and the occupancy is only reported roughly. In my case it didn't help much, as it doesn't contain detailed per-processor info, and I also need to know which part of my program takes how much time. Perhaps performance counters are available for that?

http://developer.download.nvidia.com/compute/cuda/Profiler/0.2/CudaVisualProfiler_README_0.2_beta.txt

This might be interesting as well.

-Sven

CUDA_PROFILE=1: yes, that's what I need!
I don't need to optimise anything right now, just get a short report of GPU occupancy.

I used the command set CUDA_PROFILE=1. Strange, but it doesn't work:
no *.log files have appeared.

I don't know why.

I have done this through the VS properties. Thanks a lot.
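A likely reason the plain `set` did not help: it only changes the environment of the console window it is typed in, so a program started from Visual Studio never sees the variable. Another way to make sure the variable reaches the program is to launch it from a small script. This is only a sketch: `run_with_profiling` is a name I made up, and the child process here is a stand-in that echoes the variable, to be replaced with your own CUDA executable.

```python
import os
import subprocess
import sys

def run_with_profiling(command):
    """Run command with CUDA_PROFILE=1 added to a copy of the current environment."""
    env = dict(os.environ)
    env["CUDA_PROFILE"] = "1"
    return subprocess.run(command, env=env, capture_output=True, text=True)

# Stand-in child process that only prints the variable it received.
# Replace this list with the path to your own CUDA program.
demo_command = [sys.executable, "-c", "import os; print(os.environ.get('CUDA_PROFILE'))"]

result = run_with_profiling(demo_command)
print(result.stdout.strip())
```

The log file should then appear in the working directory the program was started from.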

This is funny, but it always shows occupancy=[ 0.167 ].

To improve this, you should try to reduce divergent branches and run as many threads as possible (how many fit depends on the number of registers each thread uses).

The total number of thread invocations is also important. I think it only starts to become efficient above 8192 of them.
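To illustrate how register use can pin occupancy at a value like 0.167, here is a rough estimate in Python. It is a simplified sketch, not the real occupancy calculator: the per-multiprocessor limits are what I assume for a compute capability 1.0/1.1 card, the block sizes and register counts are hypothetical, and shared memory and register allocation granularity are ignored.

```python
# Assumed per-multiprocessor limits for a compute capability 1.0/1.1 GPU.
# These are assumptions about the hardware, not values taken from the log.
MAX_WARPS_PER_SM = 24      # 768 threads / 32 threads per warp
MAX_BLOCKS_PER_SM = 8
REGISTERS_PER_SM = 8192
WARP_SIZE = 32

def estimate_occupancy(threads_per_block, registers_per_thread):
    """Estimate occupancy as active warps divided by the maximum warps per multiprocessor."""
    if threads_per_block <= 0 or registers_per_thread <= 0:
        raise ValueError("both arguments must be positive")
    # Round up: a partially filled warp still occupies a whole warp slot.
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    blocks_by_warps = MAX_WARPS_PER_SM // warps_per_block
    blocks_by_registers = REGISTERS_PER_SM // (registers_per_thread * threads_per_block)
    active_blocks = min(MAX_BLOCKS_PER_SM, blocks_by_warps, blocks_by_registers)
    return active_blocks * warps_per_block / MAX_WARPS_PER_SM

# Hypothetical kernel configurations: (threads per block, registers per thread).
for threads, registers in [(128, 40), (256, 16), (128, 10)]:
    print(threads, registers, round(estimate_occupancy(threads, registers), 3))
```

Under these assumptions, a block of 128 threads using 40 registers each leaves room for only one block per multiprocessor, i.e. 4 of 24 warps, which is about 0.167. Cutting the register count per thread lets more blocks fit and raises the figure.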