Time To Profile

Is there a way to control the number of runs the Visual Profiler requires to profile an application?

I’m trying to profile an application that takes a minute or two to run. The profiler runs it 12 times, so it takes nearly 1/2 hour to do one profiling run. Unfortunately, the results of a half-our run are usually not very informative - the profiler records 20,002 rows on all but a couple of runs (runs that fail or exceed the allotted runtime), and it drops 19,995 of those rows.

I’d like to be able to:

  1. Profile single runs. Yes, the results are not as statistically significant, but they can still be very useful.

  2. Instead of dropping the rows that are not there in every run, drop the runs that do not complete successfully.

  3. Get useful profiling information on my code in some way. I’d be interested in alternate suggestions.

Thanks!

Use the profiler without the GUI.
If you set CUDA_PROFILE=1, the driver will generate a complete log file. You can select the counters you want to analyze.
Look at the manual/readme for more info.

OK - thanks. I’ll try that. Since the CLI documentation in the Users’ Guide is only 10 pages at the end, I thought that the preferred approach would be to use the GUI, and that I was missing something. I’d still love to be able to use the profiler through the GUI, with its automated analysis.

I’m now using the command-line interface. I keep getting messages like the following in my output:

NV_Warning: Counter <91>l1_global_load_hit<92> is not compatible with other selected counters and it cannot be profiled in this run.

Where can I find documentation on what counters are compatible with what other counters? Where can I find documentation on how many counters are available on a particular device?

Does the GUI run the programs multiple times because I have selected counters that cannot be collected at the same time?

Thanks!

dbr

So - clearly the GUI driver runs the application multiple times, with different settings each time. Is there a way to find out what parameters it is using on each run? Some of the profiling settings consistently cause my application to crash, and I’d like to avoid that.

Thanks,

David

I’d like to be able to:

  1. Profile single runs. Yes, the results are not as statistically significant, but they can still be very useful.
    You can disable all profiler counters (In the “Session settings dialog” under the “Profiler Counters” tab).
    In this case the application will be run only one time & only basic timing information & other options for kernels and memory copies will be collected. Note that in this case most of the “automated analysis” features will not be available as they require profiler counters.
  1. Instead of dropping the rows that are not there in every run, drop the runs that do not complete successfully.
    Does your application use multiple streams? Visual Profiler version 4.0 has a bug due to which rows are dropped in this case.
    Refer The Official NVIDIA Forums | NVIDIA

You should check why some runs are failing? In case the execution timeout is reached you can try increasing the timeout limit (the default is 30 seconds). Also note that the following counters can cause GPU kernels to run longer than the Timeout Detection and Recovery (TDR) limit.
“gld instructions 8/16/32/64/128 bit”
“gst instructions 8/16/32/64/128 bit”
In this case the application will terminate with an error and profiling data will not be available. Disabling the TDR timeout is recommended when these counters are selected. Or you can disable these counters.

Where can I find documentation on how many counters are available on a particular device?
You can refer the “profiler counters” section in the Compute Visual Profiler User Guide.

Does the GUI run the programs multiple times because I have selected counters that cannot be collected at the same time?
Yes Visual Profiler runs the program multiple times to collect data for all the selected counters.
To find which counters are collected in each run you can save the project and look at the “.cvp” project file.
Pasting information from the Visual Profiler 4.0 sample project “analysis.cvp”:
gld_inst_8bit,gld_inst_16bit,gld_inst_32bit,gld_inst_64bit,gld_inst_128bit
gst_inst_8bit,gst_inst_16bit,gst_inst_32bit,gst_inst_64bit,gst_inst_128bit
local_load,local_store,gld_request,gst_request
shared_load,shared_store,sm_cta_launched,l1_local_load_hit
branch,l1_local_load_miss,l1_local_store_hit
divergent_branch,l1_local_store_miss,l1_global_load_hit
inst_issued,l1_global_load_miss
inst_executed,uncached_global_load_transaction,global_store_transaction
warps_launched,threads_launched,l1_shared_bank_conflict
active_warps
active_cycles
Note that run number 1 is used to collect basic timing information and no counters are enabled in this run.

Thank you. That was quite helpful. While the Visual Profiler Users Guide does not seem to list the number of counters available, it does list the things that they can count. It seems that the global loads and stores, qualified by size, are not available with the compute architecture 2.0, but only with 2.1. That is apparently what was causing runs 2 and 3 of my program under the profiler to crash every time.

Thank you also for the pointer to the .cvp file. That’s very helpful for understanding just what the GUI is doing.

David

The “gld instructions 8/16/32/64/128 bit” and “gst instructions 8/16/32/64/128 bit” counters are available for GPU devices with compute capability 2.0 and 2.1. (Also Visual Profiler will not allow you to select counters which are not supported).

So for these counters you are either:
a) hitting the application timeout limit (you can try increasing the value in the session settings dialog) or
b) the kernel is hitting the TDR limit (you are try disabling TDR timeout).

You can find out based on the message in the Visual Profiler output window for application runs 2 and 3.

Thank you for your help on this. Please bear with me a bit longer - I’m still confused, and trying to work through this.

I didn’t think that the Visual Profiler would allow me to select counters which are not supported, either. However, the table on page 64 of “Compute Visual Profiler Users Guide”, DU-05162-001_v04, says that gld{32,64,128}b and gst{32,64,128}b are supported in 2.0 but not 2.1.

Ah - here’s why I’m confused about this. The gld and gst instruction counters are on page 67-8. The counters that are not supported in 2.0 are the transaction counters on page 64.

So, the global load/store instruction counters should work. However, when I enable them, my software exits with messages like the following, which do not look to me like timeouts. Are these really timeouts, or is something else likely to be going on here?

I’m running under Scientific Linux 6.1, 64-bit, with current updates, on a GeForce GTX 480. I think that the TDR limits are Windows-only, and the runs that die under the profiler do so quite quickly.

*** glibc detected *** <path to my executable>: free(): invalid pointer: 0x00007f8145336100 ***

======= Backtrace: =========

/lib64/libc.so.6[0x30c5675156]

/usr/lib64/nvidia/libcuda.so(+0xeea54)[0x7f81454f6a54]

/usr/lib64/nvidia/libcuda.so(+0xeef8a)[0x7f81454f6f8a]

/usr/lib64/nvidia/libcuda.so(+0xe0d5e)[0x7f81454e8d5e]

/usr/lib64/nvidia/libcuda.so(+0xf1495)[0x7f81454f9495]

/usr/lib64/nvidia/libcuda.so(+0xc2fcc)[0x7f81454cafcc]

/usr/lib64/nvidia/libcuda.so(cuLaunchKernel+0xcb)[0x7f81454ba30b]

/usr/local/cuda/lib64/libcudart.so.4(+0xeec3)[0x7f8145e23ec3]

/usr/local/cuda/lib64/libcudart.so.4(cudaLaunch+0x63)[0x7f8145e49963]

<path to my executable>[0x402ff3]

<path to my executable>[0x404381]

<path to my executable>[0x401a45]

/lib64/libc.so.6(__libc_start_main+0xfd)[0x30c561ec9d]

<path to my executable>[0x401679]

======= Memory map: ========

00400000-00430000 r-xp 00000000 fd:05 13895367                           <path to my executable>

0062f000-00631000 rw-p 0002f000 fd:05 13895367                           <path to my executable>

00631000-00634000 rw-p 00000000 00:00 0 

016ce000-0194b000 rw-p 00000000 00:00 0                                  [heap]

200000000-200300000 ---p 00000000 00:00 0 

200300000-200400000 rw-p 00000000 00:00 0 

200400000-200700000 ---p 00000000 00:00 0 

200700000-200801000 rw-p 00000000 00:00 0 

200801000-800000000 ---p 00000000 00:00 0 

30c4e00000-30c4e20000 r-xp 00000000 fd:01 2884009                        /lib64/ld-2.12.so

30c501f000-30c5020000 r--p 0001f000 fd:01 2884009                        /lib64/ld-2.12.so

30c5020000-30c5021000 rw-p 00020000 fd:01 2884009                        /lib64/ld-2.12.so

30c5021000-30c5022000 rw-p 00000000 00:00 0 

30c5600000-30c5786000 r-xp 00000000 fd:01 2884010                        /lib64/libc-2.12.so

30c5786000-30c5985000 ---p 00186000 fd:01 2884010                        /lib64/libc-2.12.so

30c5985000-30c5989000 r--p 00185000 fd:01 2884010                        /lib64/libc-2.12.so

30c5989000-30c598a000 rw-p 00189000 fd:01 2884010                        /lib64/libc-2.12.so

30c598a000-30c598f000 rw-p 00000000 00:00 0 

30c5a00000-30c5a83000 r-xp 00000000 fd:01 2884023                        /lib64/libm-2.12.so

30c5a83000-30c5c82000 ---p 00083000 fd:01 2884023                        /lib64/libm-2.12.so

30c5c82000-30c5c83000 r--p 00082000 fd:01 2884023                        /lib64/libm-2.12.so

30c5c83000-30c5c84000 rw-p 00083000 fd:01 2884023                        /lib64/libm-2.12.so

30c5e00000-30c5e02000 r-xp 00000000 fd:01 2884016                        /lib64/libdl-2.12.so

30c5e02000-30c6002000 ---p 00002000 fd:01 2884016                        /lib64/libdl-2.12.so

30c6002000-30c6003000 r--p 00002000 fd:01 2884016                        /lib64/libdl-2.12.so

30c6003000-30c6004000 rw-p 00003000 fd:01 2884016                        /lib64/libdl-2.12.so

30c6200000-30c6217000 r-xp 00000000 fd:01 2884011                        /lib64/libpthread-2.12.so

30c6217000-30c6416000 ---p 00017000 fd:01 2884011                        /lib64/libpthread-2.12.so

30c6416000-30c6417000 r--p 00016000 fd:01 2884011                        /lib64/libpthread-2.12.so

30c6417000-30c6418000 rw-p 00017000 fd:01 2884011                        /lib64/libpthread-2.12.so

30c6418000-30c641c000 rw-p 00000000 00:00 0 

30c6600000-30c6615000 r-xp 00000000 fd:01 2884022                        /lib64/libz.so.1.2.3

30c6615000-30c6814000 ---p 00015000 fd:01 2884022                        /lib64/libz.so.1.2.3

30c6814000-30c6815000 rw-p 00014000 fd:01 2884022                        /lib64/libz.so.1.2.3

30c6a00000-30c6a07000 r-xp 00000000 fd:01 2884012                        /lib64/librt-2.12.so

30c6a07000-30c6c06000 ---p 00007000 fd:01 2884012                        /lib64/librt-2.12.so

30c6c06000-30c6c07000 r--p 00006000 fd:01 2884012                        /lib64/librt-2.12.so

30c6c07000-30c6c08000 rw-p 00007000 fd:01 2884012                        /lib64/librt-2.12.so

30d0e00000-30d0e16000 r-xp 00000000 fd:01 2884050                        /lib64/libgcc_s-4.4.5-20110214.so.1

30d0e16000-30d1015000 ---p 00016000 fd:01 2884050                        /lib64/libgcc_s-4.4.5-20110214.so.1

30d1015000-30d1016000 rw-p 00015000 fd:01 2884050                        /lib64/libgcc_s-4.4.5-20110214.so.1

30d2200000-30d22e8000 r-xp 00000000 fd:01 791154                         /usr/lib64/libstdc++.so.6.0.13

30d22e8000-30d24e8000 ---p 000e8000 fd:01 791154                         /usr/lib64/libstdc++.so.6.0.13

30d24e8000-30d24ef000 r--p 000e8000 fd:01 791154                         /usr/lib64/libstdc++.so.6.0.13

30d24ef000-30d24f1000 rw-p 000ef000 fd:01 791154                         /usr/lib64/libstdc++.so.6.0.13

30d24f1000-30d2506000 rw-p 00000000 00:00 0 

7f8143adc000-7f8143add000 rw-s 00000000 00:04 9797663                    /SYSV00007921 (deleted)

7f8143add000-7f8143ade000 rw-p 00000000 00:00 0 

7f8143ade000-7f8143bde000 rw-s 2c301e000 00:05 9831                      /dev/nvidia1

7f8143bde000-7f8143cde000 rw-s 2e73c9000 00:05 9831                      /dev/nvidia1

7f8143cde000-7f8143dde000 rw-s 31970c000 00:05 9831                      /dev/nvidia1

7f8143dde000-7f8143ede000 rw-p 00000000 00:00 0 

7f8143ede000-7f8143fde000 rw-s 2e7163000 00:05 9831                      /dev/nvidia1

7f8143fde000-7f81440de000 rw-p 00000000 00:00 0 

7f81440de000-7f81440df000 rw-s efee2000 00:05 9831                       /dev/nvidia1

7f81440df000-7f81440e0000 rw-s 2e62bf000 00:05 9831                      /dev/nvidia1

7f81440e0000-7f81444e2000 rw-s 3173b9000 00:05 9831                      /dev/nvidia1

7f81444e2000-7f81444e3000 rw-s efee1000 00:05 9831                       /dev/nvidia1

7f81444e3000-7f81444e4000 rw-s 2f7b8d000 00:05 9831                      /dev/nvidia1

7f81444e4000-7f81448e6000 rw-s 2f202f000 00:05 9831                      /dev/nvidia1

7f81448e6000-7f8144907000 rw-p 00000000 00:00 0 

7f8144907000-7f8144908000 ---p 00000000 00:00 0 

7f8144908000-7f8145308000 rwxp 00000000 00:00 0 

7f8145308000-7f8145408000 rw-p 00000000 00:00 0 

7f8145408000-7f8145af4000 r-xp 00000000 fd:01 929366                     /usr/lib64/nvidia/libcuda.so.285.05.09

7f8145af4000-7f8145cf3000 ---p 006ec000 fd:01 929366                     /usr/lib64/nvidia/libcuda.so.285.05.09

7f8145cf3000-7f8145dcf000 rw-p 006eb000 fd:01 929366                     /usr/lib64/nvidia/libcuda.so.285.05.09

7f8145dcf000-7f8145df4000 rw-p 00000000 00:00 0 

7f8145e0e000-7f8145e15000 rw-p 00000000 00:00 0 

7f8145e15000-7f8145e64000 r-xp 00000000 fd:01 926313                     /usr/local/cuda/lib64/libcudart.so.4.0.17

7f8145e64000-7f8146064000 ---p 0004f000 fd:01 926313                     /usr/local/cuda/lib64/libcudart.so.4.0.17

7f8146064000-7f8146065000 rw-p 0004f000 fd:01 926313                     /usr/local/cuda/lib64/libcudart.so.4.0.17

7f8146065000-7f8146066000 rw-p 00000000 00:00 0 

7f814607d000-7f814607e000 r--s f2009000 00:05 9831                       /dev/nvidia1

7f814607e000-7f814607f000 r--s f4009000 00:05 9829                       /dev/nvidia0

7f814607f000-7f8146081000 rw-p 00000000 00:00 0 

7fff03a27000-7fff03b1d000 rwxp 00000000 00:00 0                          [stack]

7fff03b1d000-7fff03b1f000 rw-p 00000000 00:00 0 

7fff03bff000-7fff03c00000 r-xp 00000000 00:00 0                          [vdso]

ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Program run #2 completed.

This is a CUDA driver crash. Please log a bug with details or you can send me an email with details.

Please provide details of the NVIDIA driver version and CUDA toolkit version you are using.

Thanks.

OK - thanks - will do. Just for reference, I’m running kernel 2.6.32-131.17.1.el6.x86_64, with driver nvidia-x11-drv-285.05.09-1.el6.elrepo.x86_64, and the CUDA 4.0 dev kit. Its runtime links are as follows:

ldd a.out

	linux-vdso.so.1 =>  (0x00007fff6a700000)

	libcudart.so.4 => /usr/local/cuda/lib64/libcudart.so.4 (0x00007f3b584c6000)

	libm.so.6 => /lib64/libm.so.6 (0x00000030c5a00000)

	libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00000030d2200000)

	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000030d0e00000)

	libc.so.6 => /lib64/libc.so.6 (0x00000030c5600000)

	libdl.so.2 => /lib64/libdl.so.2 (0x00000030c5e00000)

	libpthread.so.0 => /lib64/libpthread.so.0 (0x00000030c6200000)

	librt.so.1 => /lib64/librt.so.1 (0x00000030c6a00000)

	/lib64/ld-linux-x86-64.so.2 (0x00000030c4e00000)

I have been unable to isolate a reasonable test case to demonstrate this bug. I’m attaching the output of nvidia-bug-report.sh, in case that is useful even without a test case.

dbr
nvidia-bug-report.log.gz (63 KB)