nvprof is too slow

Hi,

I use these nvprof options
nvprof -o analysis.prof_ROW.%h.%p.%q{OMPI_COMM_WORLD_RANK} --system-profiling on --print-gpu-trace --print-api-trace

with my mpirun command and the run takes about 14 minutes

when I try these options, nvprof runs for hours and still no output
nvprof -o analysis.prof_ROW.%h.%p.%q{OMPI_COMM_WORLD_RANK}.nvprof2
--aggregate-mode on --metrics l2_utilization,texture_utilization,system_utilization,dram_utilization,dram_read_throughput,dram_read_transactions,dram_write_throughput,dram_write_transactions,gld_efficiency,gld_throughput,gld_transactions,gld_transactions_per_request,global_cache_replay_overhead,gst_efficiency,gst_throughput,gst_transactions,gst_transactions_per_request,l1_cache_global_hit_rate,l1_cache_local_hit_rate,l1_shared_utilization,l2_atomic_throughput,l2_atomic_transactions,l2_l1_read_throughput,l2_l1_write_throughput,ldst_executed,local_memory_overhead,shared_efficiency,shared_store_throughput,sm_efficiency,sysmem_utilization,sysmem_write_throughput,tex_cache_throughput,tex_fu_utilization,tex_utilization,warp_execution_efficiency

i want to get information on the memory bandwidth. any suggestions? thanks

YAH

BTW, I’m using 4 nodes over OpenMPI. I’m just trying to profile one host. I have 2 K40 GPUs. I see that there is some activity

==28627== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
2017-05-24 08:06:28.368

but i don’t see much movement

09:23:53 up 22:27, 1 user, load average: 2.45, 2.17, 2.25
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
root pts/5 10.31.39.9 09:21 0.00s 0.01s 0.00s w

+-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          Off  | 0000:02:00.0     Off |                    0 |
| N/A   40C    P0    77W / 235W |  11249MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          Off  | 0000:84:00.0     Off |                    0 |
| N/A   34C    P0    78W / 235W |  11183MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                                    Usage |
|=============================================================================|
|    0     28627     C  …ueue/may_2017_nvprof_performance/xhpl_GPU   11145MiB |
|    1     28626     C  …ueue/may_2017_nvprof_performance/xhpl_GPU   11081MiB |
+-----------------------------------------------------------------------------+

thanks

YAH

I tried again, this time with fewer options; it failed after 30 minutes

nvprof -o analysis.prof_ROW.%h.%p.%q{OMPI_COMM_WORLD_RANK}.nvprof2
--metrics l2_utilization,texture_utilization

[cn3005:11625] *** Process received signal ***
[cn3005:11626] *** Process received signal ***
[cn3005:11625] Signal: Bus error (7)
[cn3005:11625] Signal code: Non-existant physical address (2)
[cn3005:11625] Failing at address: 0x7ff367224000
[cn3005:11626] Signal: Bus error (7)
[cn3005:11626] Signal code: Non-existant physical address (2)
[cn3005:11626] Failing at address: 0x7ff348ae3000
[cn3005:11625] [ 0] [cn3005:11626] [ 0] /lib64/libpthread.so.0[0x3603a0f7e0]
[cn3005:11626] [ 1] /usr/lib64/libcuda.so.1(+0xfa407)[0x7fffe5559407]
[cn3005:11626] [ 2] /usr/lib64/libcuda.so.1(+0x19dd0e)[0x7fffe55fcd0e]
[cn3005:11626] [ 3] /usr/lib64/libcuda.so.1(+0x19df3b)[0x7fffe55fcf3b]
[cn3005:11626] [ 4] /lib64/libpthread.so.0[0x3603a0f7e0]
[cn3005:11625] [ 1] /usr/lib64/libcuda.so.1(+0xfa407)[0x7fffe5569407]
[cn3005:11625] [ 2] /usr/lib64/libcuda.so.1(+0x19dd0e)[0x7fffe560cd0e]
[cn3005:11625] [ 3] /usr/lib64/libcuda.so.1(+0x19df3b

do you guys have any suggestions?

thanks,

yah

Hi, yah

It will take longer if you want to collect more metrics.

Here are several questions I need you to answer:

  1. Are you launching with mpirun -np 4 -host $hostname,$slavename1,$slavename2,$slavename3 nvprof -o output.%h.%p.%q{OMPI_COMM_WORLD_RANK} ./XXX ?

  2. Which CUDA toolkit version are you using?

  3. If possible, can you send me the sample you used so I can reproduce the issue?

  4. You also said: "I'm just trying to profile one host. I have 2 K40 GPUs." What do you mean? Do you not need to use mpirun?

hi,

thanks for the reply.

i’m using cuda 7.5

here is my mpi command
mpirun -v -np NUM_MPI_PROCS --hostfile host.GPUs --mca btl_openib_want_fork_support 1 --mca btl openib,self --bind-to BIND --mca btl_openib_eager_limit EAGER_VALUE --mca btl_openib_max_send_size EAGER_VALUE runHPL.sh

here is the important portion of the runHPL.sh script. the script below works fine

case ${lrank} in
[0])
#uncomment next line to set GPU affinity of local rank 0
export CUDA_VISIBLE_DEVICES=0
#uncomment next line to set CPU affinity of local rank 0
numactl --cpunodebind=0 nvprof -o /scratch.global/yhuerta/k40runs/k40Queue/june_2_2017/performance_logs/analysis.prof_ROW_GOV.%h.%p.%q{OMPI_COMM_WORLD_RANK} --system-profiling on --print-api-trace --print-gpu-trace \
$HPL_DIR/xhpl_GPU
;;
[1])
#uncomment next line to set GPU affinity of local rank 1
export CUDA_VISIBLE_DEVICES=1
#uncomment next line to set CPU affinity of local rank 1
numactl --cpunodebind=1 nvprof -o /scratch.global/yhuerta/k40runs/k40Queue/june_2_2017/performance_logs/analysis.prof_ROW_GOV.%h.%p.%q{OMPI_COMM_WORLD_RANK} --system-profiling on --print-gpu-trace --print-api-trace \
$HPL_DIR/xhpl_GPU
;;
esac
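
For context, ${lrank} is the local MPI rank on the node, set earlier in the script from OpenMPI's per-node rank variable, along the lines of:

# local rank of this process on the node (provided by OpenMPI), used to pick the GPU and NUMA node
lrank=${OMPI_COMM_WORLD_LOCAL_RANK}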

when i add these options to runHPL.sh
--metrics l2_utilization,texture_utilization,system_utilization,dram_utilization,dram_read_throughput,dram_read_transactions,dram_write_throughput,dram_write_transactions,gld_efficiency,gld_throughput,gld_transactions,gld_transactions_per_request,global_cache_replay_overhead,gst_efficiency,gst_throughput,gst_transactions,gst_transactions_per_request,l1_cache_global_hit_rate,l1_cache_local_hit_rate,l1_shared_utilization,l2_atomic_throughput,l2_atomic_transactions,l2_l1_read_throughput,l2_l1_write_throughput,ldst_executed,local_memory_overhead,shared_efficiency,shared_store_throughput,sm_efficiency,sysmem_utilization,sysmem_write_throughput,tex_cache_throughput,tex_fu_utilization,tex_utilization,warp_execution_efficiency

i let it run for hours as opposed to just 15 minutes and i don't see any output. i run on 4 nodes with 2 GPUs per node

thanks,

yah

Hi, yah

Thanks for the info.

As you said, the process started, but you didn't get results for hours.
I suppose this is a problem specific to your application.

Have you tried profiling other samples with these metrics, such as 0_Simple/simpleMPI in the SDK? Does it also take a long time?

Also, I think you could try not to request so many metrics at once; just reduce the list and see what happens.
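
For example, a much smaller request focused only on the memory bandwidth metrics should be far cheaper (the output file name below is just a placeholder):

numactl --cpunodebind=0 nvprof -o reduced.%h.%p.%q{OMPI_COMM_WORLD_RANK} --metrics dram_read_throughput,dram_write_throughput,dram_utilization $HPL_DIR/xhpl_GPU

If even that is too slow, you can also restrict collection to a particular kernel with nvprof's --kernels option, so that only that kernel gets replayed.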

PS: The latest toolkit has already been updated to 8.0.

thanks for the suggestions. i’ll start small. i’ll test it on one node and then work my way up to 4

yah

I have a similar problem.
Running HPCG benchmark. One node with two M60 GPUs.
The command is

mpirun -np 2 nvprof  --metrics dram_read_throughput,dram_utilization,dram_write_throughput ./xhpcg-3.1_gcc_485_cuda90176_ompi_1_10_2_sm_35_sm_50_sm_60_sm_70_ver_10_8_17

Without nvprof it takes about 2 minutes to finish.
With nvprof I have already been waiting for 3 hours and it is still running.

Like Yah, I also need information about memory bandwidth usage.

BTW, the following command doesn’t have the problem and finishes fast:

mpirun -np 2 nvprof --annotate-mpi openmpi ./xhpcg-3.1_gcc_485_cuda90176_ompi_1_10_2_sm_35_sm_50_sm_60_sm_70_ver_10_8_17

The --metrics option slows down profiling dramatically.

I do not see much correlation with the number of metrics: adding even a single metric after --metrics makes nvprof run tens or even hundreds of times slower.
It is not specific to MPI applications.

Hi,

I also notice that my program is very slow when profiling it with nvprof. I run with:

eventSet=tex0_cache_sector_queries,tex1_cache_sector_queries,tex2_cache_sector_queries,tex3_cache_sector_queries
nvprof --events $eventSet --log-file nvoutput_%p.csv --csv python3 main.py

Actually only 4 events.

Why is the program so slow?