nvprof is too slow

Hi,

I use these nvprof options
nvprof -o analysis.prof_ROW.%h.%p.%q{OMPI_COMM_WORLD_RANK} --system-profiling on --print-gpu-trace --print-api-trace

with my mpirun command and the run takes about 14 minutes

when I try these options, nvprof runs for hours and still no output
nvprof -o analysis.prof_ROW.%h.%p.%q{OMPI_COMM_WORLD_RANK}.nvprof2
--aggregate-mode on --metrics l2_utilization,texture_utilization,system_utilization,dram_utilization,dram_read_throughput,dram_read_transactions,dram_write_throughput,dram_write_transactions,gld_efficiency,gld_throughput,gld_transactions,gld_transactions_per_request,global_cache_replay_overhead,gst_efficiency,gst_throughput,gst_transactions,gst_transactions_per_request,l1_cache_global_hit_rate,l1_cache_local_hit_rate,l1_shared_utilization,l2_atomic_throughput,l2_atomic_transactions,l2_l1_read_throughput,l2_l1_write_throughput,ldst_executed,local_memory_overhead,shared_efficiency,shared_store_throughput,sm_efficiency,sysmem_utilization,sysmem_write_throughput,tex_cache_throughput,tex_fu_utilization,tex_utilization,warp_execution_efficiency

i want to get information on the memory bandwidth. any suggestions? thanks

YAH

BTW, I’m using 4 nodes over OpenMPI. I’m just trying to profile one host. I have 2 K40 GPUs. I see that there is some activity

==28627== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
2017-05-24 08:06:28.368

but i don’t see much movement

09:23:53 up 22:27, 1 user, load average: 2.45, 2.17, 2.25
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
root pts/5 10.31.39.9 09:21 0.00s 0.01s 0.00s w

+-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          Off  | 0000:02:00.0     Off |                    0 |
| N/A   40C    P0    77W / 235W |  11249MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          Off  | 0000:84:00.0     Off |                    0 |
| N/A   34C    P0    78W / 235W |  11183MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                                    Usage |
|=============================================================================|
|    0     28627     C  …ueue/may_2017_nvprof_performance/xhpl_GPU   11145MiB |
|    1     28626     C  …ueue/may_2017_nvprof_performance/xhpl_GPU   11081MiB |
+-----------------------------------------------------------------------------+

thanks

YAH

I tried again, this time with fewer options; it failed after 30 minutes

nvprof -o analysis.prof_ROW.%h.%p.%q{OMPI_COMM_WORLD_RANK}.nvprof2
--metrics l2_utilization,texture_utilization

[cn3005:11625] *** Process received signal ***
[cn3005:11626] *** Process received signal ***
[cn3005:11625] Signal: Bus error (7)
[cn3005:11625] Signal code: Non-existant physical address (2)
[cn3005:11625] Failing at address: 0x7ff367224000
[cn3005:11626] Signal: Bus error (7)
[cn3005:11626] Signal code: Non-existant physical address (2)
[cn3005:11626] Failing at address: 0x7ff348ae3000
[cn3005:11625] [ 0] [cn3005:11626] [ 0] /lib64/libpthread.so.0[0x3603a0f7e0]
[cn3005:11626] [ 1] /usr/lib64/libcuda.so.1(+0xfa407)[0x7fffe5559407]
[cn3005:11626] [ 2] /usr/lib64/libcuda.so.1(+0x19dd0e)[0x7fffe55fcd0e]
[cn3005:11626] [ 3] /usr/lib64/libcuda.so.1(+0x19df3b)[0x7fffe55fcf3b]
[cn3005:11626] [ 4] /lib64/libpthread.so.0[0x3603a0f7e0]
[cn3005:11625] [ 1] /usr/lib64/libcuda.so.1(+0xfa407)[0x7fffe5569407]
[cn3005:11625] [ 2] /usr/lib64/libcuda.so.1(+0x19dd0e)[0x7fffe560cd0e]
[cn3005:11625] [ 3] /usr/lib64/libcuda.so.1(+0x19df3b

do you guys have any suggestions?

thanks,

yah

Hi, yah

It will take longer if you want to collect more metrics.

Here are several questions I need you to answer:

  1. Are you launching with mpirun -np 4 -host $hostname,$slavename1,$slavename2,$slavename3 nvprof -o output.%h.%p.%q{OMPI_COMM_WORLD_RANK} ./XXX ?

  2. Which CUDA toolkit version are you using?

  3. If possible, can you send me the sample you used so I can reproduce the issue?

  4. You also said: "I'm just trying to profile one host. I have 2 K40 GPUs." What do you mean? Do you not need to use mpirun?

hi,

thanks for the reply.

i’m using cuda 7.5

here is my mpi command
mpirun -v -np NUM_MPI_PROCS --hostfile host.GPUs --mca btl_openib_want_fork_support 1 --mca btl openib,self --bind-to BIND --mca btl_openib_eager_limit EAGER_VALUE --mca btl_openib_max_send_size EAGER_VALUE runHPL.sh

here is the important portion of the runHPL.sh script. the script below works fine

case ${lrank} in
[0])
#uncomment next line to set GPU affinity of local rank 0
export CUDA_VISIBLE_DEVICES=0
#uncomment next line to set CPU affinity of local rank 0
numactl --cpunodebind=0 nvprof -o /scratch.global/yhuerta/k40runs/k40Queue/june_2_2017/performance_logs/analysis.prof_ROW_GOV.%h.%p.%q{OMPI_COMM_WORLD_RANK} --system-profiling on --print-api-trace --print-gpu-trace \
$HPL_DIR/xhpl_GPU
;;
[1])
#uncomment next line to set GPU affinity of local rank 1
export CUDA_VISIBLE_DEVICES=1
#uncomment next line to set CPU affinity of local rank 1
numactl --cpunodebind=1 nvprof -o /scratch.global/yhuerta/k40runs/k40Queue/june_2_2017/performance_logs/analysis.prof_ROW_GOV.%h.%p.%q{OMPI_COMM_WORLD_RANK} --system-profiling on --print-gpu-trace --print-api-trace \
$HPL_DIR/xhpl_GPU
;;
esac
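
For context, ${lrank} is the local MPI rank on the node, set earlier in the script from OpenMPI's per-node rank variable, along the lines of:

# local rank of this process on the node (provided by OpenMPI), used to pick the GPU and NUMA node
lrank=${OMPI_COMM_WORLD_LOCAL_RANK}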

when i add these options to runHPL.sh
--metrics l2_utilization,texture_utilization,system_utilization,dram_utilization,dram_read_throughput,dram_read_transactions,dram_write_throughput,dram_write_transactions,gld_efficiency,gld_throughput,gld_transactions,gld_transactions_per_request,global_cache_replay_overhead,gst_efficiency,gst_throughput,gst_transactions,gst_transactions_per_request,l1_cache_global_hit_rate,l1_cache_local_hit_rate,l1_shared_utilization,l2_atomic_throughput,l2_atomic_transactions,l2_l1_read_throughput,l2_l1_write_throughput,ldst_executed,local_memory_overhead,shared_efficiency,shared_store_throughput,sm_efficiency,sysmem_utilization,sysmem_write_throughput,tex_cache_throughput,tex_fu_utilization,tex_utilization,warp_execution_efficiency

i let it run for hours as opposed to just 15 minutes and i don't see any output. i run on 4 nodes with 2 GPUs per node

thanks,

yah

Hi, yah

Thanks for the info.

As you said, the process started, but you didn't get results for hours.
I suppose this is a problem specific to your application.

Have you tried profiling other samples with these metrics, such as 0_Simple/simpleMPI in the SDK? Does it also take a long time?

Also, I think you could try not to request so many metrics at once; just reduce the list and see what happens.
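
For example, a much smaller request focused only on the memory bandwidth metrics should be far cheaper (the output file name below is just a placeholder):

numactl --cpunodebind=0 nvprof -o reduced.%h.%p.%q{OMPI_COMM_WORLD_RANK} --metrics dram_read_throughput,dram_write_throughput,dram_utilization $HPL_DIR/xhpl_GPU

If even that is too slow, you can also restrict collection to a particular kernel with nvprof's --kernels option, so that only that kernel gets replayed.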

PS: The latest toolkit has already been updated to 8.0.

thanks for the suggestions. i’ll start small. i’ll test it on one node and then work my way up to 4

yah

I have a similar problem.
Running HPCG benchmark. One node with two M60 GPUs.
The command is

mpirun -np 2 nvprof  --metrics dram_read_throughput,dram_utilization,dram_write_throughput ./xhpcg-3.1_gcc_485_cuda90176_ompi_1_10_2_sm_35_sm_50_sm_60_sm_70_ver_10_8_17

Without nvprof it takes about 2 minutes to finish.
With nvprof I have already been waiting for 3 hours and it is still running.

Like Yah, I also need information about memory bandwidth usage.

BTW, the following command doesn’t have the problem and finishes fast:

mpirun -np 2 nvprof --annotate-mpi openmpi ./xhpcg-3.1_gcc_485_cuda90176_ompi_1_10_2_sm_35_sm_50_sm_60_sm_70_ver_10_8_17

The --metrics option slows down profiling dramatically.

I do not see much correlation with the number of metrics: adding even a single metric after --metrics makes nvprof run tens or even hundreds of times slower.
It is not specific to MPI applications.

Hi,

I also notice that my program is very slow when profiling it with nvprof. I run with:

eventSet=tex0_cache_sector_queries,tex1_cache_sector_queries,tex2_cache_sector_queries,tex3_cache_sector_queries
nvprof --events $eventSet --log-file nvoutput_%p.csv --csv python3 main.py

Actually only 4 events.

Why is the program so slow?