I hope you are doing well. I am trying to assess the performance of a multi-GPU implementation of my software. I use MPI to communicate between nodes, each of which has a CPU and a GTX 480 GPU.
I have collected nvprof results on each node, following the guidance in the NVVP documentation. This gives me a picture of which kernels take the most time, but it does not give me details on each kernel's performance, e.g. bandwidth usage, load/store efficiency/throughput, etc., which are available when you run a single-GPU code inside NVVP.
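For reference, this is roughly how I launch nvprof under MPI (a sketch; the output-file pattern uses nvprof's `%q{ENV}` substitution with Open MPI's `OMPI_COMM_WORLD_RANK` variable, so the environment-variable name would differ under another MPI implementation):

```shell
# Run one nvprof instance per MPI rank, writing a separate
# timeline file per rank that can later be imported into NVVP.
mpirun -np 4 nvprof -o profile.rank%q{OMPI_COMM_WORLD_RANK}.nvprof ./my_app

# To also collect the analysis metrics NVVP's guided analysis needs
# (this replays kernels and slows execution considerably):
mpirun -np 4 nvprof --analysis-metrics \
    -o metrics.rank%q{OMPI_COMM_WORLD_RANK}.nvprof ./my_app
```

The per-rank `.nvprof` files can then be imported into NVVP on a workstation, but as described above this still does not give me the per-kernel metric views I get when running a single-GPU code directly inside NVVP.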
I am therefore asking the members of this group whether there is a better way to profile my multi-GPU (MPI) code, so that I can understand what each kernel might be lacking individually.
P.S.: I have also tried the command-line profiler, but its documentation states up front that it cannot collect metrics the way NVVP can.
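This is how I enabled the command-line profiler (a sketch of the environment-variable setup; the counter names in the config file are assumptions for Fermi-class hardware, so the exact names may be wrong for other setups):

```shell
# Enable the legacy command-line profiler via environment variables.
export COMPUTE_PROFILE=1
export COMPUTE_PROFILE_CSV=1
# %d expands to the device number, giving one log per GPU.
export COMPUTE_PROFILE_LOG=cuda_profile_%d.csv

# Optional config file listing the hardware counters to record
# (only a few counters can be collected per run).
cat > profile.cfg <<'EOF'
gld_request
gst_request
EOF
export COMPUTE_PROFILE_CONFIG=profile.cfg

mpirun -np 4 ./my_app
```

As noted above, this only records raw counters and event timings; it does not compute derived metrics such as load/store efficiency the way NVVP does.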