Nvprof SM number usage and metrics profiling


I’m trying to profile a simple program with nvprof, and I have two questions:

  1. How can I see which SMs a kernel runs on? E.g., I have a matmul kernel with a block grid of [7x2x1]. Which metrics should I use to find out which SMs the thread blocks are mapped to (or how many SMs are in use during the kernel's execution)?

  2. I’m using multiple streams to check concurrent execution of kernels. I can observe concurrent execution when profiling without metrics, i.e., nvprof ./program --stream 2 --times 1. But I cannot observe concurrency when profiling with metrics, i.e., nvprof --metrics sm_efficiency ./program --stream 2 --times 1. Is there a way to also observe concurrency during metrics profiling?

I use each CPU thread to maintain a single stream and keep launching small kernels (block grid [7x2x1], with 1024 threads per block).
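For reference, the launch pattern described above could look roughly like the following. This is only a minimal sketch, not the actual program: `small_kernel` is a hypothetical stand-in for the matmul kernel, `worker` is a hypothetical helper, and I am assuming that each host thread owns one CUDA stream and that `--stream 2 --times 1` means two streams with one launch each.

```cuda
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the matmul kernel: one trivial write per thread.
__global__ void small_kernel(float *out, int n)
{
    int block_id = blockIdx.y * gridDim.x + blockIdx.x;
    int idx = block_id * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = 2.0f * idx;
}

// Each host thread creates its own stream and launches `times` kernels into it.
void worker(int times)
{
    const dim3 grid(7, 2, 1);      // block grid [7x2x1] from the post
    const dim3 block(1024, 1, 1);  // 1024 threads per block
    const int n = 7 * 2 * 1024;    // one output element per thread

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(float));

    for (int i = 0; i < times; ++i)
        small_kernel<<<grid, block, 0, stream>>>(d_out, n);

    // Report launch or execution errors for this stream, if any.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    cudaStreamDestroy(stream);
}

int main()
{
    const int num_streams = 2;  // assumed meaning of --stream 2
    const int times = 1;        // assumed meaning of --times 1

    std::vector<std::thread> threads;
    for (int s = 0; s < num_streams; ++s)
        threads.emplace_back(worker, times);
    for (auto &t : threads)
        t.join();

    cudaDeviceSynchronize();
    return 0;
}
```

Something like `nvcc -std=c++11 -Xcompiler -pthread sketch.cu -o sketch` should build it (the file name is just an example). With only one launch per stream and such a short kernel, whether the two launches actually overlap will depend on timing.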

Some results for question 2:

nvprof with --metrics profiling:
cmd: /usr/local/cuda/bin/nvprof --metrics sm_efficiency --concurrent-kernels on -f -o prof.nvvp ./program --stream 2 --times 1

nvprof without --metrics profiling:
cmd: /usr/local/cuda/bin/nvprof --concurrent-kernels on -f -o prof.nvvp ./program --stream 2 --times 1

You can see that concurrency is observed without metrics profiling but not with metrics profiling.

System settings:
nvcc version 10.0
nvprof version 10.0.130
CUDA version 10.2