I’m trying to profile a simple program using nvprof. I have two questions:
How can I see which SMs a kernel's thread blocks are launched on? E.g., I have a matmul kernel with block grid [7x2x1]; which metrics should I use to find out which SMs the thread blocks are mapped to (or how many SMs are in use during the kernel's execution)?
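(For context, as far as I know there is no nvprof metric that reports the block-to-SM mapping directly, but it can be observed from inside the kernel by reading the `%smid` special register. A minimal sketch of the idea, using the same [7x2x1] grid as above; the kernel and buffer names are made up for illustration:)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the %smid special register: the ID of the SM this thread runs on.
__device__ unsigned int get_smid() {
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// One thread per block records which SM the block landed on.
__global__ void record_smid(unsigned int *sm_ids) {
    if (threadIdx.x == 0) {
        int block_id = blockIdx.x + gridDim.x * blockIdx.y;
        sm_ids[block_id] = get_smid();
    }
}

int main() {
    const int num_blocks = 7 * 2;  // grid [7x2x1] as in the question
    unsigned int *d_sm_ids, h_sm_ids[num_blocks];
    cudaMalloc(&d_sm_ids, num_blocks * sizeof(unsigned int));

    dim3 grid(7, 2, 1), block(1024);
    record_smid<<<grid, block>>>(d_sm_ids);
    cudaMemcpy(h_sm_ids, d_sm_ids, sizeof(h_sm_ids), cudaMemcpyDeviceToHost);

    for (int i = 0; i < num_blocks; ++i)
        printf("block %d ran on SM %u\n", i, h_sm_ids[i]);
    cudaFree(d_sm_ids);
    return 0;
}
```

Counting the distinct SM IDs printed gives the number of SMs used for that launch.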
I’m using multiple streams to check concurrent execution of kernels. I can observe concurrent execution with no-metric profiling, i.e.,
nvprof ./program --stream 2 --times 1
But concurrency is not observed with metric profiling:
nvprof --metrics sm_efficiency ./program --stream 2 --times 1
So is there a way to also observe concurrency during metric profiling?
Each CPU thread maintains a single stream and keeps launching small kernels (block grid [7x2x1], with 1024 threads per block).
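(A hypothetical sketch of the launch pattern just described, assuming one CUDA stream per CPU thread; `small_kernel`, `worker`, and the buffer sizes are placeholders, not the actual program:)

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Placeholder for the real small kernel: grid [7x2x1], 1024 threads/block.
__global__ void small_kernel(float *data) {
    int block_id = blockIdx.x + gridDim.x * blockIdx.y;
    int i = block_id * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f + 1.0f;  // dummy work
}

// Each CPU thread owns one stream and launches the kernel `times` times.
void worker(int times) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    float *d;
    cudaMalloc(&d, 7 * 2 * 1024 * sizeof(float));

    dim3 grid(7, 2, 1), block(1024);
    for (int t = 0; t < times; ++t)
        small_kernel<<<grid, block, 0, stream>>>(d);

    cudaStreamSynchronize(stream);
    cudaFree(d);
    cudaStreamDestroy(stream);
}

int main() {
    const int num_streams = 2, times = 1;  // matches --stream 2 --times 1
    std::vector<std::thread> threads;
    for (int s = 0; s < num_streams; ++s)
        threads.emplace_back(worker, times);
    for (auto &t : threads)
        t.join();
    return 0;
}
```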
Some results for question 2:
nvprof with metric profiling:
/usr/local/cuda/bin/nvprof --metrics sm_efficiency --concurrent-kernels on -f -o prof.nvvp ./program --stream 2 --times 1
nvprof without metric profiling:
/usr/local/cuda/bin/nvprof --concurrent-kernels on -f -o prof.nvvp ./program --stream 2 --times 1
You can see that concurrency is observed without metric profiling but not with metric profiling.
nvcc version 10.0
nvprof version 10.0.130
CUDA version 10.2