I followed https://www2.cs.sfu.ca/~kabanets/405/ and wrote a matrix multiplication program that uses shared memory. When I tried to profile it with nvprof, I got the following error:
nvprof --metrics shared_load_transactions_per_request,shared_store_transactions_per_request ./matrixMulShared
==8001== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "matrixMulGlobal(Matrix, Matrix, Matrix)" (2 of 2)...
Replaying kernel "matrixMulGlobal(Matrix, Matrix, Matrix)" (done)
==8001== Error: Internal profiling error 4168:7.
matrix multiplication on CPU: 18.087000 ms
======== Error: CUDA profiling error.
When running nvprof without the --metrics argument, there is no error.
I also tried other metrics, such as gld_throughput and gld_efficiency, and the same error occurs.