nvprof fails with error 4168:7

I followed https://www2.cs.sfu.ca/~kabanets/405/ and wrote a matrix multiplication program that uses shared memory. When I tried to profile it with nvprof, I got the following error:

nvprof --metrics shared_load_transactions_per_request,shared_store_transactions_per_request ./matrixMulShared
==8001== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "matrixMulGlobal(Matrix, Matrix, Matrix)" (2 of 2)...
Replaying kernel "matrixMulGlobal(Matrix, Matrix, Matrix)" (done)
==8001== Error: Internal profiling error 4168:7.
matrix multiplication on CPU: 18.087000 ms
======== Error: CUDA profiling error.

When running nvprof without the --metrics argument, there is no error.

I also tried other metrics, such as gld_throughput and gld_efficiency, and the same error occurs.

  • Device: GeForce GTX TITAN X
  • System: Ubuntu 16.04.6 LTS
  • CUDA 9.1
  • The code I'm using is here: https://github.com/SaoYan/Learning_CUDA/blob/master/Ch5/matrixMulShared.cu

It may be a corrupted machine configuration (a broken CUDA install), or it may be a bug in nvprof.

Bugs get fixed all the time. You might want to try a newer CUDA version.
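
One rough way to look into the "broken CUDA install" possibility is to check which tools the shell actually resolves. The sketch below is only a sanity check under assumptions: check_tool is a hypothetical helper name (not part of CUDA), and it assumes nvcc, nvprof and nvidia-smi are meant to be on PATH. It will not diagnose error 4168:7 by itself.

```shell
#!/bin/sh
# check_tool is a hypothetical helper: it reports where a tool resolves
# on PATH, or says that it is missing.
check_tool() {
    if path=$(command -v "$1" 2>/dev/null); then
        echo "$1: $path"
    else
        echo "$1: not found on PATH"
    fi
}

# nvcc and nvprof ship with the same toolkit, so they would normally
# resolve to the same bin directory. Different directories suggest that
# two CUDA installs are being mixed.
for tool in nvcc nvprof nvidia-smi; do
    check_tool "$tool"
done

# Print the toolkit version, if a compiler is present.
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version
fi

# With no arguments, nvidia-smi prints a summary that includes the
# installed driver version.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi
fi
```

If nvcc and nvprof come from different directories, or the toolkit version printed is not the 9.1 you expect, fixing PATH or reinstalling the toolkit may be worth trying before moving to a newer CUDA version.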