Cannot profile kernel from CUDA samples

Hi, I am trying to profile the matrix multiplication kernel in /cuda-samples/Samples/0_Introduction/matrixMul/

When I run it under Nsight Compute (ncu), I currently get this error:

[Matrix Multiply Using CUDA] - Starting...

==PROF== Connected to process 117662 (/cuda-samples/Samples/0_Introduction/matrixMul/matrixMul)

GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)

CUDA error at matrixMul.cu:348 code=2(cudaErrorMemoryAllocation) "cudaProfilerStart()"

==PROF== Disconnected from process 117662

==ERROR== The application returned an error code (1).

==WARNING== No kernels were profiled.

==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

Can I ask how to fix this? Thanks!

Can you run the sample without Nsight Compute? Please post the command line and output from running it without Nsight Compute, and then from running it with Nsight Compute.

Hi, this is the output when running without ncu:

$ ./matrixMul 

[Matrix Multiply Using CUDA] - Starting...

GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)

Computing result using CUDA Kernel...

done

Performance= 2213.51 GFlop/s, Time= 0.059 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block

Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

This is the output for running with ncu:

$ sudo /usr/local/cuda-11.4/bin/ncu -o profile ./matrixMul

[Matrix Multiply Using CUDA] - Starting...

==PROF== Connected to process 117381 (/gpu/cuda-samples/Samples/0_Introduction/matrixMul/matrixMul)

GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)

CUDA error at matrixMul.cu:348 code=2(cudaErrorMemoryAllocation) "cudaProfilerStart()"

==PROF== Disconnected from process 117381

==ERROR== The application returned an error code (1).

==WARNING== No kernels were profiled.

==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

Thanks for those details. I notice that you use “sudo” for the ncu run. This can change paths, environment variables, and so on. Can you verify that “sudo ./matrixMul” runs okay on its own?

Also, if your system is configured with permissions, you should be able to run ncu without sudo. What happens if you do that? If you have an issue, is it this one and are you able to enable permissions?

Lastly, I see that you’re using the Nsight Compute from CUDA 11.4, which is quite old. You can download and install just the latest Nsight Compute from the product page and use it with your existing CUDA installation. That is something else that is worth trying to see if it fixes the issue.
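On the permissions question, one way to see how the driver is currently configured is to read the profiling setting it exposes. This is only a rough sketch: it assumes a Linux driver that lists RmProfilingAdminOnly in /proc/driver/nvidia/params, and the function name profiling_restricted is made up for illustration.

```shell
#!/bin/sh
# Sketch: report whether GPU performance counters are limited to admin users.
# Assumption: the driver lists "RmProfilingAdminOnly: <0|1>" in its params file.
# profiling_restricted is a hypothetical helper name, not part of any NVIDIA tool.
profiling_restricted() {
    params_file=${1:-/proc/driver/nvidia/params}
    # Extract the value after the key; stay quiet if the file is missing.
    value=$(sed -n 's/^RmProfilingAdminOnly:[[:space:]]*//p' "$params_file" 2>/dev/null || true)
    case "$value" in
        1) echo "restricted to admin users" ;;
        0) echo "open to all users" ;;
        *) echo "unknown" ;;
    esac
}
```

Calling profiling_restricted with no argument reads the live driver file. If it reports that counters are restricted, ncu without sudo will likely keep failing with ERR_NVGPUCTRPERM until an administrator changes the setting.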

Thanks for the reply.

It looks like sudo ./matrixMul does not work. Can I ask how I can fix this sudo path problem?

$ sudo ./matrixMul

[Matrix Multiply Using CUDA] - Starting...

GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)

CUDA error at matrixMul.cu:348 code=2(cudaErrorMemoryAllocation) "cudaProfilerStart()"

I ran into the exact error that you mentioned when running ncu ./matrixMul, which is why I had to run with sudo. Unfortunately, I am not able to get the permissions enabled.

$ ncu ./matrixMul

[Matrix Multiply Using CUDA] - Starting...

==PROF== Connected to process 92755 (/cuda-samples/Samples/0_Introduction/matrixMul/matrixMul)

GPU Device 0: "Ampere" with compute capability 8.6

MatrixA(320,320), MatrixB(640,320)

Computing result using CUDA Kernel...

==ERROR== Error: ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM

done

Performance= 747.78 GFlop/s, Time= 0.175 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block

Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

==PROF== Disconnected from process 92755

==WARNING== No kernels were profiled.

==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

Based on this, how should I go about fixing it? Thanks!

It’s not easy to say exactly what the issue is with the sudo run. What I usually do is run “sudo -i” to switch to a root login shell (or log in as root directly) and then try to get a CUDA application running correctly there. It’s likely that something isn’t installed or configured correctly in the root/sudo environment. Once you figure out what that is, you may be able to use it to fix the “sudo ./matrixMul” command as well.
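To narrow down what differs between the environments, it can help to dump each one to a file and compare a few variables that commonly matter for CUDA. The following is a minimal sketch; the function name env_diff and the list of variables are my own choices for illustration, not a definitive checklist.

```shell
#!/bin/sh
# Sketch: print selected variables that differ between two `env` dumps.
# env_diff and the variable list below are illustrative assumptions.
env_diff() {
    user_dump=$1
    root_dump=$2
    for name in PATH LD_LIBRARY_PATH CUDA_VISIBLE_DEVICES HOME; do
        # Take the first matching NAME=value line from each dump (empty if absent).
        u=$(sed -n "/^${name}=/{p;q;}" "$user_dump")
        r=$(sed -n "/^${name}=/{p;q;}" "$root_dump")
        if [ "$u" != "$r" ]; then
            printf 'user: %s\nroot: %s\n' "${u:-$name is unset}" "${r:-$name is unset}"
        fi
    done
}
```

For example, save the dumps with "env > user_env.txt" and "sudo env > sudo_env.txt" (and "sudo -i env > root_env.txt" for the login shell), then run "env_diff user_env.txt sudo_env.txt". Any line it prints is a candidate for what the plain sudo environment is missing.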

Thanks a lot for the reply. Running

$ sudo -i 
# /usr/local/cuda-11.4/bin/ncu ./matrixMul

fixed the problem. Thanks!