Running the ncu command on Ubuntu 20.04

Hi,

I am having trouble running the ncu command on Ubuntu 20.04. I have already added the ncu path to secure_path so that sudo can find the command. But when I run sudo ncu ./myscript, I get these error messages:

==PROF== Connected to process 2410 (/path/to/myscript)
==ERROR== Unknown Error on device 0.
==PROF== Disconnected from process 2410
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

My cuda version:
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0

ncu version:
Version 2022.1.1.0 (build 30914944) (public-release)

I would much appreciate it if anyone can help.

Can you please confirm that your command (“./myscript”) runs correctly without profiling on the same setup?

Are you launching a CUDA application from the script?

As suggested in the ncu warning message, you will need to use the ncu “--target-processes” option to profile kernels launched by child processes.
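For example, assuming “./myscript” is a wrapper that launches the actual CUDA binary as a child process (the script path here is just the one from your post), the invocation would look like:

```shell
# Profile kernels launched by the script itself and by any of its child processes
sudo ncu --target-processes all ./myscript
```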

Hi Sanjiv,
Thanks for your reply. I used a hello world program and ran it with ncu, and still had the same problem.
Here is my hello world script:
hello.cu

#include<stdio.h>
#include<stdlib.h>
__global__ void print_from_gpu(void) {
    printf("Hello World! from thread [%d, %d] from device\n", threadIdx.x, blockIdx.x);
}
int main(void) {
    printf("Hello World from host!\n");
    print_from_gpu<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}

normal run:

nvcc -o hello hello.cu
./hello

Hello World from host!
Hello World! from thread [0, 0] from device

run with ncu + sudo:

nvcc -o hello hello.cu
sudo ncu ./hello
Hello World from host!
==PROF== Connected to process 23597 (/home/frank/Desktop/cosine_canberra/hello)
==ERROR== Unknown Error on device 0.
Hello World! from thread [0, 0] from device
==PROF== Disconnected from process 23597
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

run with ncu:

nvcc -o hello hello.cu
ncu ./hello
Hello World from host!
==PROF== Connected to process 23576 (/home/frank/Desktop/cosine_canberra/hello)
==ERROR== ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM
Hello World! from thread [0, 0] from device
==PROF== Disconnected from process 23576
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

I followed the instructions from the above link https://developer.nvidia.com/ERR_NVGPUCTRPERM. However, when I tried to unload the dependent modules by running “modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia-vgpu-vfio nvidia”, I received “modprobe fatal module nvidia_modeset is in use”, which stopped me from following the rest of the instructions.

Hello,

The fact that you still have an issue when running ncu with sudo seems odd, as the root user should be able to access the OS features necessary for profiling.

Is there anything special about your setup (running in container, sudo configuration, etc.)?

What GPU are you trying to profile on?

when I tried to unload dependent modules by running “modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia-vgpu-vfio nvidia”, I received “modprobe fatal module nvidia_modeset is in use”

If this happened when you had already switched to the multi-user systemd target as suggested in the documentation, it seems that some processes with handles to the GPU were not killed. You can try to identify them with the lsof command by running:

$ sudo lsof /dev/nvidia*
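Once identified, you could try terminating them. This is only a sketch; be careful, as it kills every process holding a GPU handle, and services (such as the persistence daemon) may be restarted automatically by systemd:

```shell
# -t prints only the PIDs of processes with open handles on the GPU device
# files; -r makes xargs a no-op if the list is empty
sudo lsof -t /dev/nvidia* | xargs -r sudo kill
```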

If you are allowed to enable profiling for regular users in a persistent way, and wish to do that without needing to figure out the modprobe issue, you can try the following (the /etc/modprobe.d directory is distribution-dependent, but this should work on Ubuntu):

$ echo 'options nvidia "NVreg_RestrictProfilingToAdminUsers=0"' | sudo tee /etc/modprobe.d/ncu-profiling-normal-users.conf
$ sudo update-initramfs -c -k all
$ systemctl reboot

The first command will create a file that will cause the nvidia kernel module to be loaded with an additional option allowing profiling for non-root users.

The second one regenerates your initrd, which is probably needed since the nvidia module is often loaded early at boot time.

When you reboot the machine, you should be able to profile applications as your normal user. Of course, this allows all users to access performance counters on the GPU, which can be a security issue, as it can reveal information about what is running on the GPU.

If ever you want to disable this later on, all you would need to do is remove the file in /etc/modprobe.d, regenerate your initrd and reboot.
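Concretely, reverting would look like this (assuming the file name used in the commands above):

```shell
# Remove the modprobe option file, regenerate the initrd, and reboot
sudo rm /etc/modprobe.d/ncu-profiling-normal-users.conf
sudo update-initramfs -c -k all
systemctl reboot
```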

Hi Gduque,
I run my program directly on my local PC without containers.
Here is the content of my /etc/sudoers file (all comments removed):

Defaults        env_reset
Defaults        mail_badpass
Defaults        secure_path="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin:/usr/local/cuda-11.6/bin"
Defaults        env_keep += "PATH"

root    ALL=(ALL:ALL) ALL
%admin ALL=(ALL) ALL
%sudo   ALL=(ALL:ALL) ALL

Other than that, I believe there is nothing else special about my setup.

Here is my GPU info:

sudo lshw -C display
  *-display                 
       description: VGA compatible controller
       product: GP102 [GeForce GTX 1080 Ti]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:1a:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:95 memory:b4000000-b4ffffff memory:a0000000-afffffff memory:b0000000-b1ffffff ioport:7000(size=128) memory:b5000000-b507ffff
  *-display
       description: VGA compatible controller
       product: GP102 [GeForce GTX 1080 Ti]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:68:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:96 memory:d7000000-d7ffffff memory:c0000000-cfffffff memory:d0000000-d1ffffff ioport:b000(size=128) memory:c0000-dffff

I listed the processes but could not kill them; here is the screenshot:

I followed your commands and rebooted my machine, here is the content of ncu-profiling-normal-users.conf file:

cat /etc/modprobe.d/ncu-profiling-normal-users.conf 
options nvidia "NVreg_RestrictProfilingToAdminUsers=0"

However, when I tried to profile my hello program again with ncu, it still failed:

ncu ./hello
Hello World from host!
==PROF== Connected to process 2491 (/home/frank/Desktop/cosine_canberra/hello)
==ERROR== Unknown Error on device 0.
Hello World! from thread [0, 0] from device
==PROF== Disconnected from process 2491
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

sudo ncu ./hello
Hello World from host!
==PROF== Connected to process 2512 (/home/frank/Desktop/cosine_canberra/hello)
==ERROR== Unknown Error on device 0.
Hello World! from thread [0, 0] from device
==PROF== Disconnected from process 2512
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

Hello,

Nsight Compute does not support Pascal cards; only Volta and newer GPUs can be used, as stated in its documentation.

If you wish to profile on your Pascal card, you can look into nvvp and nvprof.
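For example, a basic nvprof run on the same hello binary (assuming the permission change above is in place, or prefixing with sudo) could look like:

```shell
# Summary profile: per-kernel and per-API-call totals
nvprof ./hello

# Per-launch trace of GPU activity (timestamps, grid/block sizes)
nvprof --print-gpu-trace ./hello
```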

Although the permission issue itself was solved by the commands in my previous message, the process that restarted whenever you killed it is the NVIDIA Persistence Daemon.

It might be running as a systemd service, in which case you would need to stop or disable it temporarily to unload the module:

systemctl disable --now nvidia-persistenced
[...]   # Do what you need to do here
systemctl enable --now nvidia-persistenced

Hi Gduque,

I see. After I ran the following three commands you provided:

echo 'options nvidia "NVreg_RestrictProfilingToAdminUsers=0"' | sudo tee /etc/modprobe.d/ncu-profiling-normal-users.conf
sudo update-initramfs -c -k all
systemctl reboot

I can run nvprof without the permission warning!

To test Nsight Compute, I started a new VM on Azure; here is the VM’s GPU:

sudo lshw -C display
  *-display                 
       description: 3D controller
       product: GV100GL [Tesla V100 PCIe 16GB]
       vendor: NVIDIA Corporation
       physical id: 1
       bus info: pci@0001:00:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress bus_master cap_list
       configuration: driver=nvidia latency=0
       resources: iomemory:100-ff iomemory:140-13f irq:24 memory:41000000-41ffffff memory:1000000000-13ffffffff memory:1400000000-1401ffffff

After running the above three commands, I can successfully run the hello world program with ncu too:

ncu ./hello
Hello World from host!
==PROF== Connected to process 1333 (/home/azureuser/bts-cuda-poc/hello)
==PROF== Profiling "_Z14print_from_gpuv" - 0: Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
0%....50%....100% - 19 passes
==PROF== Disconnected from process 1333
[1333] hello@127.0.0.1
  _Z14print_from_gpuv, 2022-Aug-08 06:57:12, Context 1, Stream 7
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/usecond                         743.56
    SM Frequency                                                             cycle/nsecond                           1.06
    Elapsed Cycles                                                                   cycle                          65956
    Memory [%]                                                                           %                           0.04
    DRAM Throughput                                                                      %                           0.00
    Duration                                                                       usecond                          62.14
    L1/TEX Cache Throughput                                                              %                           1.79
    L2 Cache Throughput                                                                  %                           0.04
    SM Active Cycles                                                                 cycle                         806.70
    Compute (SM) [%]                                                                     %                           0.03
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full      
          waves across all SMs. Look at Launch Statistics for more details.                                             

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                          1
    Function Cache Configuration                                                                  cudaFuncCachePreferNone
    Grid Size                                                                                                           1
    Registers Per Thread                                                   register/thread                             32
    Shared Memory Configuration Size                                                  byte                              0
    Driver Shared Memory Per Block                                              byte/block                              0
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                              byte/block                              0
    Threads                                                                         thread                              1
    Waves Per SM                                                                                                     0.00
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   Threads are executed in groups of 32 threads called warps. This kernel launch is configured to execute 1      
          threads per block. Consequently, some threads in a warp are masked off and those hardware resources are       
          unused. Try changing the number of threads per block to be a multiple of 32 threads. Between 128 and 256      
          threads per block is a good initial range for experimentation. Use smaller thread blocks rather than one      
          large thread block per multiprocessor if latency affects performance.  This is particularly beneficial to     
          kernels that frequently call __syncthreads(). See the Hardware Model                                          
          (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more      
          details on launch configurations.                                                                             
    ----- --------------------------------------------------------------------------------------------------------------
    WRN   The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 80              
          multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel      
          concurrently with other workloads, consider reducing the block size to have at least one block per            
          multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the    
          Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)            
          description for more details on launch configurations.                                                        

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                             64
    Block Limit Shared Mem                                                           block                             32
    Block Limit Warps                                                                block                             64
    Theoretical Active Warps per SM                                                   warp                             32
    Theoretical Occupancy                                                                %                             50
    Achieved Occupancy                                                                   %                           1.56
    Achieved Active Warps Per SM                                                      warp                              1
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   This kernel's theoretical occupancy (50.0%) is limited by the required amount of shared memory This kernel's  
          theoretical occupancy (50.0%) is limited by the number of blocks that can fit on the SM The difference        
          between calculated theoretical (50.0%) and measured achieved occupancy (1.6%) can be the result of warp       
          scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between    
          warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide           
          (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on           
          optimizing occupancy. 

Thank you so much~
