Hi Gduque,
I see. After running the following three commands you provided:
echo 'options nvidia "NVreg_RestrictProfilingToAdminUsers=0"' | sudo tee /etc/modprobe.d/ncu-profiling-normal-users.conf
sudo update-initramfs -c -k all
systemctl reboot
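For anyone else following along: you can verify after the reboot that the module option actually took effect. On my driver version the setting is reported as RmProfilingAdminOnly in /proc/driver/nvidia/params (the exact name may vary by driver version, so treat this as a sanity check rather than a guarantee):

```shell
# Should print "RmProfilingAdminOnly: 0" once the option is active;
# a value of 1 means profiling is still restricted to admin users.
grep RmProfilingAdminOnly /proc/driver/nvidia/params
```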
I can run nvprof without the permission warning!
To test Nsight Compute, I started a new VM on Azure. Here is the VM's GPU:
sudo lshw -C display
*-display
description: 3D controller
product: GV100GL [Tesla V100 PCIe 16GB]
vendor: NVIDIA Corporation
physical id: 1
bus info: pci@0001:00:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress bus_master cap_list
configuration: driver=nvidia latency=0
resources: iomemory:100-ff iomemory:140-13f irq:24 memory:41000000-41ffffff memory:1000000000-13ffffffff memory:1400000000-1401ffffff
After running the above three commands, I can successfully run the hello-world program with ncu too:
ncu ./hello
Hello World from host!
==PROF== Connected to process 1333 (/home/azureuser/bts-cuda-poc/hello)
==PROF== Profiling "_Z14print_from_gpuv" - 0: Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
Hello World! from thread [0, 0] from device
0%....50%....100% - 19 passes
==PROF== Disconnected from process 1333
[1333] hello@127.0.0.1
_Z14print_from_gpuv, 2022-Aug-08 06:57:12, Context 1, Stream 7
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/usecond 743.56
SM Frequency cycle/nsecond 1.06
Elapsed Cycles cycle 65956
Memory [%] % 0.04
DRAM Throughput % 0.00
Duration usecond 62.14
L1/TEX Cache Throughput % 1.79
L2 Cache Throughput % 0.04
SM Active Cycles cycle 806.70
Compute (SM) [%] % 0.03
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel grid is too small to fill the available resources on this device, resulting in only 0.0 full
waves across all SMs. Look at Launch Statistics for more details.
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 1
Function Cache Configuration cudaFuncCachePreferNone
Grid Size 1
Registers Per Thread register/thread 32
Shared Memory Configuration Size byte 0
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 0
Threads thread 1
Waves Per SM 0.00
---------------------------------------------------------------------- --------------- ------------------------------
WRN Threads are executed in groups of 32 threads called warps. This kernel launch is configured to execute 1
threads per block. Consequently, some threads in a warp are masked off and those hardware resources are
unused. Try changing the number of threads per block to be a multiple of 32 threads. Between 128 and 256
threads per block is a good initial range for experimentation. Use smaller thread blocks rather than one
large thread block per multiprocessor if latency affects performance. This is particularly beneficial to
kernels that frequently call __syncthreads(). See the Hardware Model
(https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more
details on launch configurations.
----- --------------------------------------------------------------------------------------------------------------
WRN The grid for this launch is configured to execute only 1 blocks, which is less than the GPU's 80
multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel
concurrently with other workloads, consider reducing the block size to have at least one block per
multiprocessor or increase the size of the grid to fully utilize the available hardware resources. See the
Hardware Model (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model)
description for more details on launch configurations.
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 32
Block Limit Registers block 64
Block Limit Shared Mem block 32
Block Limit Warps block 64
Theoretical Active Warps per SM warp 32
Theoretical Occupancy % 50
Achieved Occupancy % 1.56
Achieved Active Warps Per SM warp 1
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel's theoretical occupancy (50.0%) is limited by the required amount of shared memory This kernel's
theoretical occupancy (50.0%) is limited by the number of blocks that can fit on the SM The difference
between calculated theoretical (50.0%) and measured achieved occupancy (1.6%) can be the result of warp
scheduling overheads or workload imbalances during the kernel execution. Load imbalances can occur between
warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices Guide
(https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
optimizing occupancy.
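As a side note, the Launch Statistics and Occupancy warnings above suggest using a multiple of 32 threads per block and at least one block per SM. The hello-world source isn't shown in this thread, so the sketch below is only a guess at what it might look like (the mangled name _Z14print_from_gpuv demangles to print_from_gpu()), with the launch configuration adjusted along the lines the warnings recommend:

```cuda
#include <cstdio>

// Hypothetical reconstruction of the profiled kernel; the original
// printed "Hello World! from thread [0, 0] from device".
__global__ void print_from_gpu(void) {
    printf("Hello World! from thread [%d, %d] from device\n",
           threadIdx.x, blockIdx.x);
}

int main(void) {
    printf("Hello World from host!\n");
    // The report shows Grid Size 1 and Block Size 1, i.e. <<<1, 1>>>.
    // Launching 128 threads per block (a multiple of the 32-thread warp
    // size) and 80 blocks (one per SM on this V100) would address both
    // WRN messages, at the cost of printing many more lines.
    print_from_gpu<<<80, 128>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Not that a hello-world program needs the throughput, of course, but it makes the warnings go away if you want a clean report.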
Thank you so much~