Yes - I am trying to run ncu as part of the remote script; I just need to check if/where ncu is installed on the remote system. Then I can use the command:
ncu -o profile ./ufo-tesla
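(A quick way to check from the script would be something like the two commands below - "which ncu" to see if it is on the PATH and "ncu --version" to confirm it runs; ncu usually ships with the CUDA toolkit, e.g. under /usr/local/cuda/bin, though the exact location on the cloud image is an assumption on my part.)
which ncu
ncu --version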
My colleague has found that this command is needed in order to run ncu on the remote cloud machine…
docker run --rm -t -v ${DATA_DIR}:/home --workdir /home --gpus all nvidia/cuda:11.1.1-devel-ubuntu20.04 ncu -o profile --target-processes all ufo-tesla
However, when it runs I'm getting this error message just before the iterations start:
==PROF== Connected to process 38 (/home/ufo-tesla)
==ERROR== Error: ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM
and this error message when it finishes iterating…
==PROF== Disconnected from process 38
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
Any idea how to get the profiling to work here?
(I just realised that it's related to permissions)
It says…
To allow access for any user, create a file with the .conf extension containing "options nvidia NVreg_RestrictProfilingToAdminUsers=0" in /etc/modprobe.d.
It’s a bit over on the right, but the error message includes the link to get more information about enabling permissions to allow for profiling.
For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM
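For reference, on a machine where you have root access, following those instructions would look roughly like the line below (the file name nvidia-profiling.conf is just an example, and the option only takes effect after the nvidia kernel module is reloaded or the machine is rebooted):
echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' | sudo tee /etc/modprobe.d/nvidia-profiling.conf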
Yeah, thanks - we were trying to create an nvidia.conf file from within the docker command, like this…
docker run --rm -t -v ${DATA_DIR}:/home --workdir /home --gpus all nvidia/cuda:11.1.1-devel-ubuntu20.04 "echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' > /etc/modprobe.d/nvidia.conf && ncu -o profile --target-processes all ufo-tesla"
but that fails with a "no such file or directory" error.
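The "no such file or directory" part is most likely because docker run treats everything after the image name as the program to execute, so it tries to run that whole quoted string as a single binary. To run a compound command you would need to wrap it in a shell, roughly like this (untested sketch, keeping your file name and options):
docker run --rm -t -v ${DATA_DIR}:/home --workdir /home --gpus all nvidia/cuda:11.1.1-devel-ubuntu20.04 bash -c "echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' > /etc/modprobe.d/nvidia.conf && ncu -o profile --target-processes all ufo-tesla"
Even so, that only writes to the container's own /etc/modprobe.d, not the host's.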
Editing files in the /etc/modprobe.d directory requires root permissions and usually takes a reboot for the new configuration file to take effect, so it doesn't seem like something you can set this way.
Though, this is out of my area, so I don’t really know. You might try asking over on the Nsight-Systems forum: Nsight Systems - NVIDIA Developer Forums
Yes, thanks, that might be the problem, but everything is being done by scripted commands on a remote cloud machine, which makes it trickier to get working.
OK, I got it to work (just for the record) with this command:
docker run --cap-add SYS_ADMIN --rm -t -v ${DATA_DIR}:/home --workdir /home --gpus all nvidia/cuda:11.1.1-devel-ubuntu20.04 ncu -o profile --target-processes all ufo-tesla
So the --cap-add SYS_ADMIN was the important bit; it then ran fine and generated a profile.ncu-rep file. My next question is: what do you do with the profile.ncu-rep file?
Open it in the Nsight-Compute GUI from your desktop, or from the command line via "ncu -i profile.ncu-rep".
I prefer the GUI, but if you do use the command line, I suggest redirecting the output to a text file given it can be rather lengthy.
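For example, something along these lines:
ncu -i profile.ncu-rep > profile.txt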
Brilliant, that worked - I used the command-line mode to generate a text file.
It seems to generate a block of text for each iteration of the solver (100 iterations).
So here is the data it generated for the 100th iteration…
The mesh size was 21 million cells.
There are a couple of warnings which look important.
Can you help me to understand what they mean?
flow_solve_958_gpu__red (512, 1, 1)x(256, 1, 1), Context 1, Stream 13, Device 0, CC 7.5
Section: GPU Speed Of Light
---------------- ------------- ------------
Metric Name Metric Unit Metric Value
---------------- ------------- ------------
DRAM Frequency cycle/nsecond 5.00
SM Frequency cycle/usecond 585.50
Elapsed Cycles cycle 88718
SOL Memory % 1.56
SOL DRAM % 1.56
Duration usecond 151.52
SOL L1/TEX Cache % 9.37
SOL L2 Cache % 0.43
SM Active Cycles cycle 4372.05
SOL SM % 0.31
---------------- ------------- ------------
WRN This kernel grid is too small to fill the available resources on this device. Look at Launch Statistics for
more details.
Section: Launch Statistics
-------------------------------- --------------- ------------
Metric Name Metric Unit Metric Value
-------------------------------- --------------- ------------
Block Size 256
Grid Size 2
Registers Per Thread register/thread 16
Shared Memory Configuration Size Kbyte 32.77
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block Kbyte/block 1.02
Static Shared Memory Per Block byte/block 0
Threads thread 512
Waves Per SM 0.01
-------------------------------- --------------- ------------
WRN The grid for this launch is configured to execute only 2 blocks, which is less than the GPU's 40
multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel
concurrently with other workloads, consider reducing the block size to have at least one block per
multiprocessor or increase the size of the grid to fully utilize the available hardware resources.
Section: Occupancy
------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM block 16
Block Limit Registers block 16
Block Limit Shared Mem block 64
Block Limit Warps block 4
Theoretical Active Warps per SM warp 32
Theoretical Occupancy % 100
Achieved Occupancy % 24.83
Achieved Active Warps Per SM warp 7.94
------------------------------- ----------- ------------
It means that there’s not enough work in this kernel to make effective use of the GPU.
However, this is the compiler-generated kernel that does the final reductions (i.e. it ends in "_red"), so it is not expected to fully use the GPU. You can ignore it.
Ok, but how does all this profile data help me to improve the speedup?
That’s probably too big a topic for the forum, so I suggest looking at some of the training materials and videos available at: Getting Started with Nsight Compute | NVIDIA Developer
Doing a web search for "nsight compute tutorial" will give you several training videos.
The key things I look for:
What are the occupancy and register usage? If register usage is high (over 32 or 64 registers per thread), can it be reduced by removing local variables or splitting the kernel into multiple kernels?
Does the achieved occupancy match the theoretical? If not, what is causing the warp stalls? Are they stalled waiting for memory or stalled waiting for another pipeline such as the floating point units?
What is the cache hit percentage? I.e., is there reuse of data so it stays in cache, or is the code streaming data? If streaming, is it achieving peak memory bandwidth performance?
Is the workload big enough to fully utilize the GPU?
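Once you know which kernels dominate the runtime, you can also re-run ncu with more detail on just those kernels. As a rough sketch (the regex kernel filter, --set full and --launch-count values here are assumptions you would adjust to your kernel names and ncu version):
docker run --cap-add SYS_ADMIN --rm -t -v ${DATA_DIR}:/home --workdir /home --gpus all nvidia/cuda:11.1.1-devel-ubuntu20.04 ncu --set full -k regex:flow_solve --launch-count 1 -o profile_full --target-processes all ufo-tesla
The --launch-count 1 limits profiling to a single instance of each matching kernel, so the run stays quick even with 100 iterations.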