Error if "private" not on same line as "parallel loop"

Yes - I am trying to run ncu as part of the remote script; I just need to check if/where ncu is installed on the remote system. Then I can use the command:
ncu -o profile ./ufo-tesla
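
I was planning to check with something along these lines (assuming ncu ships with the CUDA toolkit on that image - the exact path is a guess):

which ncu
ls /usr/local/cuda/bin/ | grep ncu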

My colleague has found that this command is needed in order to run ncu on the remote cloud machine…

docker run --rm -t -v ${DATA_DIR}:/home --workdir /home --gpus all nvidia/cuda:11.1.1-devel-ubuntu20.04 ncu -o profile --target-processes all ufo-tesla

However, when it runs I'm getting this error message just before the iterations start:

==PROF== Connected to process 38 (/home/ufo-tesla)
==ERROR== Error: ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM

and this error message when it finishes iterating…

==PROF== Disconnected from process 38
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.

Any idea how to get the profiling to work here?
(I just realised that it's related to permissions)

It says…

To allow access for any user, create a file with the .conf extension containing "options nvidia NVreg_RestrictProfilingToAdminUsers=0" in /etc/modprobe.d.
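
On the host itself, I assume that boils down to something like this (the file name is my own choice):

echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' | sudo tee /etc/modprobe.d/nvidia-profiling.conf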

It’s a bit over on the right, but the error message includes the link to get more information about enabling permissions to allow for profiling.

For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM

Yeah, thanks - we were trying to create an nvidia.conf file from within the docker command, like this…

docker run --rm -t -v ${DATA_DIR}:/home --workdir /home --gpus all nvidia/cuda:11.1.1-devel-ubuntu20.04 "echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' > /etc/modprobe.d/nvidia.conf && ncu -o profile --target-processes all ufo-tesla"

but we get a "no such file or directory" error.
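
My guess is that Docker treats the whole quoted string as the name of a single executable rather than as a shell command line, so it would need to be wrapped in a shell explicitly - an untested sketch:

docker run --rm -t -v ${DATA_DIR}:/home --workdir /home --gpus all nvidia/cuda:11.1.1-devel-ubuntu20.04 sh -c "echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' > /etc/modprobe.d/nvidia.conf && ncu -o profile --target-processes all ufo-tesla"

although even then the file would only be created inside the container, not on the host.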

Editing files in the /etc/modprobe.d directory does require root permissions and usually takes a reboot for the new configuration file to take effect. Hence, it doesn't seem like something you can set this way.

Though, this is out of my area, so I don’t really know. You might try asking over on the Nsight-Systems forum: Nsight Systems - NVIDIA Developer Forums

Yes, thanks - that might be the problem, but everything is being done by scripted commands on a remote cloud machine, which makes it trickier to get working.

OK, I got it to work (just for the record) with this command:

docker run --cap-add SYS_ADMIN --rm -t -v ${DATA_DIR}:/home --workdir /home --gpus all nvidia/cuda:11.1.1-devel-ubuntu20.04 ncu -o profile --target-processes all ufo-tesla

So the --cap-add SYS_ADMIN was the important bit; it then ran fine and has generated a profile.ncu-rep file. So my next question is, what do you do with the profile.ncu-rep file?

Open it in the Nsight-Compute GUI from your desktop, or from the command line via "ncu -i profile.ncu-rep".

I prefer the GUI, but if you do use the command line, I suggest you redirect the output to a text file, given it can be rather lengthy.
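
For example (the output file name is just an example):

ncu -i profile.ncu-rep > profile.txt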

Brilliant, that worked - I used the command-line mode to generate a text file.
It seems to generate a block of text for each iteration of the solver (100 iterations).

So here is the data it generated for the 100th iteration…
The mesh size was 21 million cells.
There are a couple of warnings which look important.
Can you help me understand what they mean?

  flow_solve_958_gpu__red (512, 1, 1)x(256, 1, 1), Context 1, Stream 13, Device 0, CC 7.5
    Section: GPU Speed Of Light
    ---------------- ------------- ------------
    Metric Name        Metric Unit Metric Value
    ---------------- ------------- ------------
    DRAM Frequency   cycle/nsecond         5.00
    SM Frequency     cycle/usecond       585.50
    Elapsed Cycles           cycle        88718
    SOL Memory                   %         1.56
    SOL DRAM                     %         1.56
    Duration               usecond       151.52
    SOL L1/TEX Cache             %         9.37
    SOL L2 Cache                 %         0.43
    SM Active Cycles         cycle      4372.05
    SOL SM                       %         0.31
    ---------------- ------------- ------------

    WRN   This kernel grid is too small to fill the available resources on this device. Look at Launch Statistics for   
          more details.                                                                                                 

    Section: Launch Statistics
    -------------------------------- --------------- ------------
    Metric Name                          Metric Unit Metric Value
    -------------------------------- --------------- ------------
    Block Size                                                256
    Grid Size                                                   2
    Registers Per Thread             register/thread           16
    Shared Memory Configuration Size           Kbyte        32.77
    Driver Shared Memory Per Block        byte/block            0
    Dynamic Shared Memory Per Block      Kbyte/block         1.02
    Static Shared Memory Per Block        byte/block            0
    Threads                                   thread          512
    Waves Per SM                                             0.01
    -------------------------------- --------------- ------------

    WRN   The grid for this launch is configured to execute only 2 blocks, which is less than the GPU's 40              
          multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel      
          concurrently with other workloads, consider reducing the block size to have at least one block per            
          multiprocessor or increase the size of the grid to fully utilize the available hardware resources.            

    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                        block           16
    Block Limit Registers                 block           16
    Block Limit Shared Mem                block           64
    Block Limit Warps                     block            4
    Theoretical Active Warps per SM        warp           32
    Theoretical Occupancy                     %          100
    Achieved Occupancy                        %        24.83
    Achieved Active Warps Per SM           warp         7.94
    ------------------------------- ----------- ------------

It means that there’s not enough work in this kernel to make effective use of the GPU.

However, this is the compiler-generated kernel that does the final reductions (i.e. it ends in "_red"), so it is not expected to fully use the GPU. You can ignore it.
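
If you want less clutter in the report, ncu can also restrict which kernels it profiles and how many launches it captures - something along these lines, where the kernel-name pattern and launch count are just examples (older ncu versions spell the filter --kernel-regex; check ncu --help for your version):

ncu -o profile -k regex:flow_solve -c 10 --target-processes all ufo-tesla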

Ok, but how does all this profile data help me to improve the speedup?

That’s probably too big a topic for the forum, so I suggest looking at some of the training materials and videos available at: Getting Started with Nsight Compute | NVIDIA Developer

Doing a web search for "nsight compute tutorial" will give you several training videos.

The key things I look for (with an example collection command after this list):

What is the occupancy and register usage? If register usage is high (over 32 or 64 registers per thread), can it be reduced by removing local variables or splitting the kernel into multiple kernels?

Does the achieved occupancy match the theoretical occupancy? If not, what is causing the warp stalls? Are they stalled waiting for memory, or waiting for another pipeline such as the floating-point units?

What is the cache hit percentage? I.e., is there reuse of data so it stays in cache, or is the code streaming data? If it is streaming, is it achieving peak memory bandwidth?

Is the workload big enough to fully utilize the GPU?
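
As a starting point for gathering those metrics, the "full" section set collects the occupancy, scheduler/warp-stall, and memory workload analysis in one report. It is noticeably slower than the default, so it is worth limiting the number of profiled launches - something like this, dropped into the same docker invocation you used above (flag spellings may vary with the ncu version):

ncu --set full -c 5 -o profile_full --target-processes all ufo-tesla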