Nsight Compute profile run with nan value in Multi-Process Service (MPS)

viku12cool · January 5, 2023, 9:51am

Hello,
I am running ncu under the Multi-Process Service (MPS). The ncu command line is:
ncu --metrics gpc__cycles_elapsed.max application
The run reports nan for this metric.

The details of the GPU are as follows.
GPU card: NVIDIA A40
Driver version: 515.65.01
CUDA Version: 11.7

To start the MPS, I have used the following commands:
export CUDA_VISIBLE_DEVICES="0"
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d

However, when I run the application without MPS, it gives the correct profiling results.

Any help would be appreciated.

felix_dt · January 5, 2023, 9:56am

MPS is not supported by Nsight Compute; see the “Profiling and Metrics” section under the known issues.
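
Since the run profiles correctly without MPS, one workaround is to shut the MPS control daemon down for the duration of the ncu run and restart it afterwards. Below is a minimal sketch of that, assuming GPU 0 and the start commands from the first post. The `DRY_RUN` switch and the `run`, `mps_start`, and `mps_stop` helpers are hypothetical names introduced for illustration; with `DRY_RUN=1` (the default here) the commands are only printed, not executed.

```shell
#!/bin/bash
# Sketch: stop MPS, profile with ncu, then start MPS again.
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to execute them.
# Changing the compute mode with nvidia-smi normally requires root.
DRY_RUN="${DRY_RUN:-1}"
GPU_ID="${GPU_ID:-0}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Same start sequence as in the first post.
mps_start() {
  export CUDA_VISIBLE_DEVICES="$GPU_ID"
  run nvidia-smi -i "$GPU_ID" -c EXCLUSIVE_PROCESS
  run nvidia-cuda-mps-control -d
}

# The control daemon reads commands on stdin; "quit" shuts it down.
mps_stop() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ echo quit | nvidia-cuda-mps-control"
  else
    echo quit | nvidia-cuda-mps-control
  fi
  run nvidia-smi -i "$GPU_ID" -c DEFAULT
}

mps_stop
run ncu --metrics gpc__cycles_elapsed.max application
mps_start
```

This only helps when the kernel of interest can be profiled in isolation; it does not make ncu work while other MPS clients are running.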

logg72 · July 11, 2024, 6:09am

If I want to profile L2 cache hit rate and memory throughput with MPS, how can I track this information?

Andrey_Trachenko · July 11, 2024, 1:54pm

Nsight Systems should be able to collect GPU metrics for the whole GPU regardless of MPS. The “L2 hit rate” metric is available in the metric set called “Graphics Throughput Metrics for NVIDIA GA10x (frequency >= 10kHz)”.

To use that, please start with the following command:

nsys profile --gpu-metrics-device=all --gpu-metrics-set=ga10x-gfxt ./myApp
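
Before running this, it can help to check which GPUs and metric sets the installed nsys actually offers. The sketch below assumes a recent Nsight Systems release in which both options accept `help`; if your version does not, the queries will simply print an error. `build_nsys_cmd` is a hypothetical helper that only prints the command line so it can be reviewed first.

```shell
#!/bin/bash
# build_nsys_cmd DEVICE SET APP [ARGS...]
# Print (do not run) the nsys invocation for whole-GPU metric sampling.
build_nsys_cmd() {
  local device="$1" set="$2"
  shift 2
  echo nsys profile "--gpu-metrics-device=$device" "--gpu-metrics-set=$set" "$@"
}

if command -v nsys >/dev/null 2>&1; then
  # Assumption: both options accept "help" in the installed version.
  nsys profile --gpu-metrics-device=help || true   # GPUs usable for sampling
  nsys profile --gpu-metrics-set=help || true      # available metric-set aliases
else
  echo "nsys not found on PATH; skipping the help queries" >&2
fi

# The command from the post, printed for review.
build_nsys_cmd all ga10x-gfxt ./myApp
```

If a GPU is missing from the device list, the help output is a reasonable first place to look for the reason.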

logg72 · July 12, 2024, 1:09am

I turned on MPS and started multiple processes in parallel, but ran into problems. Below is the bash script that runs the program in parallel.

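#!/bin/bash
file_name=./mps_no_hooking
sm_num=$1   # first argument: SM count (read but not used below)
ps_num=$2   # second argument: number of processes to launch

# Start one nsys session per process, each in the background.
for i in $(seq 1 $ps_num); do
  sudo nsys profile --gpu-metrics-device=0 --gpu-metrics-set=ga10x-gfxt "$file_name" "$i" &
done

# Block until every background job has finished.
wait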

Then this error appears:

user@vandal:~/libsmctrl$ sudo nsys profile --output=report1.qdrep --trace=cuda,nvtx,osrt --cuda-memory-usage=true --capture-range=cudaProfilerApi --capture-range-end=stop ./exec_process.sh 2
SM NO: 1
PID: 25944
L2 cache size: -812471192
/home/user/libsmctrl/./mps_no_hooking:libsmctrl.c:212: Error subscribing to launch callback. CUDA returned error code 999.
Generating '/tmp/nsys-report-a76e.qdstrm'
[1/1] [========================100%] report6.nsys-rep
Generated:
    /home/user/libsmctrl/report6.nsys-rep
Generated:

The L2 cache size line is printed by my own code, but it shows a wrong (negative) value, and I don’t know why.
Please let me know if you have a solution for this issue. Thank you!
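
One possible restructuring of the launcher, offered as a sketch rather than a confirmed fix: the script in this post starts one nsys session per process, all sampling GPU metrics on device 0 at the same time. Assuming that GPU metric sampling is limited to one collector per GPU at a time (worth checking against the Nsight Systems documentation for your version), an alternative is to let the script start only the application processes and wrap the whole script in a single nsys session. `launch_workers`, the `FILE_NAME` variable, and the file name `launch.sh` are hypothetical.

```shell
#!/bin/bash
# launch_workers BIN COUNT
# Start COUNT copies of BIN in parallel, passing each its 1-based index,
# then wait for all of them to finish.
launch_workers() {
  local bin="$1" count="$2" i
  for i in $(seq 1 "$count"); do
    "$bin" "$i" &
  done
  wait
}

# Run only when a process count is given on the command line.
if [ "$#" -ge 1 ]; then
  launch_workers "${FILE_NAME:-./mps_no_hooking}" "$1"
fi
```

Saved as launch.sh, this could then be profiled once with `nsys profile --gpu-metrics-device=0 --gpu-metrics-set=ga10x-gfxt ./launch.sh 2`. This does not address the libsmctrl error 999 shown in the log, which would need to be investigated separately.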

logg72 · July 12, 2024, 8:11am

In addition, I have one more question.
The server GPU is a GeForce RTX 3090, the Nsight Systems version is 2024.4.1, and the server runs Ubuntu 18.04.6 LTS.
I want to collect GPU metrics such as L2 cache hit rate, but there is a problem:
the GPU device doesn’t show up in the list of ‘GPUs’.

logg72 · July 16, 2024, 4:44am

Please answer this question…
I have been trying to solve this problem for 3 days without success.

veraj · July 18, 2024, 6:27am

Hi, @logg72

Your “GPU device doesn’t show in the list of ‘GPUs’” problem seems to be a setup or usage issue.
Can you please start a new topic in the “Nsight Systems” forum directly to get help?
Thanks!

logg72 · July 18, 2024, 7:27am

Thank you! Fortunately, I solved this problem by creating a file with a .conf extension containing the line options nvidia NVreg_RestrictProfilingToAdminUsers=0.
But another problem has occurred. I want to use the libsmctrl library (libsmctrl_set_next_mask), but Nsight Systems doesn’t work with it. Do you know the reason?
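
For anyone hitting the same missing-GPU problem, the fix described above can be sketched as follows. The post does not say where the file was created; this sketch assumes the usual location for kernel module options, /etc/modprobe.d, and the file name `nvidia-profiling.conf` and the helper `write_profiling_conf` are hypothetical. The example writes to a scratch directory so it can be tried without touching the system.

```shell
#!/bin/bash
# write_profiling_conf DIR
# Write the module option quoted in the post into DIR/nvidia-profiling.conf.
# The file name is a hypothetical choice; any name ending in .conf should do.
write_profiling_conf() {
  local dir="$1"
  printf '%s\n' 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' \
    > "$dir/nvidia-profiling.conf"
}

# Try it against a scratch directory first; nothing under /etc is touched.
scratch="$(mktemp -d)"
write_profiling_conf "$scratch"
cat "$scratch/nvidia-profiling.conf"
```

To apply it for real, the same file would go into /etc/modprobe.d as root, and the nvidia kernel module has to be reloaded (typically a reboot) before the option takes effect. On Ubuntu, regenerating the initramfs with `sudo update-initramfs -u -k all` before rebooting may also be required.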

veraj · July 18, 2024, 7:53am

Sorry, I am not familiar with Nsight Systems usage. Please ask in the Nsight Systems forum directly.

