Generating CUPTI_* tables with nsys

Hi,
I'm a new user of Nsight Systems. I've created a Docker container to run the command-line tool, nsys, on CentOS 7. Our system has two Tesla V100 GPUs.

The container was run in the following manner:

docker run --rm --gpus=all --cap-add=SYS_ADMIN --net=host -v $(pwd):/data -w /data -it centos-gpu-tools:latest bash

The nsys status command reports the following:

[root@syseng-2-dell-hpc gpu-burn]# nsys status -e
Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: enabled
Linux Kernel Paranoid Level = 2
Linux Distribution = CentOS
Linux Kernel Version = 3.10.0-1160.80.1.el7.x86_64: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): OK

See the product documentation at https://docs.nvidia.com/nsight-systems for more information,
including information on how to set the Linux Kernel Paranoid Level.

I run nsys with a test application, gpu_burn:

[root@syseng-2-dell-hpc gpu-burn]# nsys profile  -t cuda,nvtx,osrt,cublas -f true -o /data/results2 --gpu-metrics-device=all --cuda-memory-usage=true --export=sqlite ./gpu_burn 30
Burning for 30 seconds.
GPU 0: Tesla V100-PCIE-32GB (UUID: GPU-94dfee0f-03e6-52e2-bdb5-705f1c0f8b9f)
GPU 1: Tesla V100-PCIE-32GB (UUID: GPU-ccd7bd6d-e9bb-b57e-9ca0-7690deef2b6d)
Initialized device 0 with 32510 MB of memory (32052 MB available, using 28847 MB of it), using FLOATS
Results are 16777216 bytes each, thus performing 1800 iterations
Initialized device 1 with 32510 MB of memory (32052 MB available, using 28847 MB of it), using FLOATS
Results are 16777216 bytes each, thus performing 1800 iterations
16.7%  proc'd: 1800 (6691 Gflop/s) - 0 (0 Gflop/s)   errors: 0 - 0   temps: 28 C - 26 C 
	Summary at:   Mon Dec  5 16:14:24 UTC 2022

33.3%  proc'd: 5400 (12878 Gflop/s) - 3600 (12884 Gflop/s)   errors: 0 - 0   temps: 39 C - 39 C 
	Summary at:   Mon Dec  5 16:14:29 UTC 2022

50.0%  proc'd: 9000 (12876 Gflop/s) - 7200 (12920 Gflop/s)   errors: 0 - 0   temps: 42 C - 42 C 
	Summary at:   Mon Dec  5 16:14:34 UTC 2022

66.7%  proc'd: 12600 (12849 Gflop/s) - 10800 (12916 Gflop/s)   errors: 0 - 0   temps: 44 C - 43 C 
	Summary at:   Mon Dec  5 16:14:39 UTC 2022

80.0%  proc'd: 14400 (12836 Gflop/s) - 16200 (12866 Gflop/s)   errors: 0 - 0   temps: 46 C - 46 C 
	Summary at:   Mon Dec  5 16:14:43 UTC 2022

96.7%  proc'd: 19800 (12879 Gflop/s) - 18000 (12852 Gflop/s)   errors: 0 - 0   temps: 48 C - 49 C 
	Summary at:   Mon Dec  5 16:14:48 UTC 2022

100.0%  proc'd: 19800 (12879 Gflop/s) - 19800 (12731 Gflop/s)   errors: 0 - 0   temps: 48 C - 49 C 
Killing processes.. Freed memory for dev 0
Uninitted cublas
Freed memory for dev 1
Uninitted cublas
done

Tested 2 GPUs:
	GPU 0: OK
	GPU 1: OK
Generating '/tmp/nsys-report-cdc5.qdstrm'
[1/2] [========================100%] results2.nsys-rep
[2/2] [========================100%] results2.sqlite
Generated:
    /data/results2.nsys-rep
    /data/results2.sqlite

The following are the tables in the sqlite3 database:

-bash-4.2$ sqlite3 results2.sqlite
SQLite version 3.7.17 2013-05-20 00:56:22
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .tables
ANALYSIS_DETAILS                     ENUM_OPENMP_MUTEX                  
COMPOSITE_EVENTS                     ENUM_OPENMP_SYNC_REGION            
ENUM_CUDA_DEV_MEM_EVENT_OPER         ENUM_OPENMP_TASK_FLAG              
ENUM_CUDA_FUNC_CACHE_CONFIG          ENUM_OPENMP_TASK_STATUS            
ENUM_CUDA_KRENEL_LAUNCH_TYPE         ENUM_OPENMP_THREAD                 
ENUM_CUDA_MEMCPY_OPER                ENUM_OPENMP_WORK                   
ENUM_CUDA_MEMPOOL_OPER               ENUM_SAMPLING_THREAD_STATE         
ENUM_CUDA_MEMPOOL_TYPE               ENUM_SLI_TRANSER                   
ENUM_CUDA_MEM_KIND                   ENUM_STACK_UNWIND_METHOD           
ENUM_CUDA_SHARED_MEM_LIMIT_CONFIG    ENUM_VULKAN_PIPELINE_CREATION_FLAGS
ENUM_CUDA_UNIF_MEM_ACCESS_TYPE       ENUM_WDDM_ENGINE_TYPE              
ENUM_CUDA_UNIF_MEM_MIGRATION         ENUM_WDDM_INTERRUPT_TYPE           
ENUM_CUPTI_STREAM_TYPE               ENUM_WDDM_PACKET_TYPE              
ENUM_CUPTI_SYNC_TYPE                 ENUM_WDDM_PAGING_QUEUE_TYPE        
ENUM_D3D12_CMD_LIST_TYPE             ENUM_WDDM_VIDMM_OP_TYPE            
ENUM_D3D12_HEAP_FLAGS                EXPORT_META_DATA                   
ENUM_D3D12_HEAP_TYPE                 NVTX_EVENTS                        
ENUM_D3D12_PAGE_PROPERTY             OSRT_API                           
ENUM_DXGI_FORMAT                     OSRT_CALLCHAINS                    
ENUM_GPU_CTX_SWITCH                  PROCESSES                          
ENUM_NSYS_EVENT_CLASS                PROFILER_OVERHEAD                  
ENUM_NSYS_EVENT_TYPE                 ProcessStreams                     
ENUM_NVDRIVER_EVENT_ID               SAMPLING_CALLCHAINS                
ENUM_OPENACC_DEVICE                  SCHED_EVENTS                       
ENUM_OPENACC_EVENT_KIND              StringIds                          
ENUM_OPENGL_DEBUG_SEVERITY           TARGET_INFO_GPU                    
ENUM_OPENGL_DEBUG_SOURCE             TARGET_INFO_SESSION_START_TIME     
ENUM_OPENGL_DEBUG_TYPE               TARGET_INFO_SYSTEM_ENV             
ENUM_OPENMP_DISPATCH                 ThreadNames                        
ENUM_OPENMP_EVENT_KIND               UnwindMethodType                   
sqlite> 

I was expecting the following tables to be available: CUPTI_ACTIVITY_KIND_MEMCPY, CUDA_GPU_MEMORY_USAGE_EVENTS and CUPTI_ACTIVITY_KIND_KERNEL.

So the question is: how do I get these tables? I'm assuming I've missed something, but I'm not sure what.
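For anyone checking the same thing, a query like this against the export lists whichever of those tables exist (it returns nothing for my results2.sqlite):

sqlite3 results2.sqlite "SELECT name FROM sqlite_master WHERE type='table' AND name IN ('CUPTI_ACTIVITY_KIND_KERNEL', 'CUPTI_ACTIVITY_KIND_MEMCPY', 'CUDA_GPU_MEMORY_USAGE_EVENTS');"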

Any help would be greatly appreciated.
Tony

That looks like the CUDA tables are missing, but I don't know whether they are missing from the results file or were lost during the export.

Can you open the .nsys-rep file in the Nsight Systems GUI and tell me if you see CUDA there? You can also look in the diagnostics drop down and see if it detected any CUDA events.

Yup, you're right. I have a lot of the following:

Warning	Analysis	56	00:00.252	
Not all NVTX events might have been collected.
Warning	Analysis	56	00:00.252	
No NVTX events collected. Does the process use NVTX?
Warning	Analysis	74	00:00.252	
CUDA profiling might have not been started correctly.
Warning	Analysis	74	00:00.252	
No CUDA events collected. Does the process use CUDA?

The application I’m using is from:

https://github.com/wilicc/gpu-burn

T

I noticed the soft link /usr/local/cuda/cuda-11.8 was broken (flashing red).
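In case it's useful, a quick way to list which CUDA links are dangling (paths guessed from the usual /usr/local layout):

ls -l /usr/local/cuda*                      # list the CUDA directories and symlinks
find /usr/local/cuda* -maxdepth 1 -xtype l  # -xtype l reports symlinks whose target is missing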

The following is my Dockerfile, in case this helps:

FROM nvidia/cuda:11.8.0-cudnn8-devel-centos7 as base

FROM base as base-amd64

ENV NV_CUDNN_VERSION 8.6.0.163-1
ENV NV_CUDNN_PACKAGE libcudnn8-${NV_CUDNN_VERSION}.cuda11.8
ENV NV_CUDNN_PACKAGE_DEV libcudnn8-devel-${NV_CUDNN_VERSION}.cuda11.8

FROM base-amd64

LABEL maintainer "NVIDIA CORPORATION <sw-cuda-installer@nvidia.com>"

LABEL com.nvidia.cudnn.version="${NV_CUDNN_VERSION}"

RUN yum install -y \
    ${NV_CUDNN_PACKAGE} \
    ${NV_CUDNN_PACKAGE_DEV} \
    && yum clean all \
    && rm -rf /var/cache/yum/*

# TN adding...
RUN yum install -y centos-release-scl
RUN yum install -y devtoolset-9
RUN yum install -y git
RUN echo "source /opt/rh/devtoolset-9/enable" >> /etc/bashrc

RUN yum install -y libSM
RUN yum install -y libglvnd-opengl
RUN yum install -y libxcb xcb-util-wm xcb-util-image xcb-util-renderutil xcb-util-keysyms libxkbcommon libxkbcommon-x11
RUN yum install -y libXcomposite libXcursor libXi libXtst libXrandr alsa-lib mesa-libEGL libXdamage mesa-libGL
RUN yum install -y fontconfig

ADD nsight-systems-2022.5.1-2022.5.1.82_3207805-0.x86_64.rpm /opt
RUN cd /opt && \
    rpm -iv nsight-systems-2022.5.1-2022.5.1.82_3207805-0.x86_64.rpm
RUN rm -rf /opt/nsight-systems-2022.5.1-2022.5.1.82_3207805-0.x86_64.rpm

# From https://lambdalabs.com/blog/perform-gpu-and-cpu-stress-testing-on-linux
SHELL ["/bin/bash", "--login", "-c"]
RUN cd /opt && \
    git clone https://github.com/wilicc/gpu-burn && \
    cd gpu-burn && \
    make
ENV PATH=/opt/gpu-burn:${PATH}
ENV LD_LIBRARY_PATH=/usr/local/cuda/compat:/usr/local/cuda/lib64:${LD_LIBRARY_PATH}

CMD [ "nsys", "bash"]


Anything else I should look at?
T

Could my problem be that GPU Metrics are not supported on the Tesla V100? My system has two V100s:

[root@syseng-2-dell-hpc gpu-burn]# nsys profile  -t cuda,nvtx,osrt,cublas -f true -o /data/results2 --gpu-metrics-device=0 --cuda-memory-usage=true --export=sqlite ./gpu_burn 10
Illegal --gpu-metrics-device argument: 0.
Feature is not available on the specified set of GPUs. See https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gms-introduction.
Use the '--gpu-metrics-device=help' switch to see the full list of values.

usage: nsys profile [<args>] [application] [<application args>]
Try 'nsys profile --help' for more information.
[root@syseng-2-dell-hpc gpu-burn]# nsys profile  -t cuda,nvtx,osrt,cublas -f true -o /data/results2 --gpu-metrics-device=1 --cuda-memory-usage=true --export=sqlite ./gpu_burn 10
Illegal --gpu-metrics-device argument: 1.
Feature is not available on the specified set of GPUs. See https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gms-introduction.
Use the '--gpu-metrics-device=help' switch to see the full list of values.

usage: nsys profile [<args>] [application] [<application args>]
Try 'nsys profile --help' for more information.
[root@syseng-2-dell-hpc gpu-burn]# nsys profile  -t cuda,nvtx,osrt,cublas -f true -o /data/results2 --gpu-metrics-device=2 --cuda-memory-usage=true --export=sqlite ./gpu_burn 10
Illegal --gpu-metrics-device argument: 2.
No such GPU: 2.
Use the '--gpu-metrics-device=help' switch to see the full list of values.

usage: nsys profile [<args>] [application] [<application args>]
Try 'nsys profile --help' for more information.

Nsight Systems GPU Metrics is only available for Linux targets on x86-64 and aarch64, and for Windows targets. It requires NVIDIA Turing architecture or newer.

Minimum required driver versions:

  • NVIDIA Turing architecture TU10x, TU11x - r440
  • NVIDIA Ampere architecture GA100 - r450
  • NVIDIA Ampere architecture GA100 MIG - r470 TRD1
  • NVIDIA Ampere architecture GA10x - r455

T.

That will stop you from getting GPU metrics, but you should still have the CUDA information. Yet you don't have any CUDA tables, and the diagnostics say no CUDA events were collected.

Okay, what version of nsys are you looking at? What was the exact command line you used?

Version:
NVIDIA Nsight Systems version 2022.5.1.82-32078057v0

Command:
nsys profile -t cuda,nvtx,osrt,cublas -f true -o /data/results2 --gpu-metrics-device=all --cuda-memory-usage=true --export=sqlite ./gpu_burn 10

How long did profiling run (how long did the “gpu_burn 10” take)?

@skottapalli can you take a look at this?

gpu_burn 10 runs for 10 seconds.

Sometimes we see things like this when the application is so brief that we miss the kernels, but that isn’t the case here.

Hi tniro,

Could you try just the following command?

nsys profile -t cuda -s none --cpuctxsw=none -f true -o /data/results2 ./gpu_burn 10

This will collect just the CUDA traces (on CPU and GPU side). Could you share the report file (privately, if needed)?
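If it is easier to inspect, you can also add the export switch from your earlier command line to get the SQLite tables for this run (the output name is just a placeholder):

nsys profile -t cuda -s none --cpuctxsw=none -f true --export=sqlite -o /data/results_cuda_only ./gpu_burn 10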

What is the output of the nvidia-smi command on the host system?

Unfortunately, I'm on vacation until the new year. I had some issues with the CentOS container I was using (see the Dockerfile above) when the host driver was updated from 520 to 525: running the application (gpu_burn) in an Ubuntu container worked, but in the CentOS container it stopped working after the driver update. I'll be able to do more investigation once I'm back from vacation. Here is the current nvidia-smi output:

nvidia-smi
Mon Dec 19 09:30:07 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:25:00.0 Off |                    0 |
| N/A   35C    P0    37W / 250W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:E2:00.0 Off |                    0 |
| N/A   34C    P0    36W / 250W |      0MiB / 32768MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                  |
+-----------------------------------------------------------------------------+

T.

Back from vacation. I’ve run with the following command:

 nsys profile  -t cuda,nvtx,osrt,cublas -f true -o /data/results2 --gpu-metrics-device=all --cuda-memory-usage=true --export=sqlite ./gpu_burn 30

I’ve attached the report file:
results2.nsys-rep (10.3 MB)

From the report, it looks like you used the command I requested:
nsys profile -t cuda -s none --cpuctxsw=none -f true -o /data/results2 ./gpu_burn 30

CUDA kernels and API calls are present in the report you shared. What is the problem you are facing now? If you add the other features (-t cuda,nvtx,osrt,cublas) back to the command line, are the CUDA kernels missing from the report? If so, we will need to isolate which feature is actually causing the problem. Please try different combinations: for example, profile with -t cuda,nvtx first and, if that works, profile with -t cuda,nvtx,osrt, and so on (see the sketch below for one way to automate this).
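As a sketch of what I mean (the flags are taken from your command lines; the output paths are placeholders, and the check assumes the sqlite3 CLI is available):

# Run from the gpu-burn directory, as in the commands above.
for t in cuda cuda,nvtx cuda,nvtx,osrt cuda,nvtx,osrt,cublas; do
    out="/data/combo_${t//,/_}"
    nsys profile -t "$t" -f true -o "$out" --export=sqlite ./gpu_burn 10
    # sqlite_master lists the tables in the export; 1 means the kernel table exists.
    n=$(sqlite3 "${out}.sqlite" "SELECT count(*) FROM sqlite_master WHERE name='CUPTI_ACTIVITY_KIND_KERNEL';")
    echo "$t -> CUPTI_ACTIVITY_KIND_KERNEL present: $n"
done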


If I just specify -t cuda, then I get the expected CUPTI tables. However, when I add any of the other features alongside cuda, the tables don't show up in the SQLite file. I tried "-t cuda,nvtx", "-t cuda,osrt", "-t cuda,cublas", and not specifying the -t option at all (the default).

nvlog.config.template (648 Bytes)
I see. Thanks for the update. Could you try the options without cuda and check whether you get any events in the report? It is possible that there is a bug in the CUDA tracing feature when it is combined with the other options. We will need to reproduce it on our end to investigate further.

Could you help collect logs when you see the problem?

  1. Save the nvlog.config.template file attached above to the target system.
  2. Add the CLI switch -e NVLOG_CONFIG_FILE=/full/path/to/nvlog.config.template to your command line (see the example below).
  3. Run the collection.
  4. Share the report file and the nsys-ui.log file that gets created.
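For example, with the paths adjusted to your setup (the output name is just a placeholder):

nsys profile -t cuda,nvtx,osrt,cublas -f true -o /data/results_logged --export=sqlite -e NVLOG_CONFIG_FILE=/data/nvlog.config.template ./gpu_burn 30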

Running the following:
nsys profile -t nvtx,osrt,cublas -s none --cpuctxsw=none -f true -o /rockshare/user/tniro/db/nvidia ./gpu_burn 30

Results attached:
nvidia.nsys-rep (350.6 KB)

I also tried using 2022.4.1.21 in my CentOS 7 container. Same issues.

NVIDIA has a number of containers with Nsight Systems installed, but I can't use them since they are Ubuntu-based and nsys reports failures when the container runs on a CentOS 7 host, the kernel being too old. For example, using nvcr.io/nvidia/mxnet:22.12-py3 on CentOS 7, I run:

root@f3a510160b6b:/workspace# nsys status -e
Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: enabled
Linux Kernel Paranoid Level = 2
Linux Distribution = Ubuntu
Linux Kernel Version = 3.10.0-1160.81.1.el7.x86_64: Fail
Linux perf_event_open syscall available: Fail
Sampling trigger event available: Fail
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): Fail
CPU Profiling Environment (system-wide): Fail

See the product documentation at https://docs.nvidia.com/nsight-systems for more information,
including information on how to set the Linux Kernel Paranoid Level.

I wonder if we should convert to Ubuntu? ;-)

The failures indicated in the nsys status output on the CentOS 7 host mean that the --sampling true feature will not work (switching to a newer kernel will address this). The rest of the features should work even on the older OS kernel.
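In other words, a command like the cuda-only one you already ran, with sampling and context-switch tracing explicitly off, is unaffected by the older kernel:

nsys profile -t cuda -s none --cpuctxsw=none -f true --export=sqlite -o /data/results2 ./gpu_burn 30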

The main problem in the Docker container running on your machine is that CUDA traces are collected only when tracing is limited to CUDA. It sounds like a bug that my team needs to track down. How can we reproduce it on our end?

Also, could you help me by collecting the logs as I mentioned in my previous reply?
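For reference, my working notes on the repro, condensed from this thread (the image, paths, and rpm are the ones you listed above):

# On a CentOS 7 host (kernel 3.10, driver 525.60.13) with two Tesla V100s,
# using the image built from the Dockerfile above:
docker run --rm --gpus=all --cap-add=SYS_ADMIN --net=host -v $(pwd):/data -w /data -it centos-gpu-tools:latest bash

# Inside the container (the Dockerfile builds gpu-burn in /opt/gpu-burn):
cd /opt/gpu-burn
nsys profile -t cuda,nvtx -f true -o /data/repro --export=sqlite ./gpu_burn 10
# Reported behavior: the CUPTI_* tables appear only when -t is exactly "cuda".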