SQLite file does not contain CUDA kernel data

Hi everyone,

I am puzzled as to why I cannot get Nsight Systems to work properly. It’s my first time using the profiler and posting here, so excuse me if the question turns out to be banal. I would be very glad if I could get some help.

I am trying to use nsys to analyze my code; however, it reports that my SQLite file doesn't contain CUDA kernel data, even though my code does contain a kernel function.

I get the following output:

en@en-R9000p:~/document/CUDA$ nsys profile --stats=true kernel_abc
Warning: LBR backtrace method is not supported on this platform. DWARF backtrace method will be used.
Generating '/tmp/nsys-report-de63.qdstrm'
[1/8] [========================100%] report4.nsys-rep
[2/8] [========================100%] report4.sqlite
[3/8] Executing 'nvtxsum' stats report
SKIPPED: /home/en/document/CUDA/report4.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrtsum' stats report
Operating System Runtime API Statistics:
 Time (%)  Total Time (ns)  Num Calls   Avg (ns)    Med (ns)   Min (ns)  Max (ns)  StdDev (ns)       Name     
 --------  ---------------  ---------  ----------  ----------  --------  --------  -----------  --------------
     64.6        157648308          4  39412077.0  35653315.5  28684201  57657476   13650714.7  poll          
     34.8         84933509        222    382583.4     66792.5      1944  13132284    1139266.4  ioctl         
      0.2           556706          2    278353.0    278353.0     56138    500568     314259.5  sem_timedwait 
      0.2           492632          4    123158.0      1462.5      1122    488585     243618.1  read          
      0.1           365409          3    121803.0     49074.0     46078    270257     128573.7  pthread_create
      0.0            85845         16      5365.3      4558.5      1243     15229       3517.0  mmap          
      0.0            42051          7      6007.3      4699.0      2164     13827       4292.3  fopen         
      0.0            38094          3     12698.0     12274.0     10971     14849       1973.5  write         
      0.0            29577          1     29577.0     29577.0     29577     29577          0.0  fgets         
      0.0            15149          3      5049.7      4489.0      2905      7755       2473.1  open          
      0.0             5460          3      1820.0      1673.0      1563      2224        354.2  munmap        
      0.0             5310          3      1770.0      1753.0      1573      1984        206.0  fclose        
      0.0             4097          3      1365.7      1252.0      1162      1683        278.5  fcntl         

[5/8] Executing 'cudaapisum' stats report
SKIPPED: /home/en/document/CUDA/report4.sqlite does not contain CUDA trace data.
[6/8] Executing 'gpukernsum' stats report
SKIPPED: /home/en/document/CUDA/report4.sqlite does not contain CUDA kernel data.
[7/8] Executing 'gpumemtimesum' stats report
SKIPPED: /home/en/document/CUDA/report4.sqlite does not contain GPU memory data.
[8/8] Executing 'gpumemsizesum' stats report
SKIPPED: /home/en/document/CUDA/report4.sqlite does not contain GPU memory data.
Generated:
    /home/en/document/CUDA/report4.nsys-rep
    /home/en/document/CUDA/report4.sqlite
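
As far as I understand, CUDA tracing is enabled by default in nsys, but it can also be requested explicitly, for example:

nsys profile --trace=cuda,osrt --stats=true ./kernel_abc

(I have not verified that this changes the outcome; it is just for completeness.)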

The source code is kernel_abc.cu from the NERSC roofline-on-nvidia-gpus repository:
NERSC/roofline-on-nvidia-gpus/-/blob/master/example-codes/kernel_abc.cu

I compiled it with a plain nvcc invocation:

nvcc -o kernel_abc kernel_abc.cu 
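
In case it is relevant: a kernel launch can fail silently, which would also leave the kernel trace empty. Here is a minimal sketch of how one can check for that with the CUDA runtime error API (dummyKernel is just a placeholder for the real kernel):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel() { }  // placeholder for the real kernel

int main() {
    dummyKernel<<<1, 1>>>();

    // cudaGetLastError reports launch-time failures (bad configuration,
    // missing device code for this GPU architecture, ...)
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    // cudaDeviceSynchronize reports errors raised while the kernel runs
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("execution failed: %s\n", cudaGetErrorString(err));

    return 0;
}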

This is the output of nvidia-smi:

Tue Jul  5 15:23:19 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.57       Driver Version: 516.59       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| N/A   51C    P5    28W /  N/A |   1568MiB /  8192MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

This is the output of nsys --version:

NVIDIA Nsight Systems version 2022.1.3.3-1c7b5f7

Did you find a solution?

Sorry, I haven’t found it yet

Hi,
Have you managed to solve this?

This also happens to me. I am running on WSL2 with CUDA 12.1.

The CUDA API trace is generated only when the code uses unified memory; the other traces are skipped.
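
By unified memory I mean allocations made with cudaMallocManaged; a minimal sketch of what that looks like:

#include <cuda_runtime.h>

int main() {
    float *data = nullptr;

    // Managed (unified) memory: one allocation addressable from both
    // the host and the device
    cudaMallocManaged(&data, 1024 * sizeof(float));

    for (int i = 0; i < 1024; ++i)
        data[i] = 1.0f;  // touch the allocation on the host

    cudaFree(data);
    return 0;
}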

It looks like CUDA code running in WSL2 still cannot be profiled. I got the results after switching to a Linux workstation.

How did you get the results on Linux? I also encountered this issue under Linux; my CUDA version is also 12.1.

What driver are you using? I am on driver version 530.30.02, CUDA version 12.1, and NVIDIA Nsight Systems version 2023.1.2.43-32377213v0. My laptop runs Ubuntu 22.04 LTS; a Windows system running WSL2 didn't work for me. By the way, due to some conflicts, I recently reinstalled all CUDA-related software.

I am currently on driver 530.41.03 and CUDA cuda_12.1.r12.1, with the same version of Nsight Systems, on Arch Linux. It seems our setups are very similar, so the issue might be caused by my CUDA program itself. Thanks a lot!

I am having this problem too. In the online Jupyter notebook of my paid course I get output like

CUDA API Statistics:

 Time(%)  Total Time (ns)  Num Calls    Average      Minimum     Maximum            Name         
 -------  ---------------  ---------  ------------  ----------  ----------  ---------------------
    84.0       2283665392          1  2283665392.0  2283665392  2283665392  cudaDeviceSynchronize
    15.3        414896185          3   138298728.3       26049   414779033  cudaMallocManaged    
     0.8         20560958          3     6853652.7     6188919     8017388  cudaFree             
     0.0            44563          1       44563.0       44563       44563  cudaLaunchKernel     



CUDA Kernel Statistics:

 Time(%)  Total Time (ns)  Instances    Average      Minimum     Maximum                       Name                    
 -------  ---------------  ---------  ------------  ----------  ----------  -------------------------------------------
   100.0       2283652014          1  2283652014.0  2283652014  2283652014  addVectorsInto(float*, float*, float*, int)



CUDA Memory Operation Statistics (by time):

 Time(%)  Total Time (ns)  Operations  Average  Minimum  Maximum              Operation            
 -------  ---------------  ----------  -------  -------  -------  ---------------------------------
    76.5         68181937        2304  29592.9     1886   171196  [CUDA Unified Memory memcpy HtoD]
    23.5         20954999         768  27285.2     1119   159772  [CUDA Unified Memory memcpy DtoH]



CUDA Memory Operation Statistics (by size in KiB):

   Total     Operations  Average  Minimum  Maximum               Operation            
 ----------  ----------  -------  -------  --------  ---------------------------------
 393216.000        2304  170.667    4.000  1020.000  [CUDA Unified Memory memcpy HtoD]
 131072.000         768  170.667    4.000  1020.000  [CUDA Unified Memory memcpy DtoH]



Operating System Runtime API Statistics:

 Time(%)  Total Time (ns)  Num Calls   Average    Minimum   Maximum              Name           
 -------  ---------------  ---------  ----------  -------  ---------  --------------------------
    88.8       5418463947        272  19920823.3    62004  100131010  poll                      
     8.3        507130053        240   2113041.9    14965   20655744  sem_timedwait             
     2.4        147189885        685    214875.7     1001   25314454  ioctl                     
     0.4         25805056         98    263316.9     1313    7963935  mmap                      
     0.0          2502080         82     30513.2     6046      91996  open64                    
     0.0           167140          4     41785.0    33453      53511  pthread_create            
     0.0           165940          3     55313.3    53252      59362  fgets                     
     0.0           139880         11     12716.4     7972      20589  write                     
     0.0           130990         25      5239.6     1466      23609  fopen                     
     0.0           111850         79      1415.8     1145       4467  fcntl                     
     0.0            58487         11      5317.0     2588       8569  munmap                    
     0.0            34476          5      6895.2     3163       9508  open                      
     0.0            33060          7      4722.9     1066      10393  fgetc                     
     0.0            27412         18      1522.9     1034       4255  fclose                    
     0.0            22715          3      7571.7     1443      12373  pthread_rwlock_timedwrlock
     0.0            21202         11      1927.5     1055       3040  read                      
     0.0            18228          2      9114.0     8741       9487  socket                    
     0.0            17465          3      5821.7     1749      10799  fread                     
     0.0            15897          1     15897.0    15897      15897  pipe2                     
     0.0             8723          4      2180.8     2032       2529  mprotect                  
     0.0             8675          1      8675.0     8675       8675  connect                   
     0.0             3192          1      3192.0     3192       3192  bind                      
     0.0             2127          1      2127.0     2127       2127  listen

but then locally, it’s like

Success! All values calculated correctly.
Generating '/tmp/nsys-report-74de.qdstrm'
[1/8] [========================100%] report2.nsys-rep
[2/8] [========================100%] report2.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /mnt/d/cuda/cudalabs/lab2/report2.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)     Med (ns)    Min (ns)  Max (ns)   StdDev (ns)       Name     
 --------  ---------------  ---------  -----------  -----------  --------  ---------  -----------  --------------
     45.8        772205019          2  386102509.5  386102509.5   1415834  770789185  544029113.8  sem_wait      
     35.7        602113429          9   66901492.1   83893751.0   5071710  100200690   40811401.7  poll          
     16.5        279094689        454     614746.0      68460.0       360   33149778    2102861.3  ioctl         
      1.6         26677428         33     808406.9       6382.0      1834   10381817    2592525.5  mmap          
      0.2          3442039         13     264772.2     345086.0      1253     486376     186340.8  read          
      0.1          1177003          6     196167.2     187263.5    127462     271416      63786.1  mprotect      
      0.1           925811          2     462905.5     462905.5    367409     558402     135052.4  pthread_create
      0.0           684729          3     228243.0      68681.0     58862     557186     284915.3  sem_timedwait 
      0.0           377422         22      17155.5       3051.0       491     301774      63648.4  fopen         
      0.0           182817         12      15234.8       6758.0      1152     109618      29828.6  open          
      0.0            62740         65        965.2         50.0        40      34445       4512.6  fgets         
      0.0            45356          3      15118.7      15680.0     12895      16781       2002.9  write         
      0.0            17513          1      17513.0      17513.0     17513      17513          0.0  pipe2         
      0.0            14027         14       1001.9        937.0       511       1934        424.6  fclose        
      0.0             8849          7       1264.1       1092.0       130       4259       1371.9  fcntl         
      0.0             7084          5       1416.8       1082.0       131       3327       1341.0  fread         
      0.0             6162          2       3081.0       3081.0      1683       4479       1977.1  munmap        
      0.0             2906          1       2906.0       2906.0      2906       2906          0.0  fopen64       
      0.0              951          6        158.5         35.0        30        601        229.4  fflush        

[5/8] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)     Med (ns)   Min (ns)  Max (ns)   StdDev (ns)           Name         
 --------  ---------------  ---------  -----------  ----------  --------  ---------  -----------  ----------------------
     81.1        393023021          3  131007673.7  85131105.0  40859737  267032179  119862371.3  cudaMallocManaged     
      8.5         41113871          1   41113871.0  41113871.0  41113871   41113871          0.0  cudaDeviceSynchronize 
      7.5         36469454          3   12156484.7  11519306.0  11255830   13694318    1338302.4  cudaFree              
      2.3         11035440          1   11035440.0  11035440.0  11035440   11035440          0.0  cuLibraryLoadData     
      0.7          3252539          1    3252539.0   3252539.0   3252539    3252539          0.0  cudaLaunchKernel      
      0.0             3456          1       3456.0      3456.0      3456       3456          0.0  cuModuleGetLoadingMode

[6/8] Executing 'cuda_gpu_kern_sum' stats report
SKIPPED: /mnt/d/cuda/cudalabs/lab2/report2.sqlite does not contain CUDA kernel data.
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report
SKIPPED: /mnt/d/cuda/cudalabs/lab2/report2.sqlite does not contain GPU memory data.
[8/8] Executing 'cuda_gpu_mem_size_sum' stats report
SKIPPED: /mnt/d/cuda/cudalabs/lab2/report2.sqlite does not contain GPU memory data.
Generated:
    /mnt/d/cuda/cudalabs/lab2/report2.nsys-rep
    /mnt/d/cuda/cudalabs/lab2/report2.sqlite

What do I do? Add everything up? The online notebook's report from the nsys profile command gives me one number for the kernel, which is what I want to compare across different kernel configurations, isn't it? I also have a similar setup: Windows 11, WSL2, Ubuntu 22.04, CUDA 12.1.

WSL2 doesn’t allow access to all these metrics. Try to run your code in a complete Linux environment.
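
If all you need for now is one duration per kernel to compare configurations, a possible workaround is to time the kernel inside the program with CUDA events. A minimal sketch, with myKernel standing in for your kernel and the launch configuration chosen arbitrarily:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { }  // placeholder for the kernel under test

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<256, 256>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait for the kernel and the stop event

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // GPU time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}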

I see. I just got rid of my dual boot, but I guess I'll set it up again, since I think I read somewhere that a VM can't access the GPU.