Nsys is not collecting kernel data

help!

I am a CUDA beginner and I ran into the same problem as you, so I turned to Nsight Compute to profile the kernel instead. Hope this workaround helps you :)
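
In case it's useful, here is a minimal example of the kind of Nsight Compute command I mean (the binary name is just a placeholder for your own executable):

# Profile every kernel launch in the application and write a report file
ncu --set full -o kernel-report ./your-app

The resulting kernel-report.ncu-rep file can then be opened in the Nsight Compute GUI for per-kernel details.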

Any updates?

Hi, I'm coming from the self-paced course titled “Getting Started with Accelerated Computing in CUDA C/C++”, and I am running into the same issue.

Exercise: Explore UM Migration and Page Faulting
CUDA Memory Operation Statistics
nsys profile provides output describing UM behavior for the profiled application. In this exercise, you will make several modifications to a simple application, and make use of nsys profile after each change, to explore how UM data migration behaves.

01-page-faults.cu contains a hostFunction and a gpuKernel, both of which could be used to initialize the elements of a 2<<24 element vector with the number 1. Currently, neither the host function nor the GPU kernel is being used.
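
For anyone reading along without the lab files, the file looks roughly like the sketch below. This is a reconstruction, not the exact course source: the deviceKernel name and signature are taken from the kernel shown in the profile output later in this post, and the function bodies are just the obvious grid-stride / sequential initializations.

#include <cuda_runtime.h>

// GPU initializer: a grid-stride loop, so any launch configuration covers all N elements.
__global__ void deviceKernel(int *a, int N)
{
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = idx; i < N; i += stride)
    a[i] = 1;
}

// CPU initializer: touches every element sequentially on the host.
void hostFunction(int *a, int N)
{
  for (int i = 0; i < N; ++i)
    a[i] = 1;
}

int main()
{
  int N = 2 << 24;
  size_t size = N * sizeof(int);

  int *a;
  cudaMallocManaged(&a, size);  // unified memory, reachable from both CPU and GPU

  // Neither hostFunction nor deviceKernel is called yet; each scenario below
  // adds one or both of them here.

  cudaFree(a);
  return 0;
}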

For each of the 4 questions below, given what you have just learned about UM behavior, first hypothesize about what kind of page faulting should happen, then edit 01-page-faults.cu to create a scenario, by using one or both of the 2 provided functions in the code base, that will allow you to test your hypothesis.

In order to test your hypotheses, compile and profile your code using the code execution cells below. Be sure to record your hypotheses, as well as the results obtained from the nsys profile --stats=true output. In the output of nsys profile --stats=true you should be looking for the following:

  • Is there a CUDA Memory Operation Statistics section in the output?
  • If so, does it indicate host to device (HtoD) or device to host (DtoH) migrations?
  • When there are migrations, what does the output say about how many Operations there were? If you see many small memory migration operations, this is a sign that on-demand page faulting is occurring, with small memory migrations occurring each time there is a page fault in the requested location.

Here are the scenarios for you to explore, along with solutions for them if you get stuck:

Is there evidence of memory migration and/or page faulting when unified memory is accessed only by the CPU? (solution)
Is there evidence of memory migration and/or page faulting when unified memory is accessed only by the GPU? (solution)
Is there evidence of memory migration and/or page faulting when unified memory is accessed first by the CPU then the GPU? (solution)
Is there evidence of memory migration and/or page faulting when unified memory is accessed first by the GPU then the CPU? (solution)
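
For context on the run profiled below: it exercises the GPU-only scenario (the second question above). Assuming the skeleton sketched earlier, the edit to main() would look roughly like this; the 256×256 launch configuration is just an example value, not taken from the course materials.

int main()
{
  int N = 2 << 24;
  size_t size = N * sizeof(int);

  int *a;
  cudaMallocManaged(&a, size);

  // GPU-only access: only the kernel ever touches the managed allocation,
  // so this run should show whether GPU-side page faulting / migration is reported.
  deviceKernel<<<256, 256>>>(a, N);
  cudaDeviceSynchronize();

  cudaFree(a);
  return 0;
}

This matches the API calls in the cuda_api_sum section of the output (cudaMallocManaged, cudaLaunchKernel, cudaDeviceSynchronize, cudaFree), yet the GPU memory reports are still skipped.
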
!nvcc -o page-faults 06-unified-memory-page-faults/01-page-faults.cu -run
!nsys profile --stats=true ./page-faults
Generating ‘/tmp/nsys-report-a9d6.qdstrm’
[1/8] [========================100%] report9.nsys-rep
[2/8] [========================100%] report9.sqlite
[3/8] Executing ‘nvtx_sum’ stats report
SKIPPED: /dli/task/report9.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing ‘osrt_sum’ stats report

Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name


 66.4        150789889         17  8869993.5  10065220.0      2300  52077135   12553520.3  poll                  
 17.7         40219962         15  2681330.8     34372.0       100  20444672    6262185.5  sem_timedwait         
 13.1         29729134        482    61678.7     10300.0       400   8098000     391386.3  ioctl                 
  1.9          4388357         18   243797.6      6790.0      1170   4253932    1000824.3  mmap                  
  0.4           863247         27    31972.1      4010.0      2720    543750     102990.6  mmap64                
  0.2           458409         44    10418.4      9656.0      3230     29411       4605.3  open64                
  0.1           162196          4    40549.0     35966.5     31781     58482      12180.1  pthread_create        
  0.1           146235         29     5042.6      3130.0      1020     29471       5928.9  fopen                 
  0.1           130063         11    11823.9     12750.0       960     20140       4861.9  write                 
  0.0            89762          7    12823.1      3610.0      2700     49152      17649.0  munmap                
  0.0            50812         26     1954.3        70.0        50     49072       9610.2  fgets                 
  0.0            34851          6     5808.5      6185.0      2660      8691       2223.7  open                  
  0.0            31511         52      606.0       460.0       150      5871        785.8  fcntl                 
  0.0            24981         22     1135.5       905.0       500      3790        688.9  fclose                
  0.0            19312         14     1379.4      1155.0       550      3541        950.7  read                  
  0.0            14080          2     7040.0      7040.0      3090     10990       5586.1  socket                
  0.0            11841          1    11841.0     11841.0     11841     11841          0.0  connect               
  0.0             8070          5     1614.0      1310.0        90      3400       1523.0  fread                 
  0.0             7780          1     7780.0      7780.0      7780      7780          0.0  pipe2                 
  0.0             5570         64       87.0        50.0        40       330         55.5  pthread_mutex_trylock 
  0.0             2360          1     2360.0      2360.0      2360      2360          0.0  bind                  
  0.0             1080          1     1080.0      1080.0      1080      1080          0.0  listen                
  0.0              280          1      280.0       280.0       280       280          0.0  pthread_cond_broadcast

[5/8] Executing ‘cuda_api_sum’ stats report

Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name


 85.0        107114744          1  107114744.0  107114744.0  107114744  107114744          0.0  cudaMallocManaged    
 11.6         14622934          1   14622934.0   14622934.0   14622934   14622934          0.0  cudaDeviceSynchronize
  3.4          4307444          1    4307444.0    4307444.0    4307444    4307444          0.0  cudaFree             
  0.0            28741          1      28741.0      28741.0      28741      28741          0.0  cudaLaunchKernel     

[6/8] Executing ‘cuda_gpu_kern_sum’ stats report

Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name


100.0         14618831          1  14618831.0  14618831.0  14618831  14618831          0.0  deviceKernel(int *, int)

[7/8] Executing ‘cuda_gpu_mem_time_sum’ stats report
SKIPPED: /dli/task/report9.sqlite does not contain GPU memory data.
[8/8] Executing ‘cuda_gpu_mem_size_sum’ stats report
SKIPPED: /dli/task/report9.sqlite does not contain GPU memory data.
Generated:
/dli/task/report9.nsys-rep
/dli/task/report9.sqlite
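
One more thing that might be worth checking (I cannot confirm it changes anything under WSL2): recent Nsight Systems releases list explicit switches for Unified Memory page-fault collection in nsys profile --help. Something like the following, with the switch names verified against your installed version:

nsys profile --stats=true --cuda-um-cpu-page-faults=true --cuda-um-gpu-page-faults=true ./page-faults

If the GPU memory reports are still skipped after that, it would point to the WSL2 limitation others in this thread suspect.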

Hello! Did you solve the problem? I read all the replies above, and it seems like WSL2 cannot use Nsight Systems normally. Can the latest version of Nsight Systems collect data in WSL2 correctly?

My version of Nsight Systems is 2024.5.1.113-245134619542v0, and I still hit the same problem in WSL2 :(

Same here with NVIDIA Nsight Systems version 2024.4.2.133-244234382004v0; I hit the same problem as well.