help!
I'm a CUDA beginner and I ran into the same problem as you, so I turned to Nsight Compute to profile the kernel instead. Hope my solution helps you :)
Any updates?
Hi, I'm coming from the self-paced course titled “Getting Started with Accelerated Computing in CUDA C/C++”, and I am running into the same issue.
Exercise: Explore UM Migration and Page Faulting
The CUDA Memory Operation Statistics section of nsys profile provides output describing UM behavior for the profiled application. In this exercise, you will make several modifications to a simple application, and make use of nsys profile after each change, to explore how UM data migration behaves.
01-page-faults.cu contains a hostFunction and a gpuKernel, both of which could be used to initialize the elements of a 2<<24-element vector with the number 1. Currently, neither the host function nor the GPU kernel is being used.
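For reference, the file boils down to something like the sketch below. This is not the exact course source (function names, launch configuration, and error checking may differ), but the kernel name deviceKernel matches the one that appears in the profile output later in this thread:

```cpp
#include <cstddef>

// Sketch of the structure of 01-page-faults.cu as described in the exercise;
// details are assumptions, not the actual course file.

__global__ void deviceKernel(int *a, int N)
{
  // Grid-stride loop so any launch configuration covers all N elements.
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for (int i = idx; i < N; i += stride)
    a[i] = 1;               // GPU-side initialization
}

void hostFunction(int *a, int N)
{
  for (int i = 0; i < N; ++i)
    a[i] = 1;               // CPU-side initialization
}

int main()
{
  const int N = 2 << 24;
  size_t size = N * sizeof(int);

  int *a;
  cudaMallocManaged(&a, size);   // unified memory, accessible from host and device

  // As distributed, neither hostFunction nor deviceKernel is called here;
  // the exercise asks you to add calls to create each access scenario.

  cudaFree(a);
}
```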
For each of the 4 questions below, given what you have just learned about UM behavior, first hypothesize what kind of page faulting should happen; then edit 01-page-faults.cu to create a scenario that will allow you to test your hypothesis, using one or both of the two provided functions in the code base.
In order to test your hypotheses, compile and profile your code using the code execution cells below. Be sure to record your hypotheses, as well as the results obtained from the nsys profile --stats=true output. In that output you should be looking for the following:
- Is there a CUDA Memory Operation Statistics section in the output?
- If so, does it indicate host to device (HtoD) or device to host (DtoH) migrations?
- When there are migrations, what does the output say about how many Operations there were? If you see many small memory migration operations, this is a sign that on-demand page faulting is occurring, with small memory migrations occurring each time there is a page fault in the requested location.
Here are the scenarios for you to explore, along with solutions for them if you get stuck (a sketch of one possible setup follows the list):
Is there evidence of memory migration and/or page faulting when unified memory is accessed only by the CPU? (solution)
Is there evidence of memory migration and/or page faulting when unified memory is accessed only by the GPU? (solution)
Is there evidence of memory migration and/or page faulting when unified memory is accessed first by the CPU then the GPU? (solution)
Is there evidence of memory migration and/or page faulting when unified memory is accessed first by the GPU then the CPU? (solution)
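As one example, the GPU-only scenario (which is what the nsys run pasted below appears to be, given the single deviceKernel launch plus cudaDeviceSynchronize in the CUDA API summary) can be set up roughly like this; the launch configuration is my own guess, not the course solution:

```cpp
int main()
{
  const int N = 2 << 24;
  size_t size = N * sizeof(int);

  int *a;
  cudaMallocManaged(&a, size);

  // GPU-only access: the kernel is the first and only code to touch the
  // managed allocation. Compare the resulting CUDA Memory Operation
  // Statistics (if any) against your hypothesis.
  deviceKernel<<<256, 256>>>(a, N);
  cudaDeviceSynchronize();   // make sure the kernel finishes before cudaFree

  cudaFree(a);
}
```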
!nvcc -o page-faults 06-unified-memory-page-faults/01-page-faults.cu -run
!nsys profile --stats=true ./page-faults
Generating ‘/tmp/nsys-report-a9d6.qdstrm’
[1/8] [========================100%] report9.nsys-rep
[2/8] [========================100%] report9.sqlite
[3/8] Executing ‘nvtx_sum’ stats report
SKIPPED: /dli/task/report9.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing ‘osrt_sum’ stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
66.4 150789889 17 8869993.5 10065220.0 2300 52077135 12553520.3 poll
17.7 40219962 15 2681330.8 34372.0 100 20444672 6262185.5 sem_timedwait
13.1 29729134 482 61678.7 10300.0 400 8098000 391386.3 ioctl
1.9 4388357 18 243797.6 6790.0 1170 4253932 1000824.3 mmap
0.4 863247 27 31972.1 4010.0 2720 543750 102990.6 mmap64
0.2 458409 44 10418.4 9656.0 3230 29411 4605.3 open64
0.1 162196 4 40549.0 35966.5 31781 58482 12180.1 pthread_create
0.1 146235 29 5042.6 3130.0 1020 29471 5928.9 fopen
0.1 130063 11 11823.9 12750.0 960 20140 4861.9 write
0.0 89762 7 12823.1 3610.0 2700 49152 17649.0 munmap
0.0 50812 26 1954.3 70.0 50 49072 9610.2 fgets
0.0 34851 6 5808.5 6185.0 2660 8691 2223.7 open
0.0 31511 52 606.0 460.0 150 5871 785.8 fcntl
0.0 24981 22 1135.5 905.0 500 3790 688.9 fclose
0.0 19312 14 1379.4 1155.0 550 3541 950.7 read
0.0 14080 2 7040.0 7040.0 3090 10990 5586.1 socket
0.0 11841 1 11841.0 11841.0 11841 11841 0.0 connect
0.0 8070 5 1614.0 1310.0 90 3400 1523.0 fread
0.0 7780 1 7780.0 7780.0 7780 7780 0.0 pipe2
0.0 5570 64 87.0 50.0 40 330 55.5 pthread_mutex_trylock
0.0 2360 1 2360.0 2360.0 2360 2360 0.0 bind
0.0 1080 1 1080.0 1080.0 1080 1080 0.0 listen
0.0 280 1 280.0 280.0 280 280 0.0 pthread_cond_broadcast
[5/8] Executing ‘cuda_api_sum’ stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
85.0 107114744 1 107114744.0 107114744.0 107114744 107114744 0.0 cudaMallocManaged
11.6 14622934 1 14622934.0 14622934.0 14622934 14622934 0.0 cudaDeviceSynchronize
3.4 4307444 1 4307444.0 4307444.0 4307444 4307444 0.0 cudaFree
0.0 28741 1 28741.0 28741.0 28741 28741 0.0 cudaLaunchKernel
[6/8] Executing ‘cuda_gpu_kern_sum’ stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
100.0 14618831 1 14618831.0 14618831.0 14618831 14618831 0.0 deviceKernel(int *, int)
[7/8] Executing ‘cuda_gpu_mem_time_sum’ stats report
SKIPPED: /dli/task/report9.sqlite does not contain GPU memory data.
[8/8] Executing ‘cuda_gpu_mem_size_sum’ stats report
SKIPPED: /dli/task/report9.sqlite does not contain GPU memory data.
Generated:
/dli/task/report9.nsys-rep
/dli/task/report9.sqlite
Hello! Did you solve the problem? I viewed all the replies above, and it seems like WSL2 cannot use Nsight Systems normally. Can the latest version of Nsight Systems collect data correctly in WSL2?
My version of Nsight Systems is 2024.5.1.113-245134619542v0, and I still hit the same problem in WSL2 :(
Same for NVIDIA Nsight Systems version 2024.4.2.133-244234382004v0; I also hit the same problem.