Hi,
I am following the Unified Memory course.
I compiled the first example with
nvcc -o single-thread-vector-add 01-vector-add/01-vector-add.cu -run
and profiled it with
nsys profile --stats=true ./single-thread-vector-add
The results are:
Success! All values calculated correctly.
Generating '/tmp/nsys-report-eafc.qdstrm'
[1/8] [========================100%] report1.nsys-rep
[2/8] [========================100%] report1.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /dli/task/report1.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ---------- ---------- -------- --------- ----------- ----------------------
90.4 6154621709 318 19354156.3 10074556.0 2220 100141907 27631341.2 poll
8.7 590645191 280 2109447.1 2066063.0 170 20541289 1278921.7 sem_timedwait
0.6 43862349 499 87900.5 12780.0 380 10213399 606373.8 ioctl
0.3 19354957 24 806456.5 5775.5 1080 7282794 2173280.8 mmap
0.0 1141729 27 42286.3 4531.0 3030 785654 149169.7 mmap64
0.0 511139 44 11616.8 10965.0 4510 34401 5260.6 open64
0.0 200354 29 6908.8 4040.0 1470 55051 10010.5 fopen
0.0 159332 4 39833.0 38960.5 27380 54031 13213.9 pthread_create
0.0 131542 11 11958.4 12681.0 1010 16110 4573.5 write
0.0 126084 12 10507.0 4960.0 1510 62562 16942.7 munmap
0.0 58861 26 2263.9 90.0 70 56641 11090.8 fgets
0.0 45330 6 7555.0 8395.0 3620 9840 2345.0 open
0.0 38220 52 735.0 515.0 160 6160 847.7 fcntl
0.0 32460 22 1475.5 1305.0 760 3340 697.1 fclose
0.0 23200 14 1657.1 1375.0 520 4390 1219.8 read
0.0 17580 2 8790.0 8790.0 5310 12270 4921.5 socket
0.0 11700 5 2340.0 990.0 80 7170 2954.8 fread
0.0 11490 1 11490.0 11490.0 11490 11490 0.0 connect
0.0 6020 1 6020.0 6020.0 6020 6020 0.0 pipe2
0.0 5550 64 86.7 50.0 40 170 45.7 pthread_mutex_trylock
0.0 3390 1 3390.0 3390.0 3390 3390 0.0 bind
0.0 1200 1 1200.0 1200.0 1200 1200 0.0 listen
0.0 450 1 450.0 450.0 450 450 0.0 pthread_cond_broadcast
[5/8] Executing 'cuda_api_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ ------------ ---------- ---------- ----------- ---------------------
94.5 2471546863 1 2471546863.0 2471546863.0 2471546863 2471546863 0.0 cudaDeviceSynchronize
4.8 124438386 3 41479462.0 58191.0 17761 124362434 71778762.1 cudaMallocManaged
0.7 19455429 3 6485143.0 6167994.0 5964250 7323185 732880.4 cudaFree
0.0 47231 1 47231.0 47231.0 47231 47231 0.0 cudaLaunchKernel
[6/8] Executing 'cuda_gpu_kern_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ ------------ ---------- ---------- ----------- ----------------------------------------------
100.0 2471537085 1 2471537085.0 2471537085.0 2471537085 2471537085 0.0 addVectorsInto(float *, float *, float *, int)
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
-------- --------------- ----- -------- -------- -------- -------- ----------- ---------------------------------
75.5 34140556 2304 14817.9 4351.5 1983 80192 22493.8 [CUDA Unified Memory memcpy HtoD]
24.5 11060935 768 14402.3 3775.5 1279 80735 22787.8 [CUDA Unified Memory memcpy DtoH]
[8/8] Executing 'cuda_gpu_mem_size_sum' stats report
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
---------- ----- -------- -------- -------- -------- ----------- ---------------------------------
402.653 2304 0.175 0.033 0.004 1.044 0.301 [CUDA Unified Memory memcpy HtoD]
134.218 768 0.175 0.033 0.004 1.044 0.301 [CUDA Unified Memory memcpy DtoH]
Generated:
/dli/task/report1.nsys-rep
/dli/task/report1.sqlite
So there is no problem in the course environment.
Now I compile and run the same source code and use the same nsys command on Windows 11 with CUDA Toolkit 12.5:
nsys profile --stats=true .\single-thread-vector.exe
Generating 'C:\Users\UTILIS~1\AppData\Local\Temp\nsys-report-ecd2.qdstrm'
[1/8] [========================100%] report9.nsys-rep
[2/8] [========================100%] report9.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: C:\Users\laurent\Documents\Visual Studio 2022\cuda_course\report9.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report
SKIPPED: No data available.
[5/8] Executing 'cuda_api_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ----------- ----------- --------- --------- ----------- ----------------------
52,0 485592590 1 485592590,0 485592590,0 485592590 485592590 0,0 cudaDeviceSynchronize
25,0 234802399 1 234802399,0 234802399,0 234802399 234802399 0,0 cudaLaunchKernel
18,0 170693090 3 56897696,0 11941656,0 11541091 147210343 78213302,0 cudaMallocManaged
3,0 33701527 3 11233842,0 5590101,0 5406366 22705060 9934790,0 cudaFree
0,0 25365 1 25365,0 25365,0 25365 25365 0,0 cuLibraryUnload
0,0 4323 1 4323,0 4323,0 4323 4323 0,0 cuModuleGetLoadingMode
0,0 2846 1 2846,0 2846,0 2846 2846 0,0 cuCtxSynchronize
0,0 262 1 262,0 262,0 262 262 0,0 cuDeviceGetLuid
[6/8] Executing 'cuda_gpu_kern_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ----------- ----------- --------- --------- ----------- ----------------------------------------------
100,0 485544700 1 485544700,0 485544700,0 485544700 485544700 0,0 addVectorsInto(float *, float *, float *, int)
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report
SKIPPED: C:\Users\laurent\Documents\Visual Studio 2022\cuda_course\report9.sqlite does not contain GPU memory data.
[8/8] Executing 'cuda_gpu_mem_size_sum' stats report
SKIPPED: C:\Users\laurent\Documents\Visual Studio 2022\cuda_course\report9.sqlite does not contain GPU memory data.
Generated:
C:\Users\laurent\Documents\Visual Studio 2022\cuda_course\report9.nsys-rep
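What is wrong with my command on Windows? The 'osrt_sum' report and both GPU memory reports are skipped.
In WSL2, the kernel report and both GPU memory reports are skipped: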
sudo nsys profile --stats=true ./single-thread-vector-add
[sudo] password for laurent:
Success! All values calculated correctly.
Generating '/tmp/nsys-report-6f22.qdstrm'
[1/8] [========================100%] report16.nsys-rep
[2/8] [========================100%] report16.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /mnt/c/Users/laurent/Documents/Visual Studio 2022/cuda_course/report16.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ---------- ----------- -------- --------- ----------- ----------------------
96.1 3905387096 43 90822955.7 100122039.0 557 100184919 26312807.9 poll
3.3 134922500 555 243103.6 32105.0 156 5136647 670217.9 ioctl
0.3 13443270 29 463561.0 3005.0 622 4482217 1380590.0 mmap
0.1 5739727 5 1147945.4 119354.0 2938 5467016 2415309.5 fread
0.0 1113219 6 185536.5 185546.5 173861 202606 9923.1 mprotect
0.0 691997 22 31454.4 2189.5 345 453696 99665.4 fopen
0.0 632149 3 210716.3 287338.0 56183 288628 133831.3 pthread_create
0.0 507690 7 72527.1 740.0 545 196809 90938.5 read
0.0 482596 1 482596.0 482596.0 482596 482596 0.0 pthread_join
0.0 313205 12 26100.4 802.0 279 279833 80052.7 fclose
0.0 141511 3 47170.3 32499.0 25620 83392 31556.9 sem_timedwait
0.0 25451 4 6362.8 7475.0 422 10079 4196.6 write
0.0 24667 35 704.8 24.0 23 23742 4008.5 fgets
0.0 18979 6 3163.2 2939.5 514 5736 1770.9 open
0.0 8009 6 1334.8 814.0 73 4285 1584.1 fwrite
0.0 6337 10 633.7 301.0 82 2161 760.6 fcntl
0.0 5256 3 1752.0 1856.0 537 2863 1166.5 pipe2
0.0 3202 2 1601.0 1601.0 952 2250 917.8 munmap
0.0 1726 1 1726.0 1726.0 1726 1726 0.0 fflush
0.0 1314 64 20.5 17.0 16 146 19.8 pthread_mutex_trylock
0.0 651 3 217.0 230.0 158 263 53.7 pthread_cond_broadcast
[5/8] Executing 'cuda_api_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ ------------ ---------- ---------- ----------- ----------------------
90.0 3486957086 1 3486957086.0 3486957086.0 3486957086 3486957086 0.0 cudaDeviceSynchronize
5.6 217923759 1 217923759.0 217923759.0 217923759 217923759 0.0 cudaLaunchKernel
3.8 146405113 3 48801704.3 25097953.0 23351220 97955940 42577775.1 cudaMallocManaged
0.6 23709003 3 7903001.0 7180613.0 7080767 9447623 1338613.1 cudaFree
0.0 1103 1 1103.0 1103.0 1103 1103 0.0 cuModuleGetLoadingMode
[6/8] Executing 'cuda_gpu_kern_sum' stats report
SKIPPED: /mnt/c/Users/laurent/Documents/Visual Studio 2022/cuda_course/report16.sqlite does not contain CUDA kernel data.
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report
SKIPPED: /mnt/c/Users/laurent/Documents/Visual Studio 2022/cuda_course/report16.sqlite does not contain GPU memory data.
[8/8] Executing 'cuda_gpu_mem_size_sum' stats report
SKIPPED: /mnt/c/Users/laurent/Documents/Visual Studio 2022/cuda_course/report16.sqlite does not contain GPU memory data.
Generated:
/mnt/c/Users/laurent/Documents/Visual Studio 2022/cuda_course/report16.nsys-rep
/mnt/c/Users/laurent/Documents/Visual Studio 2022/cuda_course/report16.sqlite
Is full profiling, including Unified Memory transfers, possible on a Windows system?
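For reference, the program being profiled is essentially the following. This is only a minimal sketch reconstructed from the kernel signature and the API call counts in the reports above, not the exact course source: the array size N (chosen so that one array is 134.218 MB, matching the DtoH total), the <<<1, 1>>> launch configuration, and the initWith helper are assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: fill an array on the host.
void initWith(float value, float *a, int n) {
  for (int i = 0; i < n; ++i) a[i] = value;
}

// Signature as reported by nsys. With a single thread (assumed launch
// configuration below), one GPU thread walks the whole array.
__global__ void addVectorsInto(float *result, float *a, float *b, int n) {
  for (int i = 0; i < n; ++i) result[i] = a[i] + b[i];
}

int main() {
  const int N = 2 << 24;                  // assumed: 33,554,432 floats
  const size_t size = N * sizeof(float);  // 134,217,728 bytes per array

  float *a, *b, *c;
  // Three managed allocations, matching the three cudaMallocManaged calls.
  cudaMallocManaged(&a, size);
  cudaMallocManaged(&b, size);
  cudaMallocManaged(&c, size);

  // Touch all three arrays on the host first.
  initWith(3.0f, a, N);
  initWith(4.0f, b, N);
  initWith(0.0f, c, N);

  addVectorsInto<<<1, 1>>>(c, a, b, N);   // assumed single-thread launch

  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) printf("Launch error: %s\n", cudaGetErrorString(err));
  err = cudaDeviceSynchronize();
  if (err != cudaSuccess) printf("Sync error: %s\n", cudaGetErrorString(err));

  // Read the result back on the host and verify it.
  bool ok = true;
  for (int i = 0; i < N; ++i) {
    if (c[i] != 7.0f) { ok = false; break; }
  }
  if (ok) printf("Success! All values calculated correctly.\n");
  else printf("FAIL: unexpected value in result.\n");

  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
  return ok ? 0 : 1;
}
```

With this layout, initializing all three arrays on the host before the kernel runs would account for the HtoD total being three times the size of one array (402.653 MB), and checking only the result array on the host would account for the DtoH total of 134.218 MB.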