I am using nsys to profile my program. I can run nsys profile on a simple program without any problem. For example:
$ sudo nsys profile -o log/prof_query --force-overwrite true --stats true ./axpy_test
y[0] = 2
y[1] = 4
y[2] = 6
y[3] = 8
Generating '/tmp/nsys-report-09cb.qdstrm'
[1/8] [========================100%] prof_query.nsys-rep
[2/8] [========================100%] prof_query.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/liuxs/workarea/eecs583-final-project/llvm-nvvm-sample/log/prof_query.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ ------------ ---------- ----------- ------------- ----------------------
58.8 343,469,422 5 68,693,884.4 923,068.0 107,219 316,590,043 138,982,119.1 sem_wait
19.6 114,606,684 30 3,820,222.8 84,460.0 643 33,828,776 9,127,403.9 poll
11.4 66,676,882 698 95,525.6 16,319.0 782 11,251,359 698,231.7 ioctl
8.1 47,585,738 1 47,585,738.0 47,585,738.0 47,585,738 47,585,738 0.0 pthread_cond_timedwait
1.1 6,398,541 84 76,173.1 8,932.5 4,782 1,568,189 254,995.8 mmap64
0.3 1,872,515 18 104,028.6 111,464.5 10,951 386,325 80,681.1 sem_timedwait
0.2 954,552 18 53,030.7 5,077.5 1,443 267,618 96,519.7 mmap
0.1 618,826 2 309,413.0 309,413.0 201,713 417,113 152,310.8 pthread_join
0.1 491,848 59 8,336.4 8,333.0 3,563 14,863 2,011.4 open64
0.1 418,547 1 418,547.0 418,547.0 418,547 418,547 0.0 pthread_mutex_lock
0.0 288,474 10 28,847.4 3,667.0 1,982 253,955 79,110.9 munmap
0.0 243,832 5 48,766.4 51,395.0 36,931 58,168 10,002.7 pthread_create
0.0 126,019 36 3,500.5 3,372.5 1,361 8,798 1,377.5 fopen
0.0 90,148 76 1,186.2 36.0 34 61,541 7,237.4 fgets
0.0 50,845 67 758.9 702.0 535 2,281 233.4 fcntl
0.0 47,503 29 1,638.0 1,612.0 959 2,374 298.2 fclose
0.0 34,937 24 1,455.7 1,364.5 654 2,661 470.3 read
0.0 32,877 20 1,643.9 1,549.0 901 2,986 516.0 write
0.0 22,256 5 4,451.2 3,711.0 3,585 6,608 1,279.5 open
0.0 14,193 6 2,365.5 1,121.5 58 7,990 3,144.4 fread
0.0 13,877 20 693.9 43.5 32 5,373 1,596.7 fwrite
0.0 12,064 2 6,032.0 6,032.0 3,444 8,620 3,660.0 socket
0.0 8,253 1 8,253.0 8,253.0 8,253 8,253 0.0 connect
0.0 6,694 1 6,694.0 6,694.0 6,694 6,694 0.0 pipe2
0.0 5,690 1 5,690.0 5,690.0 5,690 5,690 0.0 pthread_kill
0.0 4,801 7 685.9 659.0 590 894 98.5 dup
0.0 3,453 1 3,453.0 3,453.0 3,453 3,453 0.0 fopen64
0.0 1,957 1 1,957.0 1,957.0 1,957 1,957 0.0 bind
0.0 1,814 2 907.0 907.0 119 1,695 1,114.4 pthread_cond_signal
0.0 944 1 944.0 944.0 944 944 0.0 listen
0.0 913 16 57.1 27.0 26 391 90.9 fflush
[5/8] Executing 'cuda_api_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ ------------ ---------- ---------- ------------ ----------------------
78.2 91,806,656 2 45,903,328.0 45,903,328.0 3,965 91,802,691 64,911,501.7 cudaMalloc
21.6 25,369,342 1 25,369,342.0 25,369,342.0 25,369,342 25,369,342 0.0 cudaDeviceReset
0.1 109,083 1 109,083.0 109,083.0 109,083 109,083 0.0 cuLibraryLoadData
0.0 30,035 2 15,017.5 15,017.5 12,431 17,604 3,657.9 cudaMemcpy
0.0 15,844 1 15,844.0 15,844.0 15,844 15,844 0.0 cudaLaunchKernel
0.0 8,651 1 8,651.0 8,651.0 8,651 8,651 0.0 cudaDeviceSynchronize
0.0 1,993 1 1,993.0 1,993.0 1,993 1,993 0.0 cuCtxSynchronize
0.0 954 1 954.0 954.0 954 954 0.0 cuModuleGetLoadingMode
[6/8] Executing 'cuda_gpu_kern_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- -------- -------- -------- -------- ----------- -----------------------------
100.0 1,920 1 1,920.0 1,920.0 1,920 1,920 0.0 axpy(float, float *, float *)
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
-------- --------------- ----- -------- -------- -------- -------- ----------- ------------------
74.4 1,024 1 1,024.0 1,024.0 1,024 1,024 0.0 [CUDA memcpy DtoH]
25.6 352 1 352.0 352.0 352 352 0.0 [CUDA memcpy HtoD]
[8/8] Executing 'cuda_gpu_mem_size_sum' stats report
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
---------- ----- -------- -------- -------- -------- ----------- ------------------
0.000 1 0.000 0.000 0.000 0.000 0.000 [CUDA memcpy DtoH]
0.000 1 0.000 0.000 0.000 0.000 0.000 [CUDA memcpy HtoD]
However, when I run the same command on my own program, it fails with:
Generating '/tmp/nsys-report-3ad6.qdstrm'
[1/8] [=====32% ] prof_my.nsys-rep
Importer error status: An unknown error occurred.
Generated:
/home/liuxs/workarea/log/nsys/prof_my.qdstrm
The code is designed for multiple threads and multiple streams, but I've limited it to a single thread and a single stream when profiling.
How can I solve this unknown error, and how can I get more details about the crash?
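A possible way to narrow this down, as a sketch rather than a confirmed fix: the log shows that the .qdstrm capture was written and that the failure happens during import. It may therefore help to separate capture from report generation and to shrink what is collected. The script below assumes a hypothetical binary name `./my_program` and a hypothetical output name `log/nsys/prof_my_min`. It turns off the automatic stats step, traces only CUDA, and disables CPU sampling and context-switch tracing. It has not been checked against this particular failure.

```shell
#!/bin/sh
# Hypothetical names; replace with the real binary and output prefix.
APP=${APP:-./my_program}
OUT=${OUT:-log/nsys/prof_my_min}

# Capture only CUDA activity and skip the automatic stats step, so a failure
# can be attributed to either the capture or the later import.
# (No argument contains spaces, so plain word splitting of $CMD is safe here.)
CMD="nsys profile -o $OUT --force-overwrite true --stats false --trace cuda --sample none --cpuctxsw none $APP"
echo "$CMD"

if command -v nsys >/dev/null 2>&1; then
    # Run the reduced capture, then generate the summary tables separately.
    $CMD && nsys stats "$OUT.nsys-rep"
else
    echo "nsys not found on PATH; command printed only" >&2
fi

# If only a .qdstrm is produced again, Nsight Systems also ships a
# QdstrmImporter tool in its host directory that can retry the import by hand.
# Its location depends on the installation, so it is not invoked here.
```

If the reduced capture imports cleanly, re-enabling one option at a time (for example adding `osrt` back to `--trace`) should show which kind of collected data triggers the importer error.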