Error when generating nsys-rep

I am using nsys to profile my program. Running nsys profile on a simple program works without any problem.
For example:

$ sudo nsys profile -o log/prof_query --force-overwrite true --stats true ./axpy_test
y[0] = 2
y[1] = 4
y[2] = 6
y[3] = 8
Generating '/tmp/nsys-report-09cb.qdstrm'
[1/8] [========================100%] prof_query.nsys-rep
[2/8] [========================100%] prof_query.sqlite
[3/8] Executing 'nvtx_sum' stats report
SKIPPED: /home/liuxs/workarea/eecs583-final-project/llvm-nvvm-sample/log/prof_query.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)     Min (ns)    Max (ns)     StdDev (ns)            Name
 --------  ---------------  ---------  ------------  ------------  ----------  -----------  -------------  ----------------------
     58.8      343,469,422          5  68,693,884.4     923,068.0     107,219  316,590,043  138,982,119.1  sem_wait
     19.6      114,606,684         30   3,820,222.8      84,460.0         643   33,828,776    9,127,403.9  poll
     11.4       66,676,882        698      95,525.6      16,319.0         782   11,251,359      698,231.7  ioctl
      8.1       47,585,738          1  47,585,738.0  47,585,738.0  47,585,738   47,585,738            0.0  pthread_cond_timedwait
      1.1        6,398,541         84      76,173.1       8,932.5       4,782    1,568,189      254,995.8  mmap64
      0.3        1,872,515         18     104,028.6     111,464.5      10,951      386,325       80,681.1  sem_timedwait
      0.2          954,552         18      53,030.7       5,077.5       1,443      267,618       96,519.7  mmap
      0.1          618,826          2     309,413.0     309,413.0     201,713      417,113      152,310.8  pthread_join
      0.1          491,848         59       8,336.4       8,333.0       3,563       14,863        2,011.4  open64
      0.1          418,547          1     418,547.0     418,547.0     418,547      418,547            0.0  pthread_mutex_lock
      0.0          288,474         10      28,847.4       3,667.0       1,982      253,955       79,110.9  munmap
      0.0          243,832          5      48,766.4      51,395.0      36,931       58,168       10,002.7  pthread_create
      0.0          126,019         36       3,500.5       3,372.5       1,361        8,798        1,377.5  fopen
      0.0           90,148         76       1,186.2          36.0          34       61,541        7,237.4  fgets
      0.0           50,845         67         758.9         702.0         535        2,281          233.4  fcntl
      0.0           47,503         29       1,638.0       1,612.0         959        2,374          298.2  fclose
      0.0           34,937         24       1,455.7       1,364.5         654        2,661          470.3  read
      0.0           32,877         20       1,643.9       1,549.0         901        2,986          516.0  write
      0.0           22,256          5       4,451.2       3,711.0       3,585        6,608        1,279.5  open
      0.0           14,193          6       2,365.5       1,121.5          58        7,990        3,144.4  fread
      0.0           13,877         20         693.9          43.5          32        5,373        1,596.7  fwrite
      0.0           12,064          2       6,032.0       6,032.0       3,444        8,620        3,660.0  socket
      0.0            8,253          1       8,253.0       8,253.0       8,253        8,253            0.0  connect
      0.0            6,694          1       6,694.0       6,694.0       6,694        6,694            0.0  pipe2
      0.0            5,690          1       5,690.0       5,690.0       5,690        5,690            0.0  pthread_kill
      0.0            4,801          7         685.9         659.0         590          894           98.5  dup
      0.0            3,453          1       3,453.0       3,453.0       3,453        3,453            0.0  fopen64
      0.0            1,957          1       1,957.0       1,957.0       1,957        1,957            0.0  bind
      0.0            1,814          2         907.0         907.0         119        1,695        1,114.4  pthread_cond_signal
      0.0              944          1         944.0         944.0         944          944            0.0  listen
      0.0              913         16          57.1          27.0          26          391           90.9  fflush

[5/8] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)     Min (ns)    Max (ns)   StdDev (ns)            Name
 --------  ---------------  ---------  ------------  ------------  ----------  ----------  ------------  ----------------------
     78.2       91,806,656          2  45,903,328.0  45,903,328.0       3,965  91,802,691  64,911,501.7  cudaMalloc
     21.6       25,369,342          1  25,369,342.0  25,369,342.0  25,369,342  25,369,342           0.0  cudaDeviceReset
      0.1          109,083          1     109,083.0     109,083.0     109,083     109,083           0.0  cuLibraryLoadData
      0.0           30,035          2      15,017.5      15,017.5      12,431      17,604       3,657.9  cudaMemcpy
      0.0           15,844          1      15,844.0      15,844.0      15,844      15,844           0.0  cudaLaunchKernel
      0.0            8,651          1       8,651.0       8,651.0       8,651       8,651           0.0  cudaDeviceSynchronize
      0.0            1,993          1       1,993.0       1,993.0       1,993       1,993           0.0  cuCtxSynchronize
      0.0              954          1         954.0         954.0         954         954           0.0  cuModuleGetLoadingMode

[6/8] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)              Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  -----------------------------
    100.0            1,920          1   1,920.0   1,920.0     1,920     1,920          0.0  axpy(float, float *, float *)

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)      Operation
 --------  ---------------  -----  --------  --------  --------  --------  -----------  ------------------
     74.4            1,024      1   1,024.0   1,024.0     1,024     1,024          0.0  [CUDA memcpy DtoH]
     25.6              352      1     352.0     352.0       352       352          0.0  [CUDA memcpy HtoD]

[8/8] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)      Operation
 ----------  -----  --------  --------  --------  --------  -----------  ------------------
      0.000      1     0.000     0.000     0.000     0.000        0.000  [CUDA memcpy DtoH]
      0.000      1     0.000     0.000     0.000     0.000        0.000  [CUDA memcpy HtoD]

However, when I profile my own program the same way, report generation fails with:

Generating '/tmp/nsys-report-3ad6.qdstrm'
[1/8] [=====32%                    ] prof_my.nsys-rep
Importer error status: An unknown error occurred.
Generated:
    /home/liuxs/workarea/log/nsys/prof_my.qdstrm

The code is designed for multiple threads and multiple streams, but I've limited it to a single thread and a single stream while profiling.
How can I resolve this unknown error, and how can I get more details about the crash?

I saw this post. The qdstrm file generated for my program is about 860 MB, so I guess it is simply too large to be processed at run time with only 16 GB of memory. Is there any way to deal with this problem? How can I process this file further offline?

Are you just running “nsys profile appname”?

My first suggestion would be to limit the scope of the profile. Usually just a few seconds of the application running is enough to give you the information you need. You can do that by using the --duration or --delay options from the command line.
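
As a rough sketch of what that could look like: the command below waits a few seconds after launch and then collects for a short window. The application path, output name, and the 5 s / 10 s values are placeholders assumed for illustration, not values from your setup; both --delay and --duration take seconds.

```shell
# Hypothetical paths and timings; substitute your own.
APP=./my_app
OUT=log/nsys/prof_my_short

# Wait 5 s after launch before collecting, then collect for 10 s.
NSYS_CMD="nsys profile --delay=5 --duration=10 -o $OUT --force-overwrite true $APP"
echo "$NSYS_CMD"

# Run the command only if nsys is on PATH and the application exists.
if command -v nsys >/dev/null 2>&1 && [ -x "$APP" ]; then
    $NSYS_CMD
fi
```

A shorter collection window should produce a much smaller qdstrm file, which is easier to import on a machine with limited memory.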

If you are mostly interested in CUDA activity, you can also limit what information is collected.

“nsys profile --trace=cuda --sample=none appname” will markedly decrease the amount of information collected.
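
Put together with the output options from your original command, a CUDA-only run might look like the sketch below. The application path and output name are again placeholders assumed for illustration.

```shell
# Hypothetical paths; substitute your own.
APP=./my_app
OUT=log/nsys/prof_my_cuda

# Trace only CUDA API calls and GPU activity, and turn off CPU sampling.
NSYS_CUDA_CMD="nsys profile --trace=cuda --sample=none -o $OUT --force-overwrite true $APP"
echo "$NSYS_CUDA_CMD"

# Run the command only if nsys is on PATH and the application exists.
if command -v nsys >/dev/null 2>&1 && [ -x "$APP" ]; then
    $NSYS_CUDA_CMD
fi
```

With OS runtime tracing left out, the 'osrt_sum' table from your first run would no longer be populated, so add osrt back to the --trace list if you need it.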

You do not need the application to be single thread/stream.

Thanks! It works for me now.
