GPU write errors with nvprof

When using nvprof on my Xavier NX, I see frequent GPU driver error messages such as:

[ 7273.686545] nvgpu: 17000000.gv11b               gp10b_priv_ring_isr:121  [ERR]  ringmaster intr status0: 0x00000100,status1: 0x00000001
[ 7273.686799] nvgpu: 17000000.gv11b               gp10b_priv_ring_isr:149  [ERR]  SYS write error. ADR 0x00406004 WRDAT 0x00000200 INFO 0x1f408210 (subid 0x0000001f priv level 0), CODE 0xbadf1201
[ 7273.687079] nvgpu: 17000000.gv11b gp10b_priv_ring_decode_error_code:79   [ERR]  client timeout
[ 7273.687232] nvgpu: 17000000.gv11b               gp10b_priv_ring_isr:175  [ERR]  GPC0 write error. ADR 0x0041bfec WRDAT 0xfffffffe INFO 0x18408215 (subid 0x00000018 priv level 0), CODE 0xbadf1002
[ 7273.687513] nvgpu: 17000000.gv11b gp10b_priv_ring_decode_error_code:79   [ERR]  client timeout

This looks similar to this topic:

but it happens even with CUDA samples such as matrixMul or vectorAdd, and even with deviceQuery.

It seems to happen more frequently:

  • when launched from a GUI terminal than from a virtual console;
  • when using CUDA memory calls such as cudaMemcpy or cudaMemset.

In my case it occurs very often, so it should be easy to reproduce.
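To make the frequency easier to quantify, here is a rough sketch of a reproduction loop. It profiles a sample several times and prints the new priv ring write errors that show up in dmesg after each run. The regular expression is only fitted to the log lines quoted above, the sample path ./vectorAdd is a placeholder for wherever the sample was built, and it assumes dmesg is readable by the current user (otherwise run it with sudo).

```python
import re
import subprocess

# Fitted to the "write error" lines from gp10b_priv_ring_isr quoted above;
# other nvgpu versions may format these lines differently.
PRIV_ERR = re.compile(
    r"gp10b_priv_ring_isr:\d+\s+\[ERR\]\s+(?P<unit>\w+) write error\. "
    r"ADR (?P<adr>0x[0-9a-fA-F]+) WRDAT (?P<wrdat>0x[0-9a-fA-F]+) "
    r".*CODE (?P<code>0x[0-9a-fA-F]+)"
)


def parse_priv_errors(log_text):
    """Return one dict (unit, adr, wrdat, code) per write error in log_text."""
    return [m.groupdict() for m in PRIV_ERR.finditer(log_text)]


def read_dmesg():
    # Assumption: the kernel log is readable without root on this system.
    result = subprocess.run(
        ["dmesg"], stdout=subprocess.PIPE, universal_newlines=True, check=True
    )
    return result.stdout


def repro(sample, runs=5, nvprof="/usr/local/cuda/bin/nvprof"):
    """Profile `sample` `runs` times and print errors that appear after each run."""
    # Simplification: counts are compared between runs, so this miscounts
    # if the kernel ring buffer wraps while the loop is running.
    seen = len(parse_priv_errors(read_dmesg()))
    for i in range(1, runs + 1):
        subprocess.run([nvprof, sample], check=False)
        errors = parse_priv_errors(read_dmesg())
        for err in errors[seen:]:
            print(f"run {i}: {err['unit']} ADR {err['adr']} CODE {err['code']}")
        seen = len(errors)


if __name__ == "__main__":
    repro("./vectorAdd")  # placeholder path to a built CUDA sample
```

Running it once from a GUI terminal and once from a virtual console would give a direct comparison of the two cases listed above.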

For reference, I’m using a fairly standard NX devkit:

cat /etc/nv_tegra_release 
# R32 (release), REVISION: 4.4, GCID: 23942405, BOARD: t186ref, EABI: aarch64, DATE: Fri Oct 16 19:37:08 UTC 2020

uname -a
Linux Xavier-NX 4.9.140-tegra #1 SMP PREEMPT Tue Oct 27 21:02:46 PDT 2020 aarch64 aarch64 aarch64 GNU/Linux

cat /usr/local/cuda/version.txt 
CUDA Version 10.2.89

/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_21:14:42_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

sudo /usr/local/cuda/bin/nvprof --version
nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2019 NVIDIA Corporation
Release version 10.2.89 (21)

Hi,

Thanks for reporting this.
Confirmed that we can reproduce this issue in our environment.
We are checking this with our internal team. Will get back to you later.

Thanks.

It seems this is not yet fixed in R32.5.1…
Is nvprof being deprecated, or will it be available again for simple native CUDA profiling?

I’d suggest adding a test to the nvprof QA suite that profiles CUDA samples two or more times in a row. The first run may pass, but some of the following runs fail.

Well, I was expecting an answer but …
Today I’ve also seen a GPU fault without using nvprof. I’m not sure it is related.
I think it happened when trying to play an H264-encoded video with Totem, but I failed to reproduce it.
For reference, the relevant syslog is below. I’d further suggest adding kernel timestamps to tegrastats output, so that it would be much easier to find the tegrastats reading from the moment the fault happened.

Mar  2 23:11:14 Xavier-NX dbus-daemon[7442]: [session uid=1000 pid=7442] Activating service name='org.gnome.Totem' requested by ':1.107' (uid=1000 pid=10460 comm="/usr/bin/nautilus --gapplication-service ")
Mar  2 23:11:15 Xavier-NX /usr/lib/gdm3/gdm-x-session[7427]: (--) NVIDIA(GPU-0): Samsung S34J55x (DFP-0): connected
Mar  2 23:11:15 Xavier-NX /usr/lib/gdm3/gdm-x-session[7427]: (--) NVIDIA(GPU-0): Samsung S34J55x (DFP-0): External TMDS
Mar  2 23:11:15 Xavier-NX /usr/lib/gdm3/gdm-x-session[7427]: (--) NVIDIA(GPU-0): Samsung S34J55x (DFP-0) Name Aliases:
Mar  2 23:11:15 Xavier-NX /usr/lib/gdm3/gdm-x-session[7427]: (--) NVIDIA(GPU-0):   DFP
Mar  2 23:11:15 Xavier-NX /usr/lib/gdm3/gdm-x-session[7427]: (--) NVIDIA(GPU-0):   DFP-0
Mar  2 23:11:15 Xavier-NX /usr/lib/gdm3/gdm-x-session[7427]: (--) NVIDIA(GPU-0):   DPY-0
Mar  2 23:11:15 Xavier-NX /usr/lib/gdm3/gdm-x-session[7427]: (--) NVIDIA(GPU-0):   HDMI-0
Mar  2 23:11:15 Xavier-NX /usr/lib/gdm3/gdm-x-session[7427]: (--) NVIDIA(GPU-0):   DPY-EDID-6c7fa4d9-4543-f1d3-0379-6c61d3c1aed9
Mar  2 23:11:15 Xavier-NX /usr/lib/gdm3/gdm-x-session[7427]: (--) NVIDIA(GPU-0):   HDMI-0
Mar  2 23:11:15 Xavier-NX dbus-daemon[7442]: [session uid=1000 pid=7442] Successfully activated service 'org.gnome.Totem'
Mar  2 23:11:16 Xavier-NX org.gnome.Totem[7442]: Opening in BLOCKING MODE
Mar  2 23:11:16 Xavier-NX totem[10604]: gst_caps_is_empty: assertion 'GST_IS_CAPS (caps)' failed
Mar  2 23:11:16 Xavier-NX totem[10604]: gst_caps_truncate: assertion 'GST_IS_CAPS (caps)' failed
Mar  2 23:11:16 Xavier-NX org.gnome.Totem[7442]: NvMMLiteOpen : Block : BlockType = 261
Mar  2 23:11:16 Xavier-NX totem[10604]: gst_caps_fixate: assertion 'GST_IS_CAPS (caps)' failed
Mar  2 23:11:16 Xavier-NX totem[10604]: gst_caps_get_structure: assertion 'GST_IS_CAPS (caps)' failed
Mar  2 23:11:16 Xavier-NX totem[10604]: gst_structure_get_string: assertion 'structure != NULL' failed
Mar  2 23:11:16 Xavier-NX totem[10604]: gst_mini_object_unref: assertion 'mini_object != NULL' failed
Mar  2 23:11:16 Xavier-NX org.gnome.Totem[7442]: NVMEDIA: Reading vendor.tegra.display-size : status: 6
Mar  2 23:11:16 Xavier-NX org.gnome.Totem[7442]: NvMMLiteBlockCreate : Block : BlockType = 261
Mar  2 23:11:16 Xavier-NX org.gnome.Totem[7442]: Allocating new output: 2000x1504 (x 16), ThumbnailMode = 0
Mar  2 23:11:16 Xavier-NX org.gnome.Totem[7442]: OPENMAX: HandleNewStreamFormat: 3605: Send OMX_EventPortSettingsChanged: nFrameWidth = 2000, nFrameHeight = 1500
Mar  2 23:11:17 Xavier-NX kernel: [ 8202.542273] nvgpu: 17000000.gv11b               gp10b_priv_ring_isr:121  [ERR]  ringmaster intr status0: 0x00000000,status1: 0x00000001
Mar  2 23:11:17 Xavier-NX kernel: [ 8202.542525] nvgpu: 17000000.gv11b               gp10b_priv_ring_isr:175  [ERR]  GPC0 write error. ADR 0x00418020 WRDAT 0x00000000 INFO 0x1d40822a (subid 0x0000001d priv level 0), CODE 0xbadf1201
Mar  2 23:11:17 Xavier-NX kernel: [ 8202.542807] nvgpu: 17000000.gv11b gp10b_priv_ring_decode_error_code:79   [ERR]  client timeout
Mar  2 23:11:17 Xavier-NX kernel: [ 8202.542958] nvgpu: 17000000.gv11b                  gk20a_ptimer_isr:50   [ERR]  PRI timeout: ADR 0x00418020 READ  DATA 0x00000000
Mar  2 23:11:17 Xavier-NX kernel: [ 8202.543147] nvgpu: 17000000.gv11b                  gk20a_ptimer_isr:56   [ERR]  FECS_ERRCODE 0xbadf1201
Mar  2 23:11:17 Xavier-NX kernel: [ 8202.543298] nvgpu: 17000000.gv11b gp10b_priv_ring_decode_error_code:79   [ERR]  client timeout
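Until tegrastats has such an option, a small wrapper can approximate it. The sketch below prefixes every tegrastats line with a timestamp in the same "[ 8202.542273]" style as the kernel log. It assumes that CLOCK_MONOTONIC is close enough to the kernel log clock for this purpose: both count from boot, but they are not guaranteed to match exactly, in particular after a suspend. The function names are my own, and tegrastats may need sudo to report all fields.

```python
import subprocess
import time


def kernel_stamp(seconds):
    """Format seconds since boot in the style of kernel log timestamps."""
    return f"[{seconds:12.6f}]"


def monotonic_now():
    # Assumption: CLOCK_MONOTONIC roughly tracks the kernel log timestamps.
    return time.clock_gettime(time.CLOCK_MONOTONIC)


def stamp_lines(lines, clock=monotonic_now):
    """Yield each input line prefixed with the clock value at read time."""
    for line in lines:
        yield f"{kernel_stamp(clock())} {line.rstrip()}"


if __name__ == "__main__":
    # tegrastats prints one status line per sampling interval on stdout.
    proc = subprocess.Popen(
        ["tegrastats"], stdout=subprocess.PIPE, universal_newlines=True
    )
    for stamped in stamp_lines(proc.stdout):
        print(stamped, flush=True)
```

With the output redirected to a file, searching for the kernel time of a fault (8202.54 in the log above) should lead to the nearest tegrastats sample.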

Hi,

Sorry for the late reply.
For internal reasons, the priority of the nvprof issue is limited.

We also have other profiling tools that support Jetson.
Maybe you can try them as a temporary solution:
NVIDIA Developer Tools Overview | NVIDIA Developer

Thanks.