GPU error on boot / Xorg crashing

I am new to the Nano and am working with a dev kit. I am using Balena (www.balena.io) to build a scalable IIoT app and am evaluating whether the Nano will meet our needs. Balena runs Docker containers on top of a Yocto core and supports the Nano. I built a basic Ubuntu container and installed the L4T package. When I run startx, I get this error:

X.Org X Server 1.19.6
Release Date: 2017-12-20
X Protocol Version 11, Revision 0
Build Operating System: Linux 4.4.0-168-generic aarch64 Ubuntu
Current Operating System: Linux 95295ec09afd 4.9.140-l4t-r32.3.1 #1 SMP PREEMPT Thu Mar 5 14:37:57 UTC 2020 aarch64
Kernel command line: tegraid=21.1.2.0.0 ddr_die=4096M@2048M section=512M memtype=0 vpr_resize usb_port_owner_info=0 lane_owner_info=0 emc_max_dvfs=0 touch_id=0@63 video=tegrafb no_console_suspend=1 console=ttyS0,115200n8 debug_uartport=lsport,0 earlyprintk=uart8250-32bit,0x70006000 maxcpus=4 usbcore.old_scheme_first=1 lp0_vec=0x1000@0xff780000 core_edp_mv=1075 core_edp_ma=4000 tegra_fbmem=0x800000@0x92cb4000 is_hdmi_initialised=1  console=ttyS0,115200 console=tty0 fbcon=map:0 net.ifnames=0 root=PARTUUID=1fe4f4f7-ea07-4438-aaa3-23126c5c3897 ro rootwait  sdhci_tegra.en_boot_part_access=1
Build Date: 14 November 2019  06:20:13PM
xorg-server 2:1.19.6-1ubuntu4.4 (For technical support please see http://www.ubuntu.com/support) 
Current version of pixman: 0.34.0
        Before reporting problems, check http://wiki.x.org
        to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
        (++) from command line, (!!) notice, (II) informational,
        (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Thu Jun 25 03:00:36 2020
(==) Using config file: "/etc/X11/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
(EE) 
Fatal server error:
(EE) NVIDIA: A GPU exception occurred during X server initialization
(EE) 
(EE) 
Please consult the The X.Org Foundation support 
         at http://wiki.x.org
 for help. 
(EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
(EE) 

The device reboots itself right after that. When I check the dmesg logs on startup, I see some NVGPU errors (shown below). Oddly, it worked the first few times, but now I can’t get it back to a working state. Does anyone have an idea what could be wrong, based on these errors?

[   15.873487] EXT4-fs (mmcblk0p13): couldn't mount as ext3 due to feature incompatibilities
[   15.889358] EXT4-fs (mmcblk0p13): couldn't mount as ext2 due to feature incompatibilities
[   15.914654] EXT4-fs (mmcblk0p13): mounted filesystem with ordered data mode. Opts: (null)
[   15.989912] EXT4-fs (mmcblk0p15): mounted filesystem with ordered data mode. Opts: (null)
[   17.300019] EXT4-fs (mmcblk0p13): re-mounted. Opts: (null)
[   17.904449] cgroup: cgroup2: unknown option "nsdelegate"
[   18.464936] systemd[1]: File /lib/systemd/system/systemd-journald.service:12 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling.
[   18.495239] systemd[1]: Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.)
[   18.828622] systemd[1]: /lib/systemd/system/chronyd.service:9: PIDFile= references path below legacy directory /var/run/, updating /var/run/chrony/chronyd.pid → /run/chrony/chronyd.pid; please update the unit file accordingly.
[   19.817005] nvgpu: 57000000.gpu           gm20b_init_clk_setup_sw:1268 [INFO]  GPCPLL initial settings: NA mode, M=1, N=34, P=3 (id = 1)
[   20.925707] EXT4-fs (mmcblk0p16): mounted filesystem with ordered data mode. Opts: (null)
[   23.898590] usbcore: registered new interface driver btusb
[   23.927303] usbcore: registered new interface driver usbserial
[   23.947392] usbcore: registered new interface driver ch341
[   23.956789] usbserial: USB Serial support registered for ch341-uart
[   23.968582] ch341 1-2.2:1.0: ch341-uart converter detected
[   23.984118] usb 1-2.2: ch341-uart converter now attached to ttyUSB0
[   24.037538] input: HID 222a:0001 as /devices/70090000.xusb/usb1/1-2/1-2.1/1-2.1:1.0/0003:222A:0001.0001/input/input2
[   24.061839] hid-multitouch 0003:222A:0001.0001: input,hidraw1: USB HID v1.10 Device [HID 222a:0001] on usb-70090000.xusb-2.1/input0
[   30.827707] nvgpu: 57000000.gpu   __nvgpu_timeout_expired_msg_cpu:94   [ERR]  Timeout detected @ gr_gk20a_ctx_wait_ucode+0xb8/0x3b0 [nvgpu] 
[   30.847212] nvgpu: 57000000.gpu           gr_gk20a_ctx_wait_ucode:528  [ERR]  timeout waiting on mailbox=0 value=0x00000010
[   30.863607] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:129  [ERR]  gr_fecs_os_r : 0
[   30.877404] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:131  [ERR]  gr_fecs_cpuctl_r : 0x40
[   30.891799] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:133  [ERR]  gr_fecs_idlestate_r : 0x1
[   30.906448] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:135  [ERR]  gr_fecs_mailbox0_r : 0x0
[   30.920999] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:137  [ERR]  gr_fecs_mailbox1_r : 0x0
[   30.935673] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:139  [ERR]  gr_fecs_irqstat_r : 0x0
[   30.949985] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:141  [ERR]  gr_fecs_irqmode_r : 0x4
[   30.964328] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:143  [ERR]  gr_fecs_irqmask_r : 0x8704
[   30.978887] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:145  [ERR]  gr_fecs_irqdest_r : 0x0
[   30.993197] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:147  [ERR]  gr_fecs_debug1_r : 0x40
[   31.007445] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:149  [ERR]  gr_fecs_debuginfo_r : 0x0
[   31.021890] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:151  [ERR]  gr_fecs_ctxsw_status_1_r : 0x300
[   31.037007] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(0) : 0x10
[   31.052244] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(1) : 0x0
[   31.067336] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(2) : 0x41009
[   31.082696] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(3) : 0x20
[   31.097752] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(4) : 0x1ffda0
[   31.113075] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(5) : 0x0
[   31.127894] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(6) : 0x0
[   31.142693] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(7) : 0x0
[   31.157433] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(8) : 0x0
[   31.172126] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(9) : 0x0
[   31.186720] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(10) : 0x0
[   31.201402] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(11) : 0x0
[   31.216056] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(12) : 0x0
[   31.230615] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(13) : 0x0
[   31.245059] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(14) : 0x0
[   31.259421] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(15) : 0x0
[   31.273564] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:159  [ERR]  gr_fecs_engctl_r : 0x0
[   31.286617] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:161  [ERR]  gr_fecs_curctx_r : 0x0
[   31.300150] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:163  [ERR]  gr_fecs_nxtctx_r : 0x0
[   31.312884] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:169  [ERR]  FECS_FALCON_REG_IMB : 0xbadfbadf
[   31.326628] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:175  [ERR]  FECS_FALCON_REG_DMB : 0xbadfbadf
[   31.326654] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:181  [ERR]  FECS_FALCON_REG_CSW : 0xbadfbadf
[   31.326665] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:187  [ERR]  FECS_FALCON_REG_CTX : 0xbadfbadf
[   31.326675] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:193  [ERR]  FECS_FALCON_REG_EXCI : 0xbadfbadf
[   31.326685] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:200  [ERR]  FECS_FALCON_REG_PC : 0xbadfbadf
[   31.326694] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:206  [ERR]  FECS_FALCON_REG_SP : 0xbadfbadf
[   31.326704] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:200  [ERR]  FECS_FALCON_REG_PC : 0xbadfbadf
[   31.326713] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:206  [ERR]  FECS_FALCON_REG_SP : 0xbadfbadf
[   31.326722] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:200  [ERR]  FECS_FALCON_REG_PC : 0xbadfbadf
[   31.326732] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:206  [ERR]  FECS_FALCON_REG_SP : 0xbadfbadf
[   31.326741] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:200  [ERR]  FECS_FALCON_REG_PC : 0xbadfbadf
[   31.326750] nvgpu: 57000000.gpu      gk20a_fecs_dump_falcon_stats:206  [ERR]  FECS_FALCON_REG_SP : 0xbadfbadf
[   31.326768] NV_PGRAPH_STATUS: 0xa1
[   31.326775] NV_PGRAPH_STATUS1: 0x0
[   31.326781] NV_PGRAPH_STATUS2: 0x0
[   31.326788] NV_PGRAPH_ENGINE_STATUS: 0x1
[   31.326794] NV_PGRAPH_GRFIFO_STATUS : 0x1
[   31.326801] NV_PGRAPH_GRFIFO_CONTROL : 0x10001
[   31.326808] NV_PGRAPH_PRI_FECS_HOST_INT_STATUS : 0x0
[   31.326814] NV_PGRAPH_EXCEPTION  : 0x0
[   31.326820] NV_PGRAPH_FECS_INTR  : 0x0
[   31.326826] NV_PFIFO_ENGINE_STATUS(GR) : 0x90000000
[   31.326832] NV_PGRAPH_ACTIVITY0: 0x0
[   31.326839] NV_PGRAPH_ACTIVITY1: 0x600
[   31.326844] NV_PGRAPH_ACTIVITY2: 0x0
[   31.326850] NV_PGRAPH_ACTIVITY4: 0x0
[   31.326856] NV_PGRAPH_PRI_SKED_ACTIVITY: 0x0
[   31.326863] NV_PGRAPH_PRI_GPC0_GPCCS_GPC_ACTIVITY0: 0x0
[   31.326870] NV_PGRAPH_PRI_GPC0_GPCCS_GPC_ACTIVITY1: 0x0
[   31.326876] NV_PGRAPH_PRI_GPC0_GPCCS_GPC_ACTIVITY2: 0x0
[   31.326883] NV_PGRAPH_PRI_GPC0_GPCCS_GPC_ACTIVITY3: 0x0
[   31.326890] NV_PGRAPH_PRI_GPC0_TPC0_TPCCS_TPC_ACTIVITY0: 0x0
[   31.326897] NV_PGRAPH_PRI_GPC0_TPCS_TPCCS_TPC_ACTIVITY0: 0x0
[   31.326904] NV_PGRAPH_PRI_GPCS_GPCCS_GPC_ACTIVITY0: 0x0
[   31.326911] NV_PGRAPH_PRI_GPCS_GPCCS_GPC_ACTIVITY1: 0x0
[   31.326918] NV_PGRAPH_PRI_GPCS_GPCCS_GPC_ACTIVITY2: 0x0
[   31.326924] NV_PGRAPH_PRI_GPCS_GPCCS_GPC_ACTIVITY3: 0x0
[   31.326931] NV_PGRAPH_PRI_GPCS_TPC0_TPCCS_TPC_ACTIVITY0: 0x0
[   31.326939] NV_PGRAPH_PRI_GPCS_TPCS_TPCCS_TPC_ACTIVITY0: 0x0
[   31.326946] NV_PGRAPH_PRI_BE0_BECS_BE_ACTIVITY0: 0x0
[   31.326952] NV_PGRAPH_PRI_BE1_BECS_BE_ACTIVITY0: 0x0
[   31.326959] NV_PGRAPH_PRI_BES_BECS_BE_ACTIVITY0: 0x0
[   31.326966] NV_PGRAPH_PRI_DS_MPIPE_STATUS: 0x0
[   31.326973] NV_PGRAPH_PRI_FE_GO_IDLE_ON_STATUS: 0x2e
[   31.326980] NV_PGRAPH_PRI_FE_GO_IDLE_TIMEOUT : 0x800
[   31.326986] NV_PGRAPH_PRI_FE_GO_IDLE_CHECK : 0x800
[   31.326993] NV_PGRAPH_PRI_FE_GO_IDLE_INFO : 0x23000700
[   31.327001] NV_PGRAPH_PRI_GPC0_TPC0_TEX_M_TEX_SUBUNITS_STATUS: 0x0
[   31.327008] NV_PGRAPH_PRI_CWD_FS: 0x101
[   31.327014] NV_PGRAPH_PRI_FE_TPC_FS: 0x1
[   31.327020] NV_PGRAPH_PRI_CWD_GPC_TPC_ID(0): 0x0
[   31.327027] NV_PGRAPH_PRI_CWD_SM_ID(0): 0x0
[   31.327033] NV_PGRAPH_PRI_FECS_CTXSW_STATUS_FE_0: 0x2000
[   31.327039] NV_PGRAPH_PRI_FECS_CTXSW_STATUS_1: 0x300
[   31.327046] NV_PGRAPH_PRI_GPC0_GPCCS_CTXSW_STATUS_GPC_0: 0x0
[   31.327053] NV_PGRAPH_PRI_GPC0_GPCCS_CTXSW_STATUS_1: 0x300
[   31.327060] NV_PGRAPH_PRI_FECS_CTXSW_IDLESTATE : 0xe
[   31.327067] NV_PGRAPH_PRI_GPC0_GPCCS_CTXSW_IDLESTATE : 0xf
[   31.327074] NV_PGRAPH_PRI_FECS_CURRENT_CTX : 0x807ff65e
[   31.327080] NV_PGRAPH_PRI_FECS_NEW_CTX : 0x807ff65e
[   31.327088] NV_PGRAPH_PRI_BE0_CROP_STATUS1 : 0x5800000
[   31.327095] NV_PGRAPH_PRI_BES_CROP_STATUS1 : 0x5800000
[   31.327102] NV_PGRAPH_PRI_BE0_ZROP_STATUS : 0x0
[   31.327109] NV_PGRAPH_PRI_BE0_ZROP_STATUS2 : 0x0
[   31.327116] NV_PGRAPH_PRI_BES_ZROP_STATUS : 0x0
[   31.327123] NV_PGRAPH_PRI_BES_ZROP_STATUS2 : 0x0
[   31.327129] NV_PGRAPH_PRI_BE0_BECS_BE_EXCEPTION: 0x0
[   31.327136] NV_PGRAPH_PRI_BE0_BECS_BE_EXCEPTION_EN: 0x0
[   31.327144] NV_PGRAPH_PRI_GPC0_GPCCS_GPC_EXCEPTION: 0x0
[   31.327151] NV_PGRAPH_PRI_GPC0_GPCCS_GPC_EXCEPTION_EN: 0x30000
[   31.327158] NV_PGRAPH_PRI_GPC0_TPC0_TPCCS_TPC_EXCEPTION: 0x0
[   31.327165] NV_PGRAPH_PRI_GPC0_TPC0_TPCCS_TPC_EXCEPTION_EN: 0x3
[   31.327179] nvgpu: 57000000.gpu    gr_gk20a_submit_fecs_method_op:579  [ERR]  fecs method: data=0x807ff65e push adr=0x00000009
[   31.327189] nvgpu: 57000000.gpu      gr_gk20a_fecs_ctx_image_save:1367 [ERR]  save context image failed

Hi,

For now, I can only suggest that you not use Docker and instead use the plain SD card image or sdkmanager directly, to verify whether the GPU on your board actually works with our release.

This will clarify whether there is a hardware problem on your board.
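
When comparing a boot from the plain image against a boot under the container setup, it may help to reduce each dmesg dump to just the nvgpu error lines. Below is a minimal sketch in Python. It assumes the line layout seen in the log above (timestamp, "nvgpu:", device, function:line, [LEVEL], message). The names NvgpuEntry and parse_nvgpu are made up for illustration and are not part of any NVIDIA tool.

```python
import re
from typing import Iterable, List, NamedTuple


class NvgpuEntry(NamedTuple):
    """One parsed nvgpu line from dmesg (hypothetical helper type)."""
    timestamp: float
    device: str
    function: str
    line: int
    level: str
    message: str


# Assumed layout, taken from the log in this thread:
# [   30.847212] nvgpu: 57000000.gpu   gr_gk20a_ctx_wait_ucode:528  [ERR]  message
_NVGPU_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+nvgpu:\s+(\S+)\s+(\w+):(\d+)\s+\[(\w+)\]\s+(.*)$"
)


def parse_nvgpu(lines: Iterable[str], level: str = "ERR") -> List[NvgpuEntry]:
    """Return the nvgpu entries with the given level; skip all other lines."""
    entries = []
    for raw in lines:
        match = _NVGPU_RE.match(raw.rstrip())
        if match is None or match.group(5) != level:
            continue
        entries.append(
            NvgpuEntry(
                timestamp=float(match.group(1)),
                device=match.group(2),
                function=match.group(3),
                line=int(match.group(4)),
                level=match.group(5),
                message=match.group(6).strip(),
            )
        )
    return entries


def first_error(lines: Iterable[str]) -> str:
    """Summarise the earliest nvgpu error, or report that none was found."""
    errors = parse_nvgpu(lines)
    if not errors:
        return "no nvgpu errors found"
    first = errors[0]
    return f"{first.timestamp:.6f} {first.function}: {first.message}"
```

One way to use it is to save the output of dmesg to a file on each setup, pass the lines of each file to first_error, and check whether the timeout in gr_gk20a_ctx_wait_ucode still appears on the plain image.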