No NVIDIA GPU available or detected on Nvidia Jetson Orin Nano

Hi,

I’m facing an issue on an Nvidia Jetson Orin Nano where the GPU is not being detected.

PyTorch reports that CUDA is not available:

fov@marvel-fov-8:~$ python
Python 3.8.10 (default, May 26 2023, 14:05:08) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print("CUDA is available" if torch.cuda.is_available() else "CUDA is not available")
CUDA is not available

The logs from running sudo jtop say that no GPUs are available:

May 22 14:15:59 marvel-fov-8 systemd[1]: Started jtop service.
May 22 14:15:59 marvel-fov-8 systemd[472530]: jtop.service: Failed to execute command: No such file or directory
May 22 14:15:59 marvel-fov-8 systemd[472530]: jtop.service: Failed at step EXEC spawning /usr/local/bin/jtop: No such file or directory
May 22 14:15:59 marvel-fov-8 systemd[1]: jtop.service: Main process exited, code=exited, status=203/EXEC
May 22 14:15:59 marvel-fov-8 systemd[1]: jtop.service: Failed with result 'exit-code'.
May 22 14:16:09 marvel-fov-8 systemd[1]: jtop.service: Scheduled restart job, restart counter is at 1.
May 22 14:16:09 marvel-fov-8 systemd[1]: Stopped jtop service.
May 22 14:16:09 marvel-fov-8 systemd[1]: Started jtop service.
May 22 14:16:09 marvel-fov-8 jtop[472547]: [INFO] jtop.core.config - Build service folder in /usr/local/jtop
May 22 14:16:09 marvel-fov-8 jtop[472547]: [INFO] jtop.service - jetson_stats 4.2.8 - server loaded
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.service - Running on Python: 3.8.10
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.hardware - Hardware detected aarch64
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.hardware - NVIDIA Jetson 699-level Part Number=699-13767-0005-300 K.2
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.hardware - NVIDIA Jetson Module=NVIDIA Jetson Orin Nano (Developer kit)
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.hardware - NVIDIA Jetson detected L4T=35.3.1
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.cpu - Found 6 CPU
**May 22 14:16:10 marvel-fov-8 jtop[472547]: [WARNING] jtop.core.gpu - No NVIDIA GPU available**
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.processes - Process service started
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.memory - Found EMC!
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.memory - Memory service started
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.engine - Engines found: [APE NVDEC NVENC NVJPG OFA SE VIC]
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.temperature - Found thermal "CV0" in thermal_zone2
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.temperature - Found thermal "CPU" in thermal_zone0
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.temperature - Found thermal "SOC2" in thermal_zone7
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.temperature - Found thermal "SOC0" in thermal_zone5
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.temperature - Found thermal "CV1" in thermal_zone3
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.temperature - Found thermal "GPU" in thermal_zone1
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.temperature - Found thermal "tj" in thermal_zone8
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.temperature - Found thermal "SOC1" in thermal_zone6
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.temperature - Found thermal "CV2" in thermal_zone4
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.power - Alarms VDD_IN - {'crit_alarm': 0, 'max_alarm': 0}
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.power - Alarms VDD_CPU_GPU_CV - {'crit_alarm': 0, 'max_alarm': 0}
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.power - Alarms VDD_SOC - {'crit_alarm': 0, 'max_alarm': 0}
May 22 14:16:10 marvel-fov-8 jtop[472547]: [WARNING] jtop.core.power - Skipped "sum of shunt voltages" /sys/bus/i2c/devices/1-0040/hwmon/hwmon3/in7_label
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.power - Found I2C power monitor
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.fan - Fan pwmfan(1) found in /sys/class/hwmon/hwmon2
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.fan - RPM pwm_tach found in /sys/class/hwmon/hwmon0
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.fan - Found nvfancontrol.service
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.jetson_clocks - jetson_clocks found in /usr/bin/jetson_clocks
May 22 14:16:10 marvel-fov-8 jtop[472547]: [INFO] jtop.core.nvpmodel - nvpmodel running in [0]15W - Default: 0
May 22 14:16:10 marvel-fov-8 jtop[472573]: [INFO] jtop.service - Initialization service
May 22 14:16:11 marvel-fov-8 jtop[472573]: Process JtopServer-1:
May 22 14:16:11 marvel-fov-8 jtop[472573]: Traceback (most recent call last):
May 22 14:16:11 marvel-fov-8 jtop[472573]:   File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
May 22 14:16:11 marvel-fov-8 jtop[472573]:     self.run()
May 22 14:16:11 marvel-fov-8 jtop[472573]:   File "/usr/local/lib/python3.8/dist-packages/jtop/service.py", line 319, in run
May 22 14:16:11 marvel-fov-8 jtop[472573]:     self.jetson_clocks.initialization(self.nvpmodel, data)
May 22 14:16:11 marvel-fov-8 jtop[472573]:   File "/usr/local/lib/python3.8/dist-packages/jtop/core/jetson_clocks.py", line 370, in initialization
May 22 14:16:11 marvel-fov-8 jtop[472573]:     self._engines_list = self.show()
May 22 14:16:11 marvel-fov-8 jtop[472573]:   File "/usr/local/lib/python3.8/dist-packages/jtop/core/jetson_clocks.py", line 522, in show
May 22 14:16:11 marvel-fov-8 jtop[472573]:     lines = cmd(timeout=COMMAND_TIMEOUT)
May 22 14:16:11 marvel-fov-8 jtop[472573]:   File "/usr/local/lib/python3.8/dist-packages/jtop/core/command.py", line 115, in __call__
May 22 14:16:11 marvel-fov-8 jtop[472573]:     raise Command.CommandException('Error process:', self.process.returncode)
May 22 14:16:11 marvel-fov-8 jtop[472573]: jtop.core.command.Command.CommandException: [errno:1] Error process:
May 22 14:16:11 marvel-fov-8 jtop[472547]: [INFO] jtop.service - Service closed
May 22 14:16:11 marvel-fov-8 systemd[1]: jtop.service: Succeeded.

The most notable line above is: May 22 14:16:10 marvel-fov-8 jtop[472547]: [WARNING] jtop.core.gpu - No NVIDIA GPU available

This issue occurred after the Jetson Orin Nano suddenly rebooted while running an intensive Python script. Our other Orin Nano devices have also rebooted suddenly, but none of them ended up with an undetected GPU.

Here is the device info:

Hi,

>>> print("CUDA is available" if torch.cuda.is_available() else "CUDA is not available")

This message most likely indicates that the package was not built with CUDA support, rather than that no GPU was found.
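One way to tell these two cases apart is to look at `torch.version.cuda` alongside `torch.cuda.is_available()`: the former is `None` for a CPU-only wheel. The snippet below is a minimal sketch of that check; the helper name `classify_cuda_state` and its messages are made up for illustration.

```python
from typing import Optional


def classify_cuda_state(cuda_build_version: Optional[str], is_available: bool) -> str:
    """Interpret the two values PyTorch exposes about CUDA.

    cuda_build_version: value of torch.version.cuda (None for CPU-only builds).
    is_available: value of torch.cuda.is_available().
    """
    if cuda_build_version is None:
        return "cpu-only build: this wheel was compiled without CUDA"
    if not is_available:
        return f"built for CUDA {cuda_build_version}, but no usable GPU/driver at runtime"
    return f"built for CUDA {cuda_build_version} and a GPU is usable"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print(classify_cuda_state(torch.version.cuda, torch.cuda.is_available()))
```

If the second message is printed, the wheel itself is fine and the problem is further down, in the driver or the device.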

Could you use our prebuilt package instead and try again:

Thanks.

Hi @AastaLLL

I did install PyTorch using your prebuilt binary. Here is the PyTorch section of my install script:

export TORCH_INSTALL=https://developer.download.nvidia.cn/compute/redist/jp/v511/pytorch/torch-2.0.0+nv23.05-cp38-cp38-linux_aarch64.whl
pip install --no-cache $TORCH_INSTALL

I think the jtop logs suggest a more serious issue: the device is not detecting the GPU, and jtop itself won’t run.

The logs are above. Here is what happens when I try to run it:

fov@marvel-fov-8:~$ sudo jtop
The jtop.service is not active. Please run:
sudo systemctl restart jtop.service
fov@marvel-fov-8:~$ sudo systemctl restart jtop.service
fov@marvel-fov-8:~$ sudo jtop
The jtop.service is not active. Please run:
sudo systemctl restart jtop.service
fov@marvel-fov-8:~$ sudo systemctl status jtop.service
● jtop.service - jtop service
     Loaded: loaded (/etc/systemd/system/jtop.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Fri 2024-05-31 11:27:52 BST; 16s ago
    Process: 69259 ExecStart=/usr/local/bin/jtop --force (code=exited, status=0/SUCCESS)
   Main PID: 69259 (code=exited, status=0/SUCCESS)

May 31 11:27:51 marvel-fov-8 jtop[69278]:     self.jetson_clocks.initialization(self.nvpmodel, data)
May 31 11:27:51 marvel-fov-8 jtop[69278]:   File "/usr/local/lib/python3.8/dist-packages/jtop/core/jetson_clocks.p>
May 31 11:27:51 marvel-fov-8 jtop[69278]:     self._engines_list = self.show()
May 31 11:27:51 marvel-fov-8 jtop[69278]:   File "/usr/local/lib/python3.8/dist-packages/jtop/core/jetson_clocks.p>
May 31 11:27:51 marvel-fov-8 jtop[69278]:     lines = cmd(timeout=COMMAND_TIMEOUT)
May 31 11:27:51 marvel-fov-8 jtop[69278]:   File "/usr/local/lib/python3.8/dist-packages/jtop/core/command.py", li>
May 31 11:27:51 marvel-fov-8 jtop[69278]:     raise Command.CommandException('Error process:', self.process.return>
May 31 11:27:51 marvel-fov-8 jtop[69278]: jtop.core.command.Command.CommandException: [errno:1] Error process:
May 31 11:27:52 marvel-fov-8 jtop[69259]: [INFO] jtop.service - Service closed
May 31 11:27:52 marvel-fov-8 systemd[1]: jtop.service: Succeeded.
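The traceback ends inside jtop’s `jetson_clocks.show()`, where an external command exited with code 1, but the journal does not show that command’s own error output. Assuming the command is `jetson_clocks --show` (an assumption; the log does not include the command line), a rough sketch for running it directly and capturing stderr could look like this. The helper name `run_and_capture` is hypothetical, and the path is taken from the earlier jtop log line.

```python
import subprocess
import sys
from typing import List, Tuple


def run_and_capture(argv: List[str], timeout: float = 10.0) -> Tuple[int, str, str]:
    """Run a command and return (exit code, stdout, stderr) as text."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    # Path reported by jtop in the log above; the --show argument is an assumption.
    command = ["/usr/bin/jetson_clocks", "--show"]
    try:
        code, out, err = run_and_capture(command)
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"could not run {command[0]}: {exc}", file=sys.stderr)
    else:
        print(f"exit code: {code}")
        print(f"stdout:\n{out}")
        print(f"stderr:\n{err}")
```

`jetson_clocks` normally needs root, so the script would have to be run with sudo on the device.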

After the sudden reboot, I tried to run a TensorRT inference script that had been working fine before, but it failed with this error:

$ python tensorrt_script.py

[05/22/2024-14:10:44] [TRT] [W] CUDA initialization failure with error: 100
Traceback (most recent call last):
  File "tensorrt_script.py", line 307, in <module>
    main()
  File "record_images_and_detect.py", line 301, in main
    tensorrt_model = TensorRTInference(ENGINE_PATH)
  File "/home/fov/Desktop/FOVCamerasWebApp/jetson/tensorrt_inference.py", line 33, in __init__
    self.engine = self.load_engine(engine_path)
  File "/home/fov/Desktop/FOVCamerasWebApp/jetson/tensorrt_inference.py", line 40, in load_engine
    with open(engine_path, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
TypeError: pybind11::init(): factory function returned nullptr

Again, both PyTorch with CUDA on the GPU and TensorRT worked fine just before the device suddenly rebooted.

I’m not too sure how to proceed with debugging this.

Thanks
Tim

Hi,

Could you test our deviceQuery sample to check GPU functionality?
Assuming you are using JetPack 6, please run the following:

$ git clone https://github.com/NVIDIA/cuda-samples.git
$ cd cuda-samples/Samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery

Thanks.

@AastaLLL I’m running JetPack 5.1.1 and the Jetson is deployed remotely in a different country, so I can’t reflash it to upgrade the JetPack version.

Will this work for 5.1.1 too?

@AastaLLL I cloned the version of that repo that corresponds to CUDA Toolkit 11.4, which is what is installed on my Jetson Orin Nano.

Here is the output:

fov@marvel-fov-8:~/Desktop/cuda-samples/Samples/deviceQuery$ make
/usr/local/cuda/bin/nvcc -ccbin g++ -I../../Common  -m64    --threads 0 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o deviceQuery.o -c deviceQuery.cpp
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
/usr/local/cuda/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o deviceQuery deviceQuery.o 
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
mkdir -p ../../bin/aarch64/linux/release
cp deviceQuery ../../bin/aarch64/linux/release
fov@marvel-fov-8:~/Desktop/cuda-samples/Samples/deviceQuery$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 100
-> no CUDA-capable device is detected
Result = FAIL
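The code 100 here is the same value that TensorRT reported earlier ("CUDA initialization failure with error: 100"). In the CUDA runtime headers, 100 is `cudaErrorNoDevice`. As a small illustration, here is a lookup over a handful of `cudaError_t` values; the table is deliberately partial and the helper name `describe_cuda_error` is hypothetical.

```python
# Partial table of cudaError_t values from the CUDA runtime API.
# Only a few codes are listed; anything else is reported as unknown.
CUDA_ERRORS = {
    0: ("cudaSuccess", "no error"),
    2: ("cudaErrorMemoryAllocation", "out of memory"),
    3: ("cudaErrorInitializationError", "initialization error"),
    35: ("cudaErrorInsufficientDriver",
         "CUDA driver version is insufficient for CUDA runtime version"),
    100: ("cudaErrorNoDevice", "no CUDA-capable device is detected"),
    101: ("cudaErrorInvalidDevice", "invalid device ordinal"),
}


def describe_cuda_error(code: int) -> str:
    """Return a readable description for a CUDA runtime error code."""
    name, message = CUDA_ERRORS.get(code, ("unknown", "code not in this partial table"))
    return f"{code}: {name} ({message})"


if __name__ == "__main__":
    print(describe_cuda_error(100))
```

So PyTorch, TensorRT and deviceQuery are all hitting the same condition: the runtime loads, but the driver exposes no GPU to it.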

Hi,

It looks like the GPU is somehow in a bad state.

Are you able to reboot the device?
Could you check whether rebooting restores the GPU to normal?

Thanks.

Hi,

Are you able to reflash your device with 5.1.2 or 6.0GA?

If the device is still under warranty, please check our RMA process:

Thanks.

Hi @AastaLLL

Unfortunately, running sudo reboot hasn’t changed anything. Here is the output of deviceQuery after rebooting (it’s the same):

fov@marvel-fov-8:~/Desktop/cuda-samples/Samples$ cd deviceQuery
fov@marvel-fov-8:~/Desktop/cuda-samples/Samples/deviceQuery$ make
make: Nothing to be done for 'all'.
fov@marvel-fov-8:~/Desktop/cuda-samples/Samples/deviceQuery$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 100
-> no CUDA-capable device is detected
Result = FAIL

Is there any other debugging I can do? This device is deployed and installed remotely in a different country, and it’s quite hard to get it replaced (it would cost us a lot in both time and money).

Thanks
Tim

I wasn’t able to try updating it yet. I will update you tomorrow once I have tried upgrading to 5.1.3 via apt. I’m not able to flash the SD card as it’s remotely deployed.

Hi,

Any luck after apt upgrade?

Thanks.

Hi,

Could you also share the kernel module info with us?

$ find /lib/modules/$(uname -r) -type f -name 'nvidia*.ko*'

Thanks.

Unfortunately, I’m getting errors when I try to run sudo apt upgrade.

When I first tried running sudo apt upgrade, I got this error:

Import process completed.
Done
done.
Processing triggers for libgdk-pixbuf2.0-0:arm64 (2.40.0+dfsg-3ubuntu0.5) ...
Errors were encountered while processing:
 nfs-common
 openssh-server
E: Sub-process /usr/bin/dpkg returned an error code (1)

Before the above, this was printed to the terminal:

Setting up vim (2:8.1.2269-1ubuntu5.23) ...
Setting up gvfs:arm64 (1.44.1-1ubuntu1.2) ...
Setting up libgs9:arm64 (9.50~dfsg-5ubuntu4.11) ...
Setting up openssh-server (1:8.2p1-4ubuntu0.11) ...
dpkg: error processing package openssh-server (--configure):
installed openssh-server package post-installation script subprocess returned error exit status 10
Setting up python3-talloc:arm64 (2.3.3-0ubuntu0.20.04.1) ...
Setting up software-properties-gtk (0.99.9.12) ...
Setting up libfwupd2:arm64 (1.7.9-1~20.04.3) ...

Running sudo apt install --fix-broken doesn’t help fix anything:

fov@marvel-fov-8:~$ sudo apt install --fix-broken
[sudo] password for fov: 
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages were automatically installed and are no longer required:
  apt-clone archdetect-deb bogl-bterm busybox-static cryptsetup-bin dctrl-tools dpkg-repack gir1.2-goa-1.0 gir1.2-timezonemap-1.0
  gir1.2-xkl-1.0 grub-common libdebian-installer4 libfwupdplugin1 libtimezonemap-data libtimezonemap1 libxmlb1 os-prober
  python3-icu python3-pam rdate tasksel tasksel-data
Use 'sudo apt autoremove' to remove them.
0 to upgrade, 0 to newly install, 0 to remove and 0 not to upgrade.
2 not fully installed or removed.
After this operation, 0 B of additional disk space will be used.
Setting up openssh-server (1:8.2p1-4ubuntu0.11) ...
dpkg: error processing package openssh-server (--configure):
 installed openssh-server package post-installation script subprocess returned error exit status 10
Setting up nfs-common (1:1.3.4-2.5ubuntu3.7) ...
dpkg: error processing package nfs-common (--configure):
 installed nfs-common package post-installation script subprocess returned error exit status 10
Errors were encountered while processing:
 openssh-server
 nfs-common
E: Sub-process /usr/bin/dpkg returned an error code (1)

Here is the kernel module info:

fov@marvel-fov-8:~$ find /lib/modules/$(uname -r) -type f -name 'nvidia*.ko*'
find: ‘/lib/modules/5.10.104-tegra/kernel/net/ipv6’: Structure needs cleaning
find: ‘/lib/modules/5.10.104-tegra/kernel/net/dsa’: Structure needs cleaning
find: ‘/lib/modules/5.10.104-tegra/kernel/drivers/gpu/nvgpu’: Structure needs cleaning
/lib/modules/5.10.104-tegra/kernel/drivers/nv-p2p/nvidia-p2p.ko
find: ‘/lib/modules/5.10.104-tegra/kernel/drivers/crypto’: Structure needs cleaning
/lib/modules/5.10.104-tegra/extra/opensrc-disp/nvidia.ko
/lib/modules/5.10.104-tegra/extra/opensrc-disp/nvidia-drm.ko
/lib/modules/5.10.104-tegra/extra/opensrc-disp/nvidia-modeset.ko
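"Structure needs cleaning" is the system message for `EUCLEAN`, which ext4 returns when it detects on-disk corruption, so it usually means the filesystem needs an fsck. One of the unreadable directories is `drivers/gpu/nvgpu`, which, judging by its name, is where the GPU kernel module would live; that may be why no GPU shows up, though this is a guess from the path alone. The sketch below walks the module tree and lists every directory that cannot be read, so the full extent of the damage is visible. The helper name `find_unreadable_dirs` is hypothetical.

```python
import errno
import os
import platform
from typing import List, Tuple

# EUCLEAN is Linux-specific; 117 is its value on Linux if the constant is missing.
EUCLEAN = getattr(errno, "EUCLEAN", 117)


def find_unreadable_dirs(root: str) -> List[Tuple[str, int]]:
    """Walk root and return (path, errno) for every directory that fails to list."""
    failures: List[Tuple[str, int]] = []

    def record(err: OSError) -> None:
        failures.append((err.filename, err.errno))

    # os.walk swallows listing errors unless an onerror callback is given.
    for _dirpath, _dirnames, _filenames in os.walk(root, onerror=record):
        pass
    return failures


if __name__ == "__main__":
    # Same directory the find command above searches: /lib/modules/$(uname -r)
    root = os.path.join("/lib/modules", platform.release())
    for path, code in find_unreadable_dirs(root):
        reason = "filesystem corruption (EUCLEAN)" if code == EUCLEAN else os.strerror(code)
        print(f"{path}: {reason}")
```

Pointing the same function at `/` instead would show whether the corruption is limited to the kernel module tree.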

Thanks
Tim

Hi,

Could you try running deviceQuery with sudo to see if anything is different?
This helps to verify whether this is a permission issue.

Thanks.

Hi @AastaLLL ,

Sorry for the late response, but I can confirm that running with sudo doesn’t change anything.

Hi,

Would you mind adding the account to the video group and trying again?

sudo usermod -a -G video <username>
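To check whether the account is already in the group before (or after) running the command above, something like the following sketch should work. It uses the standard `grp` and `pwd` modules; the helper name `is_in_group` is made up for illustration.

```python
import grp
import os
import pwd
from typing import Iterable


def is_in_group(user: str, group_members: Iterable[str],
                user_primary_gid: int, group_gid: int) -> bool:
    """True if user is listed as a member or has the group as primary group."""
    return user in group_members or user_primary_gid == group_gid


if __name__ == "__main__":
    try:
        account = pwd.getpwuid(os.getuid())
        video = grp.getgrnam("video")
    except KeyError as exc:
        print(f"lookup failed: {exc}")
    else:
        member = is_in_group(account.pw_name, video.gr_mem, account.pw_gid, video.gr_gid)
        print(f"{account.pw_name} in 'video' group: {member}")
```

Note that a group change made with usermod only applies to new login sessions, so the account needs to log out and back in before retrying.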

Thanks.