Issue with NVIDIA Jetson AGX Orin Developer Kit (64 GB)

I am using the NVIDIA Jetson AGX Orin Developer Kit (64 GB) to run LLM inference. I am using ollama to run llama3.2:3b, but inference is very slow. When I try to check the process with nvidia-smi, it reports that no running processes were found.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 540.4.0                Driver Version: 540.4.0      CUDA Version: 12.6     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Orin (nvgpu)                  N/A  | N/A              N/A |                  N/A |
| N/A   N/A  N/A               N/A /  N/A | Not Supported        |     N/A          N/A |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I am using this Python code to check whether CUDA is working:

import torch

# Create tensors directly on the GPU to confirm CUDA is usable
print(torch.randn(1).cuda())
print(torch.rand(5, 3, device=torch.device("cuda")))

and it returns this output:

/mnt/storage/miniconda3/envs/usr/lib/python3.10/site-packages/torch/_subclasses/functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /opt/pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
tensor([-0.2943], device='cuda:0')
tensor([[0.8285, 0.6868, 0.8449],
        [0.6386, 0.1849, 0.4723],
        [0.2465, 0.7683, 0.2403],
        [0.5859, 0.7134, 0.0960],
        [0.8186, 0.0581, 0.5210]], device='cuda:0')

and this is the output of nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:14:07_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0

This machine runs Linux aarch64 with the Ubuntu 22.04 (Jammy Jellyfish) distribution.

*** Please note that this reply was generated automatically by an LLM ***

I’d be happy to help you troubleshoot the slow inference issue with LLM on your NVIDIA Jetson AGX Orin Developer Kit (64 GB).

Firstly, it’s great that you’re using the ollama tool to run LLM inference. However, the slow inference performance could be due to various reasons. Let’s break down the possible causes and potential solutions:

  1. CUDA Setup and Installation: Ensure that CUDA is properly installed and configured on your system. You can verify this by running the nvcc --version command. Note that on Jetson, CUDA is installed as part of JetPack rather than via the desktop CUDA installer for Ubuntu 22.04.
  2. CUDA Process Check: You’ve mentioned that nvidia-smi reports no running processes. On the Jetson integrated GPU (shown as “Orin (nvgpu)”), nvidia-smi support is limited, which is why most fields read N/A; this output alone does not prove the GPU is idle. Use tegrastats (or jtop) to monitor GPU activity instead.
  3. LLM Model Size and Complexity: Model size and complexity directly affect inference speed. You can try a smaller model, such as llama3.2:1b, to see if the inference performance improves.
  4. Hardware Acceleration: The Jetson AGX Orin Developer Kit has a powerful GPU, but it’s essential to ensure that the GPU is actually being used for inference. Use tegrastats or jtop to monitor GPU usage during inference (nvidia-smi does not report utilization for the integrated GPU).
  5. Software Optimization: The ollama build you are running might not be fully optimized for the Jetson AGX Orin hardware. You can try other LLM inference stacks that are optimized for NVIDIA hardware, such as TensorRT-LLM or llama.cpp built with CUDA support.
  6. Memory and Compute Resources: Ensure that the system has sufficient memory and compute resources to run the LLM inference. You can monitor system resources using tools like htop or tegrastats (Jetson uses unified memory shared between CPU and GPU).

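Since tegrastats (not nvidia-smi) is the tool that reports integrated-GPU activity on Jetson, a small sketch of pulling the GPU utilization out of a tegrastats line may help; it assumes the usual `GR3D_FREQ <n>%` field format:

```python
import re
from typing import Optional

def gr3d_utilization(tegrastats_line: str) -> Optional[int]:
    """Extract the GPU (GR3D) utilization percentage from one tegrastats line.

    tegrastats reports the integrated GPU as a field like "GR3D_FREQ 45%";
    returns None if the field is absent.
    """
    m = re.search(r"GR3D_FREQ (\d+)%", tegrastats_line)
    return int(m.group(1)) if m else None

# Abridged example in the format tegrastats emits on Jetson boards:
sample = "RAM 3067/62780MB SWAP 0/31MB CPU [12%@2201] GR3D_FREQ 45%"
print(gr3d_utilization(sample))  # 45
```

If this stays near 0% while ollama is generating tokens, inference is running on the CPU, which would explain the slow speed.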
To further troubleshoot the issue, I recommend:

  1. Check the CUDA version: Ensure that the CUDA version is compatible with the Jetson AGX Orin hardware.
  2. Verify the LLM model configuration: Check the LLM model configuration, including the batch size, sequence length, and precision, to ensure that it’s optimized for the hardware.
  3. Monitor system resources: Use tools like htop or tegrastats to monitor memory, CPU, and GPU usage during inference.
  4. Try a different LLM model: Try a smaller model, such as llama3.2:1b, to see if the inference performance improves.
  5. Optimize the inference code: Optimize the inference code to take advantage of the Jetson AGX Orin hardware, for example with TensorRT-LLM.
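As a rough back-of-the-envelope check on the memory and precision points above, the weight footprint of a model scales with parameter count times bytes per parameter. This sketch ignores the KV cache and runtime overhead, and the bytes-per-parameter figures for the quantized formats are approximations:

```python
def weight_memory_gb(n_params_billion: float, bytes_per_param: float) -> float:
    """Rough weight-only memory footprint in GB (1e9 params * bytes / 1e9)."""
    return n_params_billion * bytes_per_param

# Approximate bytes per parameter: fp16 = 2, 8-bit quant ~ 1, 4-bit quant ~ 0.5
for name, bpp in [("fp16", 2.0), ("q8_0", 1.0), ("q4_0", 0.5)]:
    print(f"3B weights at {name}: ~{weight_memory_gb(3.0, bpp):.1f} GB")
```

Even at fp16, a 3B model's weights (~6 GB) fit comfortably in the 64 GB of unified memory on this board, so memory pressure is unlikely to be the cause here.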

By following these steps, you should be able to identify and address the root cause of the slow inference performance on your NVIDIA Jetson AGX Orin Developer Kit.

Hi,

How do you set up PyTorch in your environment?
You can find the package that has GPU support under the link below:

Below are our steps to install Ollama with GPU support for your reference:

Thanks.

I ran these commands, but nothing changed in the ollama inference speed. Meanwhile, when I used the jtop command to track the GPU status, it showed "Jetpack not detected".
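jtop typically detects JetPack by reading /etc/nv_tegra_release, so "Jetpack not detected" usually means that file is missing or unreadable. A minimal sketch to check it directly, assuming the usual "# R<major> (release), REVISION: <minor>" header format:

```python
import re
from pathlib import Path
from typing import Optional

def l4t_version(release_line: str) -> Optional[str]:
    """Parse the L4T version from the header line of /etc/nv_tegra_release."""
    m = re.search(r"# R(\d+) \(release\), REVISION: ([\d.]+)", release_line)
    return f"{m.group(1)}.{m.group(2)}" if m else None

# Illustrative header line (real contents vary by JetPack release):
sample = "# R36 (release), REVISION: 3.0, GCID: 12345, BOARD: generic, EABI: aarch64"
print(l4t_version(sample))  # 36.3.0

# On the Jetson itself, read the real file:
path = Path("/etc/nv_tegra_release")
if path.exists():
    print(l4t_version(path.read_text().splitlines()[0]))
else:
    print("L4T release file not found - JetPack/L4T may not be installed")
```

If the file is missing, the CUDA-enabled system components from JetPack may not be installed, which would also explain ollama falling back to the CPU.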

There has been no update from you for a period, so we assume this is no longer an issue.
Hence, we are closing this topic. If you need further support, please open a new one.
Thanks

Hi,

Do you see GPU-related logs when running ollama serve?
Could you share the logs with us?
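When sharing logs, filtering the ollama serve output for GPU-related lines makes the relevant part easy to spot. A small sketch (the sample lines below are illustrative, not verbatim ollama output):

```python
def gpu_log_lines(log_text: str) -> list:
    """Return log lines that mention GPU/CUDA initialisation (case-insensitive)."""
    keywords = ("cuda", "gpu", "library=")
    return [line for line in log_text.splitlines()
            if any(k in line.lower() for k in keywords)]

# Illustrative excerpt, not verbatim ollama output:
sample_log = (
    'time=... level=INFO msg="starting server"\n'
    'time=... level=INFO msg="inference compute" library=cuda\n'
    'time=... level=INFO msg=done'
)
print(gpu_log_lines(sample_log))
```

If no line in the real log mentions CUDA at all, the server never initialised the GPU backend and is running inference on the CPU.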

Thanks.