Built-in Self-Test for Jetson AGX Xavier

Hi,

We are building an avionics system using the Jetson AGX Xavier. As part of the system design, we need to include several BIST (Built-In Self-Test) diagnostics along with the application code.

When I researched online, I found that desktop-grade NVIDIA GPUs provide “NVML” (NVIDIA Management Library), but it is not supported on the Jetson AGX Xavier.

I also found that Jetson supports “tegrastats”.

However, “tegrastats” seems limited and doesn’t cover hardware tests comprehensively.

I am looking for an “NVML”-like SDK for Jetson devices.

Can someone please help?

Thanks & Regards,
Aravind

I can’t give you a good answer, but here is some information you might find useful…

Desktop PCs use a discrete GPU (dGPU) on the PCI bus (and a dGPU has its own RAM). Jetsons have an integrated GPU (iGPU) which is tied directly to the memory controller (and which shares system RAM). Much of the GPU management and detection software you will find for the dGPU world depends on PCI queries. None of that works on the Jetson since it isn’t a PCI GPU.

It might be possible to write a “virtual” PCI-to-iGPU emulator in kernel space, but it’d be difficult if it is to work with “stock” PCI query tools (I think it could be done by NVIDIA, but it wouldn’t be easy even for them).

tegrastats is aware of the iGPU. I’ve seen other people asking for a “more evolved” tegrastats, which probably wouldn’t be too difficult. It might even be on NVIDIA’s radar, not sure.

Meanwhile, some of the statistics and data that applications such as tegrastats use come from reading files in “/sys” (which are not real files; they exist in RAM and are actually part of drivers just pretending to be files). At other times tegrastats will perform system calls to the kernel (which might be via a PCI call in the case of a dGPU). If you don’t have the “strace” program, then “sudo apt-get install strace”. This allows watching system calls as they occur (it requires sudo/root access).

One can watch the system calls (which are shown in something very close to C syntax) and figure out what the program is calling. If you are sufficiently motivated, then you could learn to make those system calls directly in your own code, or else to read the files noted in the strace output (the “openat” call is shown when a file is opened, along with the file’s full path).

tegrastats queries about once or twice per second (not sure of the exact rate), and you can watch its system calls live like this:
sudo strace tegrastats
(use CTRL-c when done)

Similarly, a log file could instead be created:
sudo strace -oTraceLog.txt tegrastats
(then examine TraceLog.txt; use CTRL-c after tegrastats has printed stats maybe twice, since there is an enormous amount of output)
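
To pull the file names out of a log like TraceLog.txt, a short script along these lines might help. It is only a sketch: the regular expression assumes the usual one-call-per-line strace output, `opened_paths` is a name made up for this example, and the path in the comment is a placeholder rather than a real Jetson file.

```python
import re
from collections import Counter
from pathlib import Path

# Matches lines such as (placeholder path, not a real Jetson file):
#   openat(AT_FDCWD, "/sys/some/path", O_RDONLY) = 3
OPENAT_RE = re.compile(r'openat\([^,]+,\s*"([^"]+)"[^)]*\)\s*=\s*(-?\d+)')


def opened_paths(trace_text):
    """Count how often each path was successfully opened in strace output."""
    counts = Counter()
    for line in trace_text.splitlines():
        match = OPENAT_RE.search(line)
        # A negative return value means the open failed (e.g. ENOENT).
        if match and int(match.group(2)) >= 0:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    log = Path("TraceLog.txt")
    if log.is_file():
        text = log.read_text(errors="replace")
        for path, count in opened_paths(text).most_common():
            print(f"{count:6d}  {path}")
    else:
        print("TraceLog.txt not found; create it with strace first")
```

Paths that show up once per tegrastats update are probably the ones being polled for statistics, so sorting by count is a quick way to spot them.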

Note that you will also see ioctl calls, which basically extend the “treat everything like a file” model to cases where the file is really a pseudo file that is in turn part of a driver.

ltrace (“library trace”) is similar, but for calls to linked libraries.

Because of the lack of a PCI interface I don’t think you will find much “preexisting” software for management. On the other hand, any system call you find which might be of interest is something you could ask about here on the forums. Any file which is queried is something you can read directly yourself, without any modification or special software (this is what the “openat” call is doing).
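
As an illustration of reading such files directly, here is a minimal sketch that reads the generic Linux thermal zone files under /sys/class/thermal. Assumptions: which zones a given Jetson exposes (and what they are named) will vary, the 85.0 C limit is a placeholder and not an NVIDIA figure, and the function names are hypothetical.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone type: temperature in deg C} for each readable thermal zone."""
    readings = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            name = (zone / "type").read_text().strip()
            # The kernel reports the temperature in millidegrees Celsius.
            millidegrees = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone missing, unreadable, or not reporting a number
        readings[name] = millidegrees / 1000.0
    return readings


def over_limit(readings, limit_c):
    """Return the names of zones hotter than limit_c, sorted alphabetically."""
    return sorted(name for name, temp in readings.items() if temp > limit_c)


if __name__ == "__main__":
    zones = read_thermal_zones()
    for name, temp in zones.items():
        print(f"{name}: {temp:.1f} C")
    # 85.0 is an arbitrary placeholder threshold for this example.
    print("over limit:", over_limit(zones, limit_c=85.0))
```

The same pattern (open the pseudo file, parse the value, compare against a limit) should carry over to any other file that shows up in the strace output.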

@linuxdev: Thank you for the detailed answer.

It is unfortunate that we will not be able to use NVML (NVML API Reference Guide :: GPU Deployment and Management Documentation (nvidia.com)) calls on Jetson devices.

I understood your suggestion regarding strace. Let me check those things.

I just wanted to ask one more question:
Do you know of any reference, library, or application code already implemented for performing self-tests on Jetson devices? Since Jetson devices have been used in aerospace applications for many years, NVIDIA might have a reference design for it.

Thanks,
Aravind

Someone from NVIDIA might know, but I do not.

This is a longshot, but if you happen to be using Concurrent’s software for the soft/hard realtime, then they would likely have some very good answers:
https://concurrent-rt.com/partners/technology/nvidia/

I say this because developing something close to hard realtime implies a lot of testing, including latencies which would apply to your case.

Thanks @linuxdev
