How long does it take from the start of the heavy CUDA work to the shutdown? What is the die temperature just before shutdown?
Happens almost instantly - no idea of the die temperature, but the heatsink was not warm.
I don’t think this is a temperature issue, as it works fine on a higher supply voltage.
Is it possible for you to try running the development board on 12V with some facial recognition or other stuff that stresses the CUDA cores?
We’ve tried to repro with a 12V adapter but cannot hit any shutdown error.
Just to align on the app we are using: could you hit this error with any CUDA or TensorRT app provided by NVIDIA? It would be better if it is directly from JetPack (MMAPI/Argus/CUDA sample).
I just tried the Darknet app and the backend sample in MMAPI but could not hit the error.
We cannot reproduce this so far. Could you share the detailed steps and the heavy CUDA workload you used so that we can reproduce it?
The information I have from our developers is:
We are using several deep networks:
I don’t think the issue arises because we use several networks at the same time; e.g. the issue would (probably) happen if we only used mtcnn or only ran object detection. While we didn’t use Darknet when we experienced the issue, I’m almost certain that Darknet would also trigger it.
Thanks for your reply. May I confirm the following:
- Does this issue happen only on a single AGX device, or on many devices?
- If you have multiple devices that can hit this error, could you confirm that the issue only happens when using <12V?
- It sounds like the issue only reproduces with some probability, is that right? Do different apps have different repro rates? If so, could you tell us the easiest way to reproduce the issue?
Any extra comments on this issue would also help. It is hard for us to debug because there seems to be no 100% reliable way to reproduce it yet.
To reply to your questions:
1). We only have a single AGX, so I can only comment on that, but it seems that other people in this thread running Darknet Detector demo (not what we are doing) also have the problem.
2). The issue only seems to occur when running from 12V - not less than 12V, but pretty much exactly 12V. As I’ve mentioned previously, the 12V supply can provide more than enough current. I haven’t tried increasing it from 12V to, say, 13V - we’ve now adjusted the supply to 19V and don’t have the issue. I’m posting this information more for others who may have seen something similar.
3). This I can’t really answer, but I think a previous person on this thread gave precise instructions using the Darknet demo with code on how he reproduced it. I don’t think the code itself is the problem - it seems to occur when enough CUDA cores are being stressed (but I’m a HW engineer, not SW!)
So my advice to reproduce this is to go back to the start of the original thread and follow the steps the user outlined with a voltage of 12V.
Thanks for the reply.
May I ask what power mode you are using? For more detail about power modes, you can refer to the L4T documentation.
Can you confirm which power supply was in use during the shutdown? Could you reproduce the issue with the original adapter?
Sorry for the slow response - been very busy!
The power supply that we saw the issue with was a M2-ATX-HV 140W Intelligent Very Wide Input 6-32V Vehicle/Battery DC-DC Power Supply from: https://www.mini-itx.com/store/psu?c=10#M2-ATX
The supply was only loaded on the 12V line. We measured the 12V rail with a high-end Agilent scope and there were no dips/spikes during the incident. The spec for this power supply says that it can handle at least 6A @ 12V, and the lack of dips suggests that it can easily.
The next supply we used was from eBay:
Set to 12V we still saw the same problem. We tested the output with a programmable load and again it could easily deliver 6A without problems. With the supply set to 19V (the same as the original adapter) we don’t see any problems, so this is what has now been installed in our robot. As it is installed in the robot, it is not possible to make further tests at, say, 12.5V or 13V.
We have an FAE from NVIDIA coming for a meeting on Thursday - I will try to set up an identical supply off-robot with differing voltages to recreate the issue and show him.
Sorry for late reply. How is the status?
After several tests with different power supplies, we confirm we cannot reproduce the issue across the range of 9V~19V.
If anyone still has a similar issue, please share the details of your power supply.
We haven’t had a chance to investigate further with the M2 power supply, but the problem seems to be restricted to it, so I don’t think this is an issue anymore as we have perfect operation from another supply.
However, I think it would be very interesting to know what the peak power demands of the Xavier are, because they are definitely much more than 30W when starting up the CUDA cores.
The power supply we are now using is capable of 10A@12V and that works fine - the one that seems to fail is supposed to be capable of at least 6A@12V, and we tested it to that level with a programmable load. I assume that there are big spikes in demand that we didn’t see (even though we had a scope on the supply) that the supply just couldn’t handle. It is quite strange though, as a duplicate supply is powering an AMD Ryzen 5 1600 processor board on the same robot with no problems.
We only run the yolo2 detector for this test. The power demand I saw was around 3.5A@12V.
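A rough sanity check on those figures: the steady-state numbers reported in this thread leave a lot of apparent headroom, which supports the theory that the failures are transient rather than average-power related. A minimal sketch (only the 3.5A draw and the 6A rating come from the thread; everything else is simple arithmetic):

```python
# Back-of-envelope power budget using figures reported in this thread.

def power_w(volts: float, amps: float) -> float:
    """Instantaneous power P = V * I."""
    return volts * amps

measured_draw = power_w(12.0, 3.5)   # ~42 W observed while running the yolo2 detector
supply_rating = power_w(12.0, 6.0)   # 72 W rated (verified with a programmable load)

headroom = supply_rating / measured_draw
print(f"steady-state draw: {measured_draw:.0f} W")   # 42 W
print(f"supply rating:     {supply_rating:.0f} W")   # 72 W
print(f"headroom factor:   {headroom:.2f}x")         # 1.71x
```

On paper the failing supply had over 70% headroom, which is why sub-microsecond transients (discussed later in the thread) look like the more plausible culprit than average wattage.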
I am having the same problems (instant shutdowns the moment TensorFlow inference starts) with my Xavier powered from a car fuse-box connection.
If I start the engine or connect the module directly to the car battery itself, it works normally. It also works normally on my LiPo 3S 5500mAh 30C battery, which is also around 12V.
So I figure my fuse-box connection cannot feed enough power (watts) to the module with the engine off (12V), but it can when the rail is around 14.x V with the alternator charging.
Good to see you can use Xavier on a car now. :-)
Thanks for your feedback!
I am trying to run a Xavier in a car as well off a 12V source. It works fine on nvpmodel 3 for me but the system shuts off randomly if I try to use nvpmodel 0 and run some intensive inferences. I see no voltage drops whatsoever; I’m able to power 70W of motors from the same source without significant voltage drop.
I feel that it may have something to do with the Xavier itself.
One option might be to step up from 12V to 19V but I really don’t want to have to do that.
The drop in voltage occurs over an extremely short time span (think of a time comparable to a clock cycle in the Xavier). I am just guessing, but voltage spikes/drops over a fraction of a microsecond probably matter.
If you look at a desktop PC power supply and how the regulators are added you will find a great emphasis on multiple power supply regulators working in different phases. I found an interesting URL talking about power supply phases here:
Note that the size of a PC motherboard is far larger than a Xavier. Embedded systems simply can’t handle the board space required by all of those phases. To exist in a tiny environment the alternative is to provide a better regulated main power supply. If you were to look at the power supply in a quality computer, and see how well it regulates, and then think about replacing it with a car battery (including while cranking over the vehicle’s starter motor), it probably sends shivers down your spine. I’m sure the total wattage available from a car alternator and battery far exceeds the consumption of the average desktop PC…but who would consider running a PC directly off of such a system with the rapid changes in delivery voltage? The Xavier is no different in needing stable power regardless of total average power…and is probably far more sensitive to that just because it can’t afford the multi-phase regulators on a tiny footprint.
The issue is not about total current availability, but instead about the stability over times as short as fractions of a microsecond.
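The multi-phase point can be put in rough numbers: with N interleaved phases switching at frequency f, each phase ideally carries I/N and the effective ripple frequency at the output rises to N·f, which is part of why desktop VRMs hold their rail so steady. A minimal sketch of that idealized relationship (the 4-phase / 300 kHz / 40 A figures below are illustrative assumptions, not Xavier or PC specs):

```python
# Idealized interleaved-regulator arithmetic: per-phase current and effective
# output ripple frequency. Ignores ripple-cancellation details, ESR, and ESL.

def interleaved_phases(total_current_a: float, phases: int, f_switch_hz: float):
    """Return (per-phase current I/N, effective ripple frequency N*f)."""
    return total_current_a / phases, phases * f_switch_hz

# Illustrative numbers: 40 A total load, 4 phases, 300 kHz per-phase switching.
per_phase, ripple_hz = interleaved_phases(40.0, 4, 300e3)
print(f"per-phase current: {per_phase:.0f} A")          # 10 A
print(f"effective ripple:  {ripple_hz / 1e6:.1f} MHz")  # 1.2 MHz
```

Higher effective ripple frequency means smaller, faster transients for the output capacitors to absorb, which is exactly the luxury a board the size of the Xavier cannot afford.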
Is there any NVIDIA-recommended off-the-shelf power-cleaning circuitry? Or might a big fat input capacitor be enough?
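For the big-input-capacitor idea, the ideal-capacitor relation C = I·Δt/ΔV gives a starting point for sizing. The load step, duration, and allowable droop below are illustrative assumptions, not measured Xavier transients:

```python
# Rough bulk-capacitor sizing from C = I * dt / dV (ideal capacitor, no ESR/ESL).
# All three inputs are assumptions for illustration; measure your own transient.

def bulk_cap_farads(step_current_a: float, duration_s: float, allowed_droop_v: float) -> float:
    """Capacitance needed to supply a current step for a given duration
    while the rail droops no more than allowed_droop_v."""
    return step_current_a * duration_s / allowed_droop_v

# Assumed: a 5 A load step lasting 100 us, with at most 0.5 V droop on a 12 V rail.
c = bulk_cap_farads(5.0, 100e-6, 0.5)
print(f"required capacitance: {c * 1e6:.0f} uF")  # 1000 uF
```

In practice ESR and ESL matter: a single large electrolytic does little at the sub-microsecond timescales discussed above, so low-ESR ceramics placed close to the module are usually combined with the bulk capacitance rather than relying on one big capacitor alone.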