JTX1 becomes unresponsive with ZED stereo camera & ROS wrapper

Hello folks,

I have a Jetson TX1 that I recently flashed using JetPack L4T 2.1 (Linux x64). After flashing, I installed CUDA and OpenCV4Tegra.

Then I installed ROS indigo, the ZED driver and successfully compiled the ZED ROS wrapper.

I am able to lock up (freeze) the Jetson TX1 almost instantly as follows:
In terminal 1 => roslaunch zed_wrapper zed.launch
In terminal 2 => rqt_image_view
In terminal 3 => rostopic echo /camera/point_cloud/cloud > cloud.txt
Give it a minute and it locks up the JTX1.

At first I thought that perhaps with the high data rate and all, saving the data in a file was the culprit. However, when I don’t perform the third step (i.e. I just launch the zed_wrapper in one terminal and the rqt_image_view in another), it runs fine for about an hour and then the JTX1 becomes unresponsive again.
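For a rough sense of the data rate behind that suspicion, here is a back-of-the-envelope sketch. The resolution, bytes per point and frame rate are illustrative assumptions, not values taken from the ZED wrapper, and `cloud_bytes_per_second` is a hypothetical helper name.

```python
def cloud_bytes_per_second(width, height, bytes_per_point, fps):
    """Raw payload rate of a dense point cloud, ignoring message headers."""
    return width * height * bytes_per_point * fps


# Assumed values for illustration only: a 1280x720 cloud,
# 16 bytes per point (e.g. four 32-bit fields), 15 clouds per second.
rate = cloud_bytes_per_second(1280, 720, 16, 15)
print("%.1f MB/s" % (rate / 1e6))  # prints "221.2 MB/s"
```

Note that rostopic echo writes each message as text, so the file on disk would likely grow faster than this binary estimate suggests.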

I first thought that this might be an overheating issue, but when I logged the temperatures using Psensor, neither the CPU nor the GPU went above 50 degrees centigrade. As a matter of fact, the last time the JTX1 became unresponsive, both the CPU and GPU were at only 44 degrees C. In all cases the fan was running. My CPU runs at 33 degrees centigrade when it idles.

I get similar results if I am not running rqt_image_view but instead, for example, print out the point cloud topic (i.e. $ rostopic echo /camera/point_cloud/cloud).

I once had the JTX1 freeze up when I was running the ZED Depth Viewer in /usr/local/zed/tools, but I have now been running this tool for over three hours without a lockup: both the CPU and GPU are at a steady 50 degrees C.

Has anyone else had similar experiences?

Cheers

Galto

Reposting Galto2000’s reply since I can’t seem to unhide it:


I am just replying to my own post here. For some reason I was categorized as a spammer. That has been fixed now, but in the meantime it caused my initial post to get buried under new posts.

Anyhow I have some more insights.

  1. I am pretty convinced that this is not an overheating issue.
  2. I have also disabled the USB auto-suspend, but that didn’t help either.

Thanks

Galto

It sure seems to be overheating. I’ve gotten mine into a state where it wouldn’t boot back up until it cooled down some. However, if you view the images on another machine via ROS, for instance, it will run all day. So it’s something to do with running the Xserver and its app along with the application to read the Zed all on the same machine.

It could be something to do with the Zed apps too. I’ve not been able to run them via ssh -X, for instance. Sample programs are usually testing hacks that are polished up a bit and shipped. But if you are running the ROS wrapper and use rqt on another machine to view the resulting images, it will not shut down. So I’m home free, but I can see others having issues if they are trying to use the built-in video to view the output of the Zed.

The OS being a mix of 32 and 64 bit shouldn’t make a difference because the kernel and video driver are 64 bit. So my early theory of it having to do 2 transfers for each word instead of 1 isn’t valid. I saw this early on in the Itanium days at Digital: the software was 32 bit and the chip was 64 bit. But the big difference was that there was no 64 bit kernel or driver. The CPUs would heat up quick when you were doing some heavy processing.

It is puzzling, though, as those same programs will run on a TK1 all day without issues. CUDA is 6.5 on the TK1 and 7.0 on the TX1, which could have something to do with it. Hard to pin down without being able to talk to a developer. They could be using some hack that works great on 6.5 and doesn’t on 7.0; going around the API for more performance comes to mind. The crash doesn’t leave anything behind to look into that I can find.

Easy enough to write a script to print out the CPU temp so you can see where it dies and how the temp rises. The temp rises quickly. It helps to turn on the fan manually before you start running anything with the Zed, but that only gives you a little more time. Eventually it gets to a point where the fan isn’t doing much good and just slows the rise. Mine tends to die at around 55C. I put it in a mini-ITX case with 2 more fans blowing directly on the processor module, with not much effect. Again it prolonged the time but eventually still crashed. When looking at the images with another computer, though, the temps are pretty steady and never go anywhere near 55C. Although if you don’t have aux fans, it can get behind the curve keeping the CPU cool if you don’t turn on the CPU fan first. Temps go higher, then level off, with just the stock fan and allowing it to decide when to come on.

It’s not USB autosuspend. If you pull the Zed while running, it just goes dark; it doesn’t halt the machine. Seems to be some weirdness between 6.5 and 7.0 of the CUDA lib. I haven’t tried running 6.5 on the TX1, but if they compiled their apps (which seem to be the only apps that crash the machine) with 6.5, maybe that is the issue. Should work, but I’ve spent plenty of time debugging stuff that should work :) I haven’t looked for the source for those apps; it’s possible that if they were compiled on a TX1 they wouldn’t exhibit the behavior.
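The temperature-logging script mentioned above could look something like the following minimal sketch. It assumes the standard Linux sysfs thermal interface under /sys/class/thermal, with each zone reporting millidegrees Celsius; the zone names and exact paths on a given L4T release may differ. The function names are hypothetical. Each line is synced to disk so the last reading before a hard freeze should survive.

```python
import glob
import os
import time


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_type: temp_celsius} for every readable thermal zone."""
    readings = {}
    for zone_dir in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone_dir, "type")) as f:
                name = f.read().strip()
            with open(os.path.join(zone_dir, "temp")) as f:
                raw = int(f.read().strip())
        except (OSError, ValueError):
            continue  # zone missing, unreadable, or not reporting a number
        # Assumption: the kernel reports millidegrees Celsius.
        readings[name] = raw / 1000.0
    return readings


def log_temperatures(path, interval_s=1.0, samples=None,
                     base="/sys/class/thermal"):
    """Append one timestamped line per sample; samples=None runs forever."""
    count = 0
    with open(path, "a") as log:
        while samples is None or count < samples:
            temps = read_thermal_zones(base)
            fields = " ".join(
                "%s=%.1fC" % (name, temp)
                for name, temp in sorted(temps.items())
            )
            log.write("%.1f %s\n" % (time.time(), fields))
            # Flush and fsync so the line is on disk before a lockup.
            log.flush()
            os.fsync(log.fileno())
            count += 1
            if samples is None or count < samples:
                time.sleep(interval_s)
```

Calling `log_temperatures("temps.log")` in a separate terminal before starting the Zed would leave a file whose last line shows roughly when, and at what temperature, the board stopped responding.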

Initially I was very frustrated with the TX1. I had all my stuff running on 2 TK1’s and figured it would just be a simple matter to switch it over. That’s what I get for thinking ;) But now all my hardware is working just fine. I do keep the 2 TK1’s at the same level so I can swap them out if I run into a problem on the TX1 that is going to take some time to solve. Work in parallel: get stuck on the TX1 problem and switch over. Keeps the frustration level down and I get more done. Those switches aren’t happening much any more. And if you look back through the forums, the TK1 had a similar start.

As a former support tech myself I know it can be really tough to find the “guy” that can tell you right off what is happening with a piece of hardware/software. And the “guy” is usually covered up with other stuff and has to disengage, then look at your stuff. Some “guys” do that well, most don’t. The better they are at their task, it seems, the harder it is for them to break away and pivot to something else. So it’s all a big push-pull deal. As a support rep you get a new board dumped in your lap with minimal training. The materials haven’t been written yet.

Since I’ve switched over to doing everything remotely using ssh -X and ROS, no problems with the Zed. Actually I’m still amazed it works at all on an embedded board. It’s working the TX1 hard. Temps don’t go up nearly as fast running software that is supposed to torture the board.

Of course not all of you are using ROS, but if you want a quick way to verify the hardware is working it’s a good test. And I’m not so sure that a piece of software you wrote yourself using the Zed would cause any problems. I grabbed a little program off GitHub that used OpenCV and displayed Zed images, and it never crashed: https://github.com/Myzhar/qt-jetson-zed-opencv-nosdk So try that and see if it cures the problem. I quit looking at it, but as I remember that program will run forever as well.

Dan

Hi Dan,

thanks for the feedback.

It certainly behaves like an overheating problem, and that was my first thought too. In the meantime, though, I have had my JTX1 hang twice after only 6-10 minutes of running with a CPU temperature of around 38 C (when running the ZED ROS wrapper and rqt_image_view both on the JTX1). I am using Psensor and GKrellM for monitoring my temperatures, btw.

Good idea, I am going to run the viewer on my laptop and verify your findings.

I am also wondering if it could be a USB power issue. I have seen embedded boards freeze up before because the USB camera was drawing too much current and the power brick that came with it couldn’t provide it. I have ordered some DC power jacks (should be arriving today) so that I can hook up my JTX1 to my heavy duty desktop power-supply and see if that makes a difference.

Oh, I did disable USB auto-suspend, but it didn’t help btw.
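For anyone wanting to double-check the same setting, here is a small sketch that reads the usbcore module parameter. It assumes the usual path /sys/module/usbcore/parameters/autosuspend, where a negative value means autosuspend is off for newly detected devices; individual devices can still be overridden through their own power/control files, which this does not inspect. The function names are hypothetical.

```python
def usb_autosuspend_delay(
        param_path="/sys/module/usbcore/parameters/autosuspend"):
    """Return the default autosuspend delay in seconds, or None if unreadable."""
    try:
        with open(param_path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def describe_autosuspend(delay):
    """Turn the raw parameter value into a short human-readable status."""
    if delay is None:
        return "unknown (parameter not readable)"
    if delay < 0:
        return "disabled"
    return "enabled, %d s idle delay" % delay


print(describe_autosuspend(usb_autosuspend_delay()))
```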

I’ll post my findings (assuming I am finally taken off the spam list)

Galto

I ran Dan’s experiment: I had the ZED ROS wrapper on the JTX1 publishing data and rqt_image_view running on my desktop observing the depth image. It ran like this for about 4 hours.

Then I plugged a USB camera into the USB 3.0 hub connected to my JTX1 and guess what, my JTX1 became unresponsive almost right away.

I repeated this right afterwards and it became unresponsive again in just 5 minutes, with my CPU temperature at 43C and my CPU usage at 33%. The fan wasn’t even turning.

I also did a stress test: had a USB 3.0 camera publishing 1280x720 images (via the usb_cam ROS node), and I had 5 instances of rqt_image_view running, all on the JTX1. CPU load was around 65% and temperatures hovered around 50C (fan turning), but I could not get it to crash.

Is there a USB3 HUB involved, and is it powered or unpowered? It might be interesting to see if there is a difference based on where power is coming from.

Yup, there is a USB3.0 HUB involved, and I get the same behaviour whether I plug in the power jack in the USB hub or not. The USB power supply is rated at 3A.

I’ll use my heavy duty desktop power supply next - I just received the DC power jacks from Mouser that mate with the JTX1 and I just have to solder a cable together.

My curiosity was not so much about whether the device had enough power going to it, but about the side effects of drawing power directly from the Jetson versus from an external source. One side effect, of course, is that if power is drawn from the Jetson, more heat is produced in the Jetson’s power delivery elements, and there may even be power spikes on other Jetson power rails. However, if an externally powered hub does not change the lockup behavior, then it would seem the power circuits (and their heat) are unrelated to the issue.