Overheating issue?

As I’ve stated in another thread, I’m running three USB3.0 camera streams and playing them in separate windows. The resolution is 1280x960 @ 20 FPS (although it appears my latest code is not keeping up with the cameras the way three instances of guvcview do).

According to htop, the load average is around 1.3, which is to say that each of the four cores averages around 33%.
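For anyone checking the arithmetic, the per-core figure is the load average divided by the number of cores. Here is a minimal Python sketch, assuming four online cores on the TX1; the helper name `per_core_load` is made up for illustration. Keep in mind that the Linux load average also counts tasks stuck in uninterruptible I/O wait, so it is only a rough stand-in for CPU utilization.

```python
import os


def per_core_load(load_average, cores):
    """Return the load average as a percentage of a single core."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return 100.0 * load_average / cores


if __name__ == "__main__":
    # 1-minute load average, the first of the three figures htop shows.
    one_minute, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"{one_minute:.2f} over {cores} cores = "
          f"{per_core_load(one_minute, cores):.0f}% per core")
```

With the numbers from the post, `per_core_load(1.3, 4)` gives 32.5, which matches the "around 33%" estimate.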

If I let this code run long enough, the system starts stuttering and then hangs. I wondered if there was a memory leak, but then I noticed that I couldn’t get the system to power back on after I powered it off. To get the system to turn on, stay on and boot, I had to let it cool down.

I also noticed after the crash that the bottom of the heatsink was pretty warm.

To determine if this is a heat problem, can someone tell me how to force the fan to 100%?

I’m going to dig around the power settings and see if I can figure it out.

Thanks.

Hi BrianOrb,

The fan is controlled by the two pins “FAN_PWM” (C16) and “FAN_TACH” (B17).
If you’re more software-oriented, you can try putting a 100% duty cycle on FAN_PWM.
If you’re more of a solder guy, you can raise the collector of Q17 (PWM_FAN will then always be high at VDD_1V8).
Q17 is near the fan connector.

Regards,
Ale

Here’s the command for manually controlling the fan PWM from software (run as root):

$ sudo bash
# echo 255 > /sys/kernel/debug/tegra_fan/target_pwm
# exit

The example above sets the fan to maximum speed (PWM=255).
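If you would rather set the fan from a script, the same write can be wrapped in a few lines of Python. This is only a sketch: the path is copied from the command above and may differ on other L4T releases, the 0 to 255 range is an assumption based on 255 being the maximum, and `set_fan_pwm` is a hypothetical helper name.

```python
from pathlib import Path

# Path taken from the echo command above; it may differ on other releases.
FAN_PWM_PATH = "/sys/kernel/debug/tegra_fan/target_pwm"


def set_fan_pwm(value, path=FAN_PWM_PATH):
    """Write a PWM value to the fan control file (needs root on the Jetson)."""
    # Assumed valid range: 0 (off) to 255 (maximum, per the post above).
    if not 0 <= value <= 255:
        raise ValueError("PWM value must be between 0 and 255")
    Path(path).write_text(f"{value}\n")
    return value
```

Run as root, `set_fan_pwm(255)` should have the same effect as the echo command. The `path` argument is there so the function can be tried against an ordinary file first.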

Thanks to both of you.

I changed the code to pass the three cv::Mat objects by reference instead of by copy to try to speed up the code. I did not get a performance improvement (although it probably uses less memory). I have to figure out where my bottleneck is. I suspect (and hope) it’s OpenCV’s display. I’ll add code today to benchmark the frame rate and see if it goes up when I stop displaying the data.
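A frame-rate benchmark of the kind mentioned above can be as simple as counting frames against a monotonic clock. The following Python sketch is not the poster's code (which is C++); `FpsMeter` and its methods are hypothetical names, and the idea is to call `tick()` once per frame, first with display enabled and then with it disabled, and compare the two rates.

```python
import time


class FpsMeter:
    """Counts frames and reports the average rate since the first tick."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so the meter can be tested
        self._start = None       # time of the first tick
        self._last = None        # time of the most recent tick
        self._frames = 0

    def tick(self):
        """Call once for every captured (or displayed) frame."""
        now = self._clock()
        if self._start is None:
            self._start = now
        self._last = now
        self._frames += 1

    def fps(self):
        """Average frames per second between the first and latest tick."""
        if self._frames < 2:
            return 0.0
        elapsed = self._last - self._start
        # N ticks span N - 1 frame intervals.
        return (self._frames - 1) / elapsed if elapsed > 0 else 0.0
```

One meter per camera makes it easy to see whether all three streams slow down together or only one falls behind.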

There does not appear to be a memory leak as memory jumps from around 821 MB to 1022 MB and stays there while the code is running.

It crashes even faster now, generally in a few minutes. It seems I can make it crash faster by playing with the firefox window. The heatsink is not getting hot, but I turned the fan on 100% anyway.

Too bad I don’t have a hardware debugger. I wonder if anything goes out over the serial port that might help us understand the deadlock. Is it easy to rig the serial port?

I know the hardware is capable of running this without issue, since 3 instances of guvcview work just fine at full frame rate, so I’m thinking this could be an openCV / display path issue.

At this point, I’m open to suggestions. I think we need to get to the bottom of this deadlock issue because I may not be the only one to have it.

Thanks.

I just tested by displaying only one image, but still capturing the other two and it ran at full frame rate. It looks like openCV can’t keep up with the three images all running at the same time.

I’m going to figure out how to place all the images into a single wide image and then scale it down to fit on the display. Then I’ll see whether that runs at full frame rate and whether it hangs the TX1.

FYI, when my workstation failed a while back I used JTX1 for a couple of weeks and found firefox could lock the system (under R23.1…I don’t know how R23.2 might have changed things) when accessing flash video. Because my work station was down I could not see if serial console had anything useful on it (and I agree a lot could be learned if a JTAG debugger were attached during one of those failures).

To wire a serial console see:
http://elinux.org/Jetson_TX1#Serial_Console_Wiring

I have all three images concatenated into a single frame, scaled to 33%, and displaying at full frame rate. It runs with a load average of 2.84. I turned on the fan and the heatsink is staying cool. It has been running for many minutes without issue.
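To put rough numbers on why the single scaled frame is so much cheaper to display, here is a small Python sketch of the size arithmetic. It assumes "33%" means a scale factor of one third and that the frames sit side by side; `mosaic_size` is a made-up helper, not an OpenCV function.

```python
def mosaic_size(frame_width, frame_height, count, scale):
    """Size of `count` frames placed side by side, then scaled by `scale`."""
    width = round(frame_width * count * scale)
    height = round(frame_height * scale)
    return width, height


# Pixels drawn per refresh with three separate full-size windows.
separate_pixels = 3 * 1280 * 960

# Pixels drawn per refresh with one combined window at one-third scale.
width, height = mosaic_size(1280, 960, 3, 1 / 3)
combined_pixels = width * height

print(width, height, separate_pixels / combined_pixels)
```

Under those assumptions the combined window is 1280x320, so the display path handles one ninth of the pixels it did with three full-size windows.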

I’m still seeing bad data once in a while.

My next step is to fire up the HEVC encoder. Is there documentation on that? I’m hoping there are hooks in openCV.

Thanks for your support.

FYI: I just left the system running minus firefox for almost two hours and it’s been working fine. I’m going to shut the fan off, and run the original code that crashed to see if it has problems without firefox.

I’m running chrome. I hope it’s more stable.

Thanks again.

I just ran the original code with the fan at 100% and no web browsers up and it still hung.

Cases like this make me wish for the Lauterbach JTAG debugger. Without that, narrowing it down is painful until you find a way to consistently reproduce the crash. One possibility is to use sysrq to do a kernel dump, although I’m not sure if the kernel would need rebuilding with more debugging for that to be useful. See:
https://en.wikipedia.org/wiki/System_request

In those past cases where I was able to lock the system under R23.1 the load was not heavy (and as mentioned above, I did not have a host system at the time so I couldn’t even check serial console).

I’ll start with a serial port. I’ll order it today.

I’m going to keep working on the software with the hope that things will become more stable or that the video output is causing the instability.

Question: I let ubuntu do a bunch of updates. Was that a bad, good, or indifferent thing to do?

In my opinion updates are a good idea. More often than not stability will go up, and in cases where stability is problematic, I think it helps to track down issues and resolve design problems which otherwise would always remain annoying.

Question: will the ubuntu updates overwrite anything that nvidia has installed on top of ubuntu?

I’m thinking about the custom openCV, for example.

The only case I know of Ubuntu packages overwriting nVidia packages was from the original R19.2. After this all updates should be safe (especially the ones where a CUDA repository was added and the CUDA packages come from nVidia anyway…just via the normal remote package manager commands instead of manual download).

It is possible that software installed without a package manager could break when a library with a new ABI/API is installed…but I’ve not heard of any such thing. If this did happen I’m sure it would bring a lot of questions up on these forums. If in doubt back up before updating, but I still recommend updating.

Hello,

I am also having issues where my JTX1 locks up. In my case it happens with the ZED stereo camera, when I am running the ZED ROS wrapper in one terminal and running rqt_image_view in another. I am not sure that this is an overheating issue, because I am also running Psensor (graphing the temperature sensors), and my CPU, GPU and board temperatures are hovering between 48 and 50 degrees C. That’s pretty warm, but I wouldn’t say critical, or am I wrong?

The average processor load is around 40%.

This setup, as described above, always makes my JTX1 become unresponsive, usually within an hour and once within 20 minutes. I can also make it unresponsive in less than two minutes if I direct the point cloud messages to a text file as follows: $ rostopic echo /camera/point_cloud/cloud > cloud.txt (but this could be unrelated).

I ran the same test on my laptop (Dell Precision M4700) today, and I stopped after 4 hours of uneventful running.

I am currently running the same test on my JTK1. It has been running without crashing for an hour now, and it runs about 10 degrees C cooler, hovering around 37 degrees C (but I am also getting a much lower data rate, which is to be expected with the JTK1).

If there is anybody else with a JTX1, a ZED and the ROS wrapper, I would love to know if there is something wrong with my JTX1 in particular or if this is a systemic issue.

My JTX1 was freshly flashed using Jetpack 2.1 with CUDA and OpenCV for tegra, and with ROS Indigo installed after a full upgrade.

(I actually posted a topic on this subject yesterday, but for some reason it has not posted)

Hi Galto200,

Actually, the current version of OpenCV4Tegra is not well optimized for camera support on the Jetson TX1, and that might be causing the issue you are seeing.
We’re going to have an updated version in an upcoming release soon.

Thanks

Hi Kaycc,

thanks for the insight.

So I guess one way to confirm this would be to remove OpenCV4Tegra and install “regular” OpenCV ?

Cheers,

Galto

If other threads are any indication, yes, going to the non-Nvidia OpenCV should work.

Is there any way you can view the pics on another machine? I was having trouble with the ZED apps: running them on the same machine as the camera, it would get hot and stop. But when I used ROS to view the images on another machine, all was well.

Maybe use ssh -X to get to the Jetson and view on the other machine to try it out. On my Mac you need to install the optional X server for that to work, and use its terminal program to log into the Jetson. I haven’t been able to get it working with Linux on the Intel box. The way X works, you will be running the display code on the other box, so it should work the same as using ROS.

The ZED SDK works the GPU pretty hard, and it seems like if you are trying to use the GPU to display as well, it just gets too hot. I put mine in a mini-ITX case with two more fans blowing directly on it and it would still lock. It took longer, but eventually it locked up. You might try using the real X server instead of lightdm to eliminate lightdm from the equation.

I wrote a little script to read the temp and print it out continuously. The weirdness started at about 55C. But when viewing the images on another machine it never heats up and stops, even for several hours.

I am waiting for the 24.1 CUDA and ZED SDK to test again, to see if it goes away when everything is 64-bit. My theory is that somewhere along the pipeline a mixed system has to double up on data transfers, which heats up the chip since it’s doing twice the work. Turning the fan on from the get-go just prolongs the period it works properly but doesn’t keep it from overheating. At one point I had two fans on top plus another 4 inch fan blowing across the board. The upgrade that slowed the CPU clock didn’t seem to faze it either. I’m wondering if you can turn down the GPU clock and see if that does it. I know you can do that on the TK1. It doesn’t seem to be a hardware issue, so I expect it to go away at some point. My bet is with 24.2.
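For anyone who wants a similar temperature logger, here is a minimal Python sketch. It is not Dan's script. It assumes the standard Linux sysfs thermal layout (`/sys/class/thermal/thermal_zone*/temp`, reported in millidegrees Celsius); the zone names on a given L4T release may differ, and `read_temperatures` and `monitor` are hypothetical names.

```python
import glob
import os
import time


def read_temperatures(base="/sys/class/thermal"):
    """Return a list of (zone_type, degrees_C) for each readable zone."""
    readings = []
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "type")) as f:
                name = f.read().strip()
            with open(os.path.join(zone, "temp")) as f:
                # Assumption: the kernel reports millidegrees Celsius.
                millidegrees = int(f.read().strip())
        except (OSError, ValueError):
            continue  # skip zones that are missing or unreadable
        readings.append((name, millidegrees / 1000.0))
    return readings


def monitor(interval=2.0):
    """Print every zone temperature each `interval` seconds until Ctrl-C."""
    while True:
        line = " ".join(f"{name}={temp:.1f}C"
                        for name, temp in read_temperatures())
        print(line, flush=True)
        time.sleep(interval)
```

Calling `monitor()` in a terminal (or over ssh) while the cameras are running gives a running log to compare against the point where the lockups start.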

Dan

If you are interested in reducing GPU use and memory in general, and if your Jetson does not require a graphical login (such as when doing remote viewing over ssh), you could run in text mode. See:
https://devtalk.nvidia.com/default/topic/937815/jetson-tk1/command-line-boot/post/4887236/#4887236

In the case of a Jetson on a robot, with nobody sitting directly at the Jetson, no video card is required and you might as well turn off X11. An X11 application running on a system without an X11 server can still display remotely on another machine which has the actual X11 server. I imagine there are a lot of robot applications of a Jetson where no GUI is needed at all, but which still run X11 when it isn’t necessary. If you don’t have a monitor attached, why run the software for a monitor?