TensorFlow Cats vs Dogs

I installed TensorFlow using these instructions:

http://www.jetsonhacks.com/2017/09/18/build-tensorflow-on-nvidia-jetson-tx1-development-kit/

I’m able to run simple tests.

I wanted to test a cats vs dogs example, but when I run it, the process either reboots the system or ends with a “Process Killed” message.

The code I used is in the repo below:

https://github.com/jtfogarty/Jetson-TX1/tree/master/tensorflow/cats-vs-dogs

Could someone run this code? The time-consuming part is downloading the images from Kaggle.

Thanks for the help

FYI, the “Process Killed” message most likely means the system ran out of RAM. You may need to set the kernel feature CONFIG_SWAP to yes and then add swap space (e.g., via a SATA drive, USB drive, or SD card).

The R28.1 Documentation package gives good information on building the kernel, and kernel builds are explained in other places as well. The key detail many people miss is that the “uname -r” output has a prefix based on the kernel version and a suffix based on CONFIG_LOCALVERSION…the suffix must match that of the currently running install if modules are to be found and loaded correctly. Example “uname -r” output:

4.4.38-tegra

…in that example the kernel version is 4.4.38, and CONFIG_LOCALVERSION during that kernel compile was “-tegra”.
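To make the prefix/suffix split concrete, here is a minimal Python 3 sketch. The helper name split_kernel_release is made up for illustration, and it assumes the release string has the simple "version-suffix" shape shown above; kernels with extra version components may not split this cleanly.

```python
import platform


def split_kernel_release(release):
    """Split a `uname -r` style string into (version, local_suffix).

    Assumes everything before the first "-" is the kernel version and
    everything from that dash onward is the CONFIG_LOCALVERSION suffix.
    """
    version, dash, rest = release.partition("-")
    return version, dash + rest


if __name__ == "__main__":
    # platform.release() returns the same string as `uname -r` on Linux.
    print(split_kernel_release(platform.release()))
```

On the install described above, the suffix returned here is what CONFIG_LOCALVERSION needs to be set to in the new build.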

Instead of using tegra21_defconfig as the starting kernel configuration, I suggest you copy “/proc/config.gz” somewhere, gunzip it, and then rename it to “.config” in your build area. Edit CONFIG_LOCALVERSION to match the running kernel and you are set to begin the kernel build (you would of course use something like “make nconfig” to find and enable CONFIG_SWAP).
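Before rebuilding, it may be worth checking what the running kernel was actually configured with. The following is a rough Python 3 sketch, not an official tool: the function names are made up for illustration, and it assumes the kernel exposes /proc/config.gz (which requires CONFIG_IKCONFIG_PROC).

```python
import gzip


def read_kernel_option(config_text, name):
    """Return the value of an option from .config-style text, or None.

    Disabled options appear as comments ("# CONFIG_SWAP is not set"),
    so they never match and None is returned for them.
    """
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(name + "="):
            # Strip the quotes used around string values.
            return line.split("=", 1)[1].strip('"')
    return None


def load_running_config(path="/proc/config.gz"):
    """Read the gzip-compressed configuration of the running kernel."""
    with gzip.open(path, "rt") as f:
        return f.read()


if __name__ == "__main__":
    try:
        text = load_running_config()
    except OSError as err:
        print("Could not read /proc/config.gz:", err)
    else:
        for option in ("CONFIG_SWAP", "CONFIG_LOCALVERSION"):
            print(option, "=", read_kernel_option(text, option))
```

If CONFIG_SWAP prints as None, swap is not enabled in the running kernel and the rebuild described above is needed.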

FYI, you might want to monitor something like “htop” while running your program…I’m pretty sure you’ll see RAM use climbing up and then failure near the max RAM consumption point.
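If htop is not convenient (for example, when logging to a file over a serial console), a small script can record the same numbers from /proc/meminfo. This is a minimal Python 3 sketch; the function names and the one-second interval are arbitrary choices for illustration.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def summarize(info):
    """Return (ram_used, ram_total, swap_used, swap_total) in MB."""
    # MemAvailable is missing on older kernels; fall back to MemFree.
    available = info.get("MemAvailable", info.get("MemFree", 0))
    ram_total = info.get("MemTotal", 0)
    swap_total = info.get("SwapTotal", 0)
    swap_free = info.get("SwapFree", 0)
    return (
        (ram_total - available) // 1024,
        ram_total // 1024,
        (swap_total - swap_free) // 1024,
        swap_total // 1024,
    )


def watch(samples=5, interval=1.0, path="/proc/meminfo"):
    """Print RAM and swap use `samples` times, `interval` seconds apart."""
    for _ in range(samples):
        with open(path) as f:
            ram_used, ram_total, swap_used, swap_total = summarize(
                parse_meminfo(f.read())
            )
        print("RAM %d/%dMB SWAP %d/%dMB"
              % (ram_used, ram_total, swap_used, swap_total))
        time.sleep(interval)
```

Calling watch(samples=600) in a second terminal while the training script runs would log about ten minutes of memory use, which should be enough to see whether RAM climbs toward the limit just before the failure.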

Hi,

Could you share more information about this issue?

  1. Which script did you execute?
  2. What is the error message?
  3. What does tegrastats show when the error occurs?
     sudo ./tegrastats

Thanks.

Here is what I ran and the output:

nvidia@tegra-ubuntu:~/projects/Jetson-TX1/tensorflow/cats-vs-dogs$ python training.py 
There are 12500 cats
There are 12500 dogs
2017-10-03 11:06:20.797007: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:857] ARM64 does not support NUMA - returning NUMA node zero
2017-10-03 11:06:20.797306: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: NVIDIA Tegra X1
major: 5 minor: 3 memoryClockRate (GHz) 0.9984
pciBusID 0000:00:00.0
Total memory: 3.89GiB
Free memory: 1.80GiB
2017-10-03 11:06:20.797446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-10-03 11:06:20.797551: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-10-03 11:06:20.797658: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0)
Step 0, train loss = 0.70, train accuracy = 0.00%
2017-10-03 11:07:17.633077: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:639] failed to record completion event; therefore, failed to create inter-stream dependency
2017-10-03 11:07:17.076118: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:639] failed to record completion event; therefore, failed to create inter-stream dependency
2017-10-03 11:07:17.021106: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED
2017-10-03 11:07:17.092365: E tensorflow/stream_executor/cuda/cuda_driver.cc:1098] could not synchronize on CUDA context: CUDA_ERROR_LAUNCH_FAILED :: No stack trace available
2017-10-03 11:07:17.076156: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:639] failed to record completion event; therefore, failed to create inter-stream dependency
2017-10-03 11:07:17.844991: F tensorflow/core/common_runtime/gpu/gpu_util.cc:370] GPU sync failed
2017-10-03 11:07:17.855882: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1
Aborted

RAM 3594/3983MB (lfb 5x4MB) SWAP 1222/10240MB (cached 87MB) cpu [28%,13%,19%,3%]@1224
RAM 3582/3983MB (lfb 5x4MB) SWAP 1233/10240MB (cached 80MB) cpu [18%,12%,23%,23%]@102
RAM 3571/3983MB (lfb 5x4MB) SWAP 1234/10240MB (cached 71MB) cpu [16%,15%,10%,26%]@204
RAM 3553/3983MB (lfb 5x4MB) SWAP 1243/10240MB (cached 61MB) cpu [16%,21%,45%,21%]@1734
RAM 3531/3983MB (lfb 5x4MB) SWAP 1262/10240MB (cached 59MB) cpu [12%,31%,50%,12%]@204
RAM 3395/3983MB (lfb 6x4MB) SWAP 1209/10240MB (cached 50MB) cpu [20%,26%,3%,25%]@1224
RAM 3126/3983MB (lfb 21x4MB) SWAP 777/10240MB (cached 43MB) cpu [56%,4%,76%,27%]@1734
RAM 817/3983MB (lfb 422x4MB) SWAP 776/10240MB (cached 44MB) cpu [67%,27%,41%,3%]@816
RAM 817/3983MB (lfb 422x4MB) SWAP 776/10240MB (cached 44MB) cpu [14%,10%,1%,2%]@102
RAM 817/3983MB (lfb 422x4MB) SWAP 776/10240MB (cached 44MB) cpu [7%,0%,0%,11%]@102
RAM 817/3983MB (lfb 422x4MB) SWAP 776/10240MB (cached 44MB) cpu [11%,2%,12%,17%]@102
RAM 818/3983MB (lfb 422x4MB) SWAP 776/10240MB (cached 45MB) cpu [10%,5%,8%,3%]@204
RAM 818/3983MB (lfb 422x4MB) SWAP 776/10240MB (cached 45MB) cpu [5%,5%,1%,8%]@102
RAM 818/3983MB (lfb 422x4MB) SWAP 776/10240MB (cached 45MB) cpu [7%,4%,0%,12%]@102
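For anyone who wants to pull the peak numbers out of a longer tegrastats log, here is a small Python 3 sketch. The regular expression and function names are my own, and the pattern assumes the line format shown above; other L4T releases may print tegrastats differently.

```python
import re

# Matches the "RAM used/totalMB ... SWAP used/totalMB" part of a line.
TEGRASTATS_RE = re.compile(r"RAM (\d+)/(\d+)MB.*?SWAP (\d+)/(\d+)MB")


def parse_tegrastats_line(line):
    """Return RAM and swap figures (in MB) from one line, or None."""
    match = TEGRASTATS_RE.search(line)
    if match is None:
        return None
    ram_used, ram_total, swap_used, swap_total = map(int, match.groups())
    return {
        "ram_used": ram_used,
        "ram_total": ram_total,
        "swap_used": swap_used,
        "swap_total": swap_total,
    }


def peak_usage(lines):
    """Return the parsed sample with the highest RAM use, or None."""
    samples = [s for s in map(parse_tegrastats_line, lines) if s]
    if not samples:
        return None
    return max(samples, key=lambda s: s["ram_used"])
```

Applied to the output above, the peak sample is the first line: 3594 of 3983 MB of RAM in use, with only 1222 of 10240 MB of swap used at that moment.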

Hi,

RAM 3395/3983MB (lfb 6x4MB) SWAP 1209/10240MB (cached 50MB) cpu [20%,26%,3%,25%]@1224

The tegrastats data suggests the device is running out of memory.
Please increase your swap size and check again.
Thanks.

The swap file is set to 10 GB, but I only see usage reach about 1.2 GB.

Has anyone tried this?

I have the code ready on GitHub:
https://github.com/jtfogarty/Jetson-TX1/tree/master/tensorflow/cats-vs-dogs

I assume you monitored something like htop while running. Try running “dmesg --follow” as well and see if dmesg outputs anything as the kill hits (a serial console will keep showing dmesg output longer than a regular terminal will).

Hi,

Sorry for the misleading advice. We missed something important.

Swap can only extend CPU memory, not GPU memory.
If your model requires more than 4 GB of GPU memory, it will run out of memory no matter how much swap is added.

Judging from the output, this program has already reached the memory limit:

RAM 3395/3983MB (lfb 6x4MB) SWAP 1209/10240MB (cached 50MB) cpu [20%,26%,3%,25%]@1224

If you want to run this program on the Jetson platform, please use a TX2, which has twice the memory.
Thanks and sorry for the inconvenience.