Here is what I ran, and the output:
nvidia@tegra-ubuntu:~/projects/Jetson-TX1/tensorflow/cats-vs-dogs$ python training.py
There are 12500 cats
There are 12500 dogs
2017-10-03 11:06:20.797007: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:857] ARM64 does not support NUMA - returning NUMA node zero
2017-10-03 11:06:20.797306: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: NVIDIA Tegra X1
major: 5 minor: 3 memoryClockRate (GHz) 0.9984
pciBusID 0000:00:00.0
Total memory: 3.89GiB
Free memory: 1.80GiB
2017-10-03 11:06:20.797446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-10-03 11:06:20.797551: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2017-10-03 11:06:20.797658: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0)
Step 0, train loss = 0.70, train accuracy = 0.00%
2017-10-03 11:07:17.633077: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:639] failed to record completion event; therefore, failed to create inter-stream dependency
2017-10-03 11:07:17.076118: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:639] failed to record completion event; therefore, failed to create inter-stream dependency
2017-10-03 11:07:17.021106: E tensorflow/stream_executor/cuda/cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED
2017-10-03 11:07:17.092365: E tensorflow/stream_executor/cuda/cuda_driver.cc:1098] could not synchronize on CUDA context: CUDA_ERROR_LAUNCH_FAILED :: No stack trace available
2017-10-03 11:07:17.076156: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:639] failed to record completion event; therefore, failed to create inter-stream dependency
2017-10-03 11:07:17.844991: F tensorflow/core/common_runtime/gpu/gpu_util.cc:370] GPU sync failed
2017-10-03 11:07:17.855882: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:203] Unexpected Event status: 1
Aborted
tegrastats output:
RAM 3594/3983MB (lfb 5x4MB) SWAP 1222/10240MB (cached 87MB) cpu [28%,13%,19%,3%]@1224
RAM 3582/3983MB (lfb 5x4MB) SWAP 1233/10240MB (cached 80MB) cpu [18%,12%,23%,23%]@102
RAM 3571/3983MB (lfb 5x4MB) SWAP 1234/10240MB (cached 71MB) cpu [16%,15%,10%,26%]@204
RAM 3553/3983MB (lfb 5x4MB) SWAP 1243/10240MB (cached 61MB) cpu [16%,21%,45%,21%]@1734
RAM 3531/3983MB (lfb 5x4MB) SWAP 1262/10240MB (cached 59MB) cpu [12%,31%,50%,12%]@204
RAM 3395/3983MB (lfb 6x4MB) SWAP 1209/10240MB (cached 50MB) cpu [20%,26%,3%,25%]@1224
RAM 3126/3983MB (lfb 21x4MB) SWAP 777/10240MB (cached 43MB) cpu [56%,4%,76%,27%]@1734
RAM 817/3983MB (lfb 422x4MB) SWAP 776/10240MB (cached 44MB) cpu [67%,27%,41%,3%]@816
RAM 817/3983MB (lfb 422x4MB) SWAP 776/10240MB (cached 44MB) cpu [14%,10%,1%,2%]@102
RAM 817/3983MB (lfb 422x4MB) SWAP 776/10240MB (cached 44MB) cpu [7%,0%,0%,11%]@102
RAM 817/3983MB (lfb 422x4MB) SWAP 776/10240MB (cached 44MB) cpu [11%,2%,12%,17%]@102
RAM 818/3983MB (lfb 422x4MB) SWAP 776/10240MB (cached 45MB) cpu [10%,5%,8%,3%]@204
RAM 818/3983MB (lfb 422x4MB) SWAP 776/10240MB (cached 45MB) cpu [5%,5%,1%,8%]@102
RAM 818/3983MB (lfb 422x4MB) SWAP 776/10240MB (cached 45MB) cpu [7%,4%,0%,12%]@102
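
In case it helps anyone reading the memory numbers above, here is a minimal Python sketch that pulls the RAM and swap figures out of lines in this format. The regex only assumes the layout seen in the output above (it may not match other tegrastats versions), and `parse_stats_line` and `peak_ram` are hypothetical helper names, not part of any NVIDIA tool.

```python
import re

# Assumes the line layout seen in the output above:
# "RAM used/totalMB (lfb ...) SWAP used/totalMB (cached ...) cpu [...]@freq"
LINE_RE = re.compile(
    r"RAM (?P<ram_used>\d+)/(?P<ram_total>\d+)MB.*?"
    r"SWAP (?P<swap_used>\d+)/(?P<swap_total>\d+)MB"
)


def parse_stats_line(line):
    """Return RAM/SWAP figures in MB as a dict, or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {key: int(value) for key, value in m.groupdict().items()}


def peak_ram(lines):
    """Return the highest RAM-used value (MB) among the lines that parse, or None."""
    parsed = [p for p in map(parse_stats_line, lines) if p is not None]
    return max(p["ram_used"] for p in parsed) if parsed else None


# Two lines copied from the output above.
sample = [
    "RAM 3594/3983MB (lfb 5x4MB) SWAP 1222/10240MB (cached 87MB) cpu [28%,13%,19%,3%]@1224",
    "RAM 817/3983MB (lfb 422x4MB) SWAP 776/10240MB (cached 44MB) cpu [67%,27%,41%,3%]@816",
]

for line in sample:
    print(parse_stats_line(line))
```

Run over the lines above, the first sample shows 3594 MB of 3983 MB RAM in use, and the later one shows 817 MB.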