We are attempting to run inference on a custom TensorFlow network on our TX2s. Our environment:
- TX2 with JetPack 28.1
- CUDA 8.0
- cuDNN 6.0
- TensorFlow for Python from: https://github.com/jetsonhacks/installTensorFlowTX2
All tests are run after the following commands:
sudo nvpmodel -m 0
sudo ./jetson_clocks.sh
Initially I was just running on the GPU, but after noticing some strange runtimes I forced the network onto the CPU and got the following timings for input images of size 512x384, 1024x768, and 2048x1536 (a simplified sketch of how the runs were pinned and timed follows below):
CPU Timings: 1.68s, 5.34s, 20.09s
GPU Timings: 21.81s, 27.64s, 47.82s
CPU timings are more or less as expected: each larger image has 4x as many pixels, and the runtime scales accordingly. The GPU timings would scale similarly if we ignored an initial ~20 second offset.
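Roughly, the runs were pinned and timed along these lines (simplified; frozen_graph.pb, input:0, and output:0 are placeholders for our actual network and its tensors):

```python
# Simplified timing sketch (TF 1.x). The graph file and tensor names are
# placeholders for our actual network; tf.device only pins ops that do not
# already carry a device assignment in the frozen graph.
import time
import numpy as np
import tensorflow as tf

def time_inference(device, image, pb_path="frozen_graph.pb"):
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(pb_path, "rb") as f:
        graph_def.ParseFromString(f.read())

    graph = tf.Graph()
    with graph.as_default(), tf.device(device):
        # Import inside the device scope so the ops are placed on that device.
        tf.import_graph_def(graph_def, name="")

    inp = graph.get_tensor_by_name("input:0")
    out = graph.get_tensor_by_name("output:0")

    # Soft placement lets ops without a kernel on the requested device fall back.
    config = tf.ConfigProto(allow_soft_placement=True)
    with tf.Session(graph=graph, config=config) as sess:
        start = time.time()
        sess.run(out, feed_dict={inp: image})
        return time.time() - start

image = np.zeros((1, 384, 512, 3), dtype=np.float32)  # 512x384 test input
print("CPU: %.2fs" % time_inference("/cpu:0", image))
print("GPU: %.2fs" % time_inference("/gpu:0", image))
```

(An alternative way to force CPU-only execution is to hide the GPU from TensorFlow entirely, e.g. by setting CUDA_VISIBLE_DEVICES to an empty string in the environment.)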
You can find an nvvp trace file (and screenshot) here:
From the screenshot you can see immediately that the first ~20 seconds are occupied by a call to cudaStreamCreateWithFlags.
So the question(s): why does this initial call take so long? And even if it were instantaneous, the GPU is still at least marginally slower than the CPU; is this expected?
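The GPU numbers above appear to include this one-time cost; a rough way to separate it from steady-state inference is to time the first run and subsequent runs within the same session (same hypothetical names as in the earlier sketch):

```python
# Rough check (TF 1.x): time the first sess.run, which pays for CUDA context
# and stream creation, separately from later runs in the same session.
# frozen_graph.pb and the tensor names are placeholders, as above.
import time
import numpy as np
import tensorflow as tf

graph_def = tf.GraphDef()
with tf.gfile.GFile("frozen_graph.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name="")

inp = graph.get_tensor_by_name("input:0")
out = graph.get_tensor_by_name("output:0")
image = np.zeros((1, 384, 512, 3), dtype=np.float32)

with tf.Session(graph=graph) as sess:
    t0 = time.time()
    sess.run(out, feed_dict={inp: image})   # first run: includes initialization
    t1 = time.time()
    for _ in range(5):                      # steady-state runs
        sess.run(out, feed_dict={inp: image})
    t2 = time.time()

print("first run: %.2fs, steady state: %.2fs/run" % (t1 - t0, (t2 - t1) / 5))
```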
One more note: I have tried TensorRT to optimize the GPU runtime, but a number of the layer types we use are not yet supported. At this stage we are just looking to get the best possible runtime out of plain TensorFlow.