TensorFlow GPU runtime worse than CPU - TX2

We are attempting to run inference on a custom TensorFlow network on our TX2s. Our environment:

All tests are run after the following commands:

sudo nvpmodel -m 0
sudo ./jetson_clocks.sh

Initially I was just running on the GPU, but after noticing some strange runtimes I forced it to run on the CPU and came up with the following timings for input images of size 512x384, 1024x768, and 2048x1536:

CPU Timings: 1.68s, 5.34s, 20.09s
GPU Timings: 21.81s, 27.64s, 47.82s
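
The post does not say how the CPU run was forced. One common approach, shown here only as a sketch, is to hide the CUDA devices through the CUDA_VISIBLE_DEVICES environment variable before TensorFlow is imported. The helper name force_cpu_only is hypothetical, and this has not been verified on a TX2.

```python
import os


def force_cpu_only():
    """Hide all CUDA devices from libraries that initialise CUDA after this call.

    Assumption: the variable must be set before TensorFlow is imported,
    because the CUDA runtime reads it once when it initialises.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


force_cpu_only()
# import tensorflow as tf  # import only after the variable has been set
print("CUDA_VISIBLE_DEVICES =", os.environ["CUDA_VISIBLE_DEVICES"])
```

Running the same script with and without the call gives a CPU-only and a GPU timing for comparison.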

The CPU timings are more or less as expected: each larger image has 4x the number of pixels, and the runtime grows roughly in proportion. The GPU timings would scale similarly if we ignored an initial offset of about 20 seconds.
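
The scaling argument can be checked with a few lines of arithmetic. This sketch uses the timings quoted above and assumes the fixed GPU start-up offset is exactly 20 seconds, which is only an approximation read off the profiler trace.

```python
sizes = [(512, 384), (1024, 768), (2048, 1536)]
cpu = [1.68, 5.34, 20.09]    # seconds, from the post
gpu = [21.81, 27.64, 47.82]  # seconds, from the post
startup = 20.0               # assumed fixed GPU offset (approximate)


def ratios(values):
    """Ratio of each value to the one before it, rounded to 2 decimals."""
    return [round(b / a, 2) for a, b in zip(values, values[1:])]


pixels = [w * h for w, h in sizes]
pixel_ratios = ratios(pixels)
cpu_ratios = ratios(cpu)
gpu_adj = [round(t - startup, 2) for t in gpu]  # GPU time without the offset
gpu_ratios = ratios(gpu_adj)

print("pixel ratios:", pixel_ratios)
print("CPU ratios:  ", cpu_ratios)
print("GPU adjusted:", gpu_adj)
print("GPU ratios:  ", gpu_ratios)
```

Under that assumption the CPU and offset-corrected GPU times both grow by a factor of roughly 3 to 4 per step, and the corrected GPU times still sit slightly above the CPU times at every size.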

You can find an nvvp trace file (and screenshot) here:

Screenshot: https://drive.google.com/open?id=1FVXS_JXEvrYutmHx9uwmogTw1o3y4FeZ
NVVP: https://drive.google.com/open?id=1fELQQNrOsrE9I7AIYN-Z8KkYK76IQZ6R

From the screenshot you can see immediately that the first ~20 seconds are occupied by a call to cudaStreamCreateWithFlags.

So the question(s): why does this initial call take so long? And even if this call were instantaneous, the GPU would still be marginally slower than the CPU. Is this expected?

One more note: I have tried TensorRT to optimize the GPU runtime, but a number of the layer types we use are so far unsupported. At this stage we are just looking to get the best straight TF runtime possible.

Thanks

The TX2 doesn’t appear to perform very well with TensorFlow unless TensorRT is used.
One option is to run TensorFlow on a GPU-powered workstation instead.
Running the NGC TensorFlow Docker image on a Linux x86_64 GPU workstation could possibly give you the performance you need.

Hmm, unfortunate to hear the performance on the TX2 is not great. In this case we are developing an embedded system that requires some AI compute, so a workstation or cloud-based setup isn’t really an option.

Any idea if some of the other AI platforms (Caffe, Torch) have better performance?

Hi,

We have tested TensorFlow on Jetson before.
Usually we see about a 6x speed-up on the GPU for convolution-based networks.

After checking your nvvp file, there are two strange points we want to investigate further:
1. Very long delay in cudaStreamCreateWithFlags()
We are going to try to reproduce this issue on our side. Are you launching any other GPU applications at the same time?

2. Bad performance of the fft2d op
To check whether this is an implementation issue, could you monitor the GPU utilization via tegrastats and share the output with us?

sudo ~/tegrastats

By the way, the shared nvvp file appears to have been generated on the Maxwell architecture, so we suppose it comes from a TX1.
If not, please check whether your TensorFlow package was built for the Pascal architecture, which the TX2 uses.

Thanks.

Hi,

We have tested cudaStreamCreateWithFlags() with JetPack 3.1 and JetPack 3.2 DP. In both cases the call finishes within milliseconds.
We suppose the long delay comes from the TensorFlow implementation.

Thanks.

Thanks for the response. I am using a pre-built .whl file from JetsonHacks:

https://github.com/jetsonhacks/installTensorFlowJetsonTX/blob/master/TX2/tensorflow-1.3.0-cp27-cp27mu-linux_aarch64.whl

I had assumed that since it was labelled as a TX2 version it would be compiled correctly. Are you aware of a pre-built .whl built for the Pascal architecture? If not, I will look into building it myself.

Thanks

Hi ian.bell87,

You can install TensorFlow on the TX2 using the pip wheel provided in the GitHub repository located here

This is built with CUDA (9.0) support for Jetson TX2, and tested to work with a variety of image classification models.

If you still experience issues, perhaps they have something to do with the specific TensorFlow model.

John

Consider http://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html?nvid=nv-int-jnwrtwtttwhjn-33357 and https://github.com/NVIDIA-Jetson/tf_to_trt_image_classification?nvid=nv-int-jnwrtwtttwhjn-33356

Hi All,

Thanks for the suggestions. I grabbed a fresh TX2 and put CUDA 9.0 and cuDNN 7.0 on it with JetPack 3.2. I also installed the .whl file suggested by jaybdub (TF 1.5 w/ Pascal architecture).

The results are significantly better:

Image size: 2048x1536. GPU time (without profiling): 9.1s. CPU time: 18.3s.

I did profile the GPU computation again and was surprised to see about 6 seconds spent between cudaFree and cudaStreamCreateWithFlags prior to any actual GPU computation. You can find the .nvvp file here:

https://drive.google.com/open?id=1TfJbn76FEKn4IivtEXo1Uf-ZrA6OOMwu

Please let me know if there is any additional info I can provide to help on this one.

How do you install the NVIDIA profiler on a Jetson?
References:
http://docs.nvidia.com/embedded/devtools/nsp/3.9/index.html
https://developer.nvidia.com/nvidia-visual-profiler
https://devtalk.nvidia.com/default/topic/973073/profiling-using-command-line-nvprof-in-linux-on-jetson-tx1-and-import-into-visual-profiler-cuda-7-5-in-windows/
https://developer.nvidia.com/embedded/nvidia-system-profiler
http://docs.nvidia.com/cuda/profiler-users-guide/index.html

Hi Andrey,

I am able to run the profiler on my TX2 (see the .nvvp file in my previous post). I open this file on my host Ubuntu machine. What I am after at this point is an understanding of why it takes so long to create the CUDA stream.

Related to previous discussion, I noticed that many of the CUDA compute operations in the .nvvp contain the word maxwell. Does this mean I am still using TF built for Maxwell rather than Pascal?

Thanks

Gotcha, it seems we can use nvprof on the Jetson, or nvvp on the host OS targeting the Jetson over the network.

Hi,

It is abnormal to find a Maxwell flag on a TX2.
We are not sure whether this flag comes from the TensorFlow implementation.
(It may include some optimizations for the Maxwell architecture.)

Also, for a JIT application, the first launch may take some time because the CUDA PTX code is compiled at that point.
Could you run TensorFlow multiple times to check whether the delay comes from JIT compilation?
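
A minimal sketch of such a repeated-run check follows. The function time_runs and the class FakeModel are hypothetical names; FakeModel only simulates a slow first call with sleep() and would be replaced by the real inference call on an already loaded graph.

```python
import time
from timeit import default_timer


def time_runs(run_once, repeats=5):
    """Call run_once() `repeats` times; return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = default_timer()
        run_once()
        durations.append(default_timer() - start)
    return durations


class FakeModel(object):
    """Hypothetical stand-in: slow on the first call, fast afterwards."""

    def __init__(self):
        self.warmed_up = False

    def __call__(self):
        time.sleep(0.01 if self.warmed_up else 0.2)
        self.warmed_up = True


durations = time_runs(FakeModel(), repeats=5)
for i, d in enumerate(durations, start=1):
    print("Inference #%d took %.3fs" % (i, d))
```

If only the first duration is large, the delay is a one-time start-up cost rather than a property of every inference.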

Thanks.

Hi AastaLLL,

I tested multiple inferences without reloading the graph. Happy to report that the timings are very good in this scenario:

Loading graph took: 0.817597150803s.
Inference #1 took 10.2474310398s
Inference #2 took 2.05697894096s
Inference #3 took 2.06152796745s
Inference #4 took 2.05470395088s
Inference #5 took 2.06142282486s
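
To put numbers on the one-time cost, this sketch takes the timings listed above, treats the first inference as the warm-up run, and averages the rest. Attributing the whole difference to start-up work is an assumption.

```python
load_s = 0.817597150803
inference_s = [
    10.2474310398,  # inference #1 (includes one-time start-up work)
    2.05697894096,
    2.06152796745,
    2.05470395088,
    2.06142282486,
]

first, rest = inference_s[0], inference_s[1:]
steady_mean = sum(rest) / len(rest)      # average of inferences #2 to #5
one_time_overhead = first - steady_mean  # extra cost paid only on the first run

print("steady-state mean: %.3fs" % steady_mean)
print("one-time overhead: %.3fs" % one_time_overhead)
```

The steady-state mean is about 2.06s, so roughly 8.2s of the first inference is start-up cost that a long-running process pays only once.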

I had hoped to run this process as a one-off, but in light of these timings I will modify it to keep the graph loaded between inferences.
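
The load-once pattern could look roughly like this. PersistentModel and fake_load are hypothetical names used for illustration; in a TF 1.x program the loaded object would typically wrap a tf.Session that is created once and reused for every run call.

```python
class PersistentModel(object):
    """Load the model lazily on first use, then reuse it for every inference."""

    def __init__(self, load_fn):
        self._load_fn = load_fn  # callable that returns a ready-to-call model
        self._model = None
        self.load_count = 0      # how many times the expensive load happened

    def infer(self, image):
        if self._model is None:
            self._model = self._load_fn()
            self.load_count += 1
        return self._model(image)


def fake_load():
    """Hypothetical loader; a real one would import the graph and open a session."""
    return lambda image: len(image)


model = PersistentModel(fake_load)
results = [model.infer(img) for img in (b"abc", b"abcdef", b"a")]
print(results, "loads:", model.load_count)
```

Every call after the first skips the load step, which is what makes the 2-second steady-state timing reachable in practice.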

Thanks.