TensorFlow inference result is wrong

xuefengxiaoyang · February 21, 2019, 8:51am

Hi dusty_nv,
I trained a model on a server with TensorFlow. Running inference on the TX2 GPU gives a wrong result, but running it on the TX2 CPU gives the correct result. My TensorFlow version is 1.8, from Box, using CUDA 9.0 and cuDNN 7.1.5 from JetPack 3.3. Can you help me solve this problem? Thank you!

AastaLLL · February 22, 2019, 6:19am

Hi,

Are you running inference with the same TensorFlow version that was used for training?

We have found that some implementations (the GPU implementation, in your case) differ between versions:
https://devtalk.nvidia.com/default/topic/1047511/jetson-tx2/wrong-inference-result-on-the-tx2-/
Could you first check whether this issue still occurs when the inference and training TensorFlow versions match?

Thanks.

xuefengxiaoyang · February 25, 2019, 8:13am

Hi

Following your suggestion, I trained the model with TensorFlow 1.11.0, and the version on the TX2 is also 1.11.0, from Box. Inference on the TX2 GPU still gives the wrong result, but when I pin the device with tf.device('/cpu:0'), inference on the TX2 CPU gives the correct result.
Server environment used for training:
Ubuntu 16.04, CUDA 9.0, cuDNN 7.3.0, tensorflow-gpu==1.11.0
TX2 environment:
Ubuntu 16.04, CUDA 9.0 and cuDNN 7.1.5 from JetPack 3.3, tensorflow==1.11.0 from Box.

My image preprocessing code is below. I visualized the preprocessed image and it looks correct, which is consistent with the TX2 CPU producing the correct result. Only the GPU gives the wrong result, which is very strange.
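import cv2
import numpy as np

# Load the image unchanged (-1 keeps the original channels and bit depth).
image = cv2.imread(alltestPath, -1)
save_image = cv2.imread(alltestPath, -1)
image = cv2.resize(image, (self.imageWidth, self.imageHight))
image0 = image
save_image = cv2.resize(save_image, (self.imageWidth, self.imageHight))
h, w, c = save_image.shape
# Scale pixel values from [0, 255] to [-1, 1].
image = (image * 1.0 / 255) * 2 - 1
# Add a leading batch dimension.
image = np.expand_dims(image, 0)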
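
One way to pin down how far the GPU result drifts from the CPU result is to run the same preprocessed input on both devices and compare the two output arrays numerically. The sketch below uses only NumPy. compare_outputs and its tolerances are hypothetical names and values chosen for illustration, and the demo arrays merely stand in for whatever the model returns under tf.device('/cpu:0') and on the GPU.

```python
import numpy as np


def compare_outputs(cpu_out, gpu_out, rtol=1e-3, atol=1e-4):
    """Summarize how two output arrays from the same model differ."""
    cpu_out = np.asarray(cpu_out, dtype=np.float64)
    gpu_out = np.asarray(gpu_out, dtype=np.float64)
    if cpu_out.shape != gpu_out.shape:
        raise ValueError(
            "shape mismatch: %s vs %s" % (cpu_out.shape, gpu_out.shape))
    abs_diff = np.abs(cpu_out - gpu_out)
    return {
        "max_abs_diff": float(abs_diff.max()),
        "mean_abs_diff": float(abs_diff.mean()),
        # NaNs in the GPU output often point at a kernel-level problem.
        "num_nan_gpu": int(np.isnan(gpu_out).sum()),
        "all_close": bool(np.allclose(cpu_out, gpu_out, rtol=rtol, atol=atol)),
    }


if __name__ == "__main__":
    # Made-up values for illustration only, not real model outputs.
    cpu = np.array([[0.1, 0.7, 0.2]])
    gpu = np.array([[0.1, 0.2, 0.7]])
    print(compare_outputs(cpu, gpu))
```

Small differences near float32 precision are normal between devices; large differences or NaNs suggest a genuine GPU-side fault rather than rounding.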

I need your help. Thank you!

xuefengxiaoyang · March 4, 2019, 7:07am

Hi AastaLLL
Is there any way to solve my problem?
Thank you!

AastaLLL · March 5, 2019, 7:56am

Hi,

Sorry for keeping you waiting.

There is a known issue in TensorFlow on TX2:
https://devtalk.nvidia.com/default/topic/1037898/jetson-tx2/tensorflow-batch_to_space_nd-not-working-for-large-channel-sizes-on-tx2/

If your model contains a batch_to_space_nd() layer, TensorFlow may handle it incorrectly.
To check this, run your application under cuda-memcheck and see whether it reports any errors.

Ex.

$ cuda-memcheck python myApp.py
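
It can also help to confirm whether the graph contains the affected op at all. In a TensorFlow graph, batch_to_space_nd() appears as a node with op type BatchToSpaceND, and in TensorFlow 1.x dilated (atrous) convolutions are commonly implemented as a SpaceToBatchND / BatchToSpaceND pair, so the op may be present even if the model code never calls it directly. The sketch below is duck-typed: it only assumes an object with a .node list whose entries have .name and .op, which is the shape of a TensorFlow 1.x GraphDef (for example, sess.graph.as_graph_def()). find_ops is a hypothetical helper name, and the demo uses stand-in objects rather than a real model.

```python
from types import SimpleNamespace


def find_ops(graph_def, op_types=("BatchToSpaceND",)):
    """Return the names of nodes whose op type is in op_types."""
    wanted = set(op_types)
    return [node.name for node in graph_def.node if node.op in wanted]


if __name__ == "__main__":
    # Stand-in for a GraphDef; with TensorFlow 1.x one would pass
    # sess.graph.as_graph_def() instead (assumption, not run here).
    fake_graph = SimpleNamespace(node=[
        SimpleNamespace(name="conv1/Conv2D", op="Conv2D"),
        SimpleNamespace(name="aspp/SpaceToBatchND", op="SpaceToBatchND"),
        SimpleNamespace(name="aspp/BatchToSpaceND", op="BatchToSpaceND"),
    ])
    print(find_ops(fake_graph))
```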

Thanks.

xuefengxiaoyang · March 9, 2019, 2:17am

Hi AastaLLL,
Thank you for your help. Following your guidance, I checked my code, and there is a batch_to_space_nd() layer inside my model. Do you have a solution?

AastaLLL · March 11, 2019, 12:33am

Hi,

There are two options for your reference.

1) This issue is fixed on Jetson Xavier.

2) Fall back to CUDA 8.0, which is included in JetPack 3.1.
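
To see which option applies, it helps to know which CUDA release is currently installed on the board. A minimal sketch, assuming the toolkit ships a version.txt file containing a line such as "CUDA Version 9.0.176"; the path /usr/local/cuda/version.txt is the usual location for CUDA 8 and 9 installs, but treat it as an assumption. parse_cuda_version is a hypothetical helper.

```python
import re


def parse_cuda_version(text):
    """Extract (major, minor) from text like 'CUDA Version 9.0.176'."""
    match = re.search(r"CUDA Version (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    path = "/usr/local/cuda/version.txt"  # assumed location
    try:
        with open(path) as handle:
            print(parse_cuda_version(handle.read()))
    except IOError:
        print("could not read " + path)
```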

Thanks.