Hi,

I am moving a GraphDef that includes a model trained on desktop/server-class GPUs to the Jetson TX2. The model output on the TX2 is very bad, so I traced through the layers until I found the first output that differs from a GTX 1080, which comes after one of my convolution layers. Drilling down, I found that the BatchToSpaceND operation is not producing correct output: there is no zero-padding on the dimensions that I expect to be zero-padded, and none of the input tensor values seem to be preserved by the reshape.

Searching turned up https://devtalk.nvidia.com/default/topic/1036144/jetson-tx2/tensorflow-operation-tf-batch_to_space_nd-function-not-working-as-expected-on-jetson-tx2/ and I ran a similar test. Rather than random data I fill the input with ones, so that I can tell when corruption is occurring. Here is the result of a run with the GPU:

```
In [3]: import os
...: os.environ['CUDA_VISIBLE_DEVICES'] = '0'
...: import tensorflow as tf
...: import numpy as np
...: mat=np.ones((1,65,65, 543))
...: in1=tf.constant(mat,tf.float32)
...: block_shape=tf.constant([2,2],tf.int32)
...: paddings=tf.constant([[2,3],[2,3]],tf.int32)
...: op=tf.space_to_batch_nd(in1,block_shape,paddings)
...: print(in1)
...: print(op)
...: with tf.Session() as sess:
...:     out=sess.run(op)
...:     print('sum of elements in out:',np.sum(out))
...:
Tensor("Const_3:0", shape=(1, 65, 65, 543), dtype=float32)
Tensor("SpaceToBatchND_1:0", shape=(4, 35, 35, 543), dtype=float32)
2018-07-26 21:39:28.232793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-07-26 21:39:28.232885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-26 21:39:28.232915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-07-26 21:39:28.232937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-07-26 21:39:28.233029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 973 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
('sum of elements in out:', 0.0)
```

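As you can see, the sum is 0, which is incorrect: the input holds 65 × 65 × 543 = 2294175 ones and the padding only adds zeros, so the sum should be 2294175. If I restart Python and hide the GPU, I get the correct result: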

```
In [1]: import os
...: os.environ['CUDA_VISIBLE_DEVICES'] = ''
...: import tensorflow as tf
...: import numpy as np
...: mat=np.ones((1,65,65, 543))
...: in1=tf.constant(mat,tf.float32)
...: block_shape=tf.constant([2,2],tf.int32)
...: paddings=tf.constant([[2,3],[2,3]],tf.int32)
...: op=tf.space_to_batch_nd(in1,block_shape,paddings)
...: print(in1)
...: print(op)
...: with tf.Session() as sess:
...:     out=sess.run(op)
...:     print('sum of elements in out:',np.sum(out))
...:
Tensor("Const:0", shape=(1, 65, 65, 543), dtype=float32)
Tensor("SpaceToBatchND:0", shape=(4, 35, 35, 543), dtype=float32)
2018-07-26 21:38:02.678283: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-07-26 21:38:02.678361: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:145] kernel driver does not appear to be running on this host (tegra-ubuntu): /proc/driver/nvidia/version does not exist
('sum of elements in out:', 2294175.0)
```
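To double-check the CPU number independently of TensorFlow, here is a minimal NumPy sketch of what I understand SpaceToBatchND to do for a 4-D NHWC input with two spatial block dimensions (zero-pad, reshape, permute, reshape). The function name `space_to_batch_nd_ref` is my own, not a TensorFlow API, and I have only checked this against the test case in this post.

```python
import numpy as np

def space_to_batch_nd_ref(x, block_shape, paddings):
    """Hypothetical NumPy reference for SpaceToBatchND (NHWC, 2 spatial dims)."""
    n, h, w, c = x.shape
    bh, bw = block_shape
    (pad_top, pad_bottom), (pad_left, pad_right) = paddings
    # Step 1: zero-pad the two spatial dimensions.
    padded = np.pad(
        x,
        [(0, 0), (pad_top, pad_bottom), (pad_left, pad_right), (0, 0)],
        mode="constant",
    )
    ph, pw = padded.shape[1], padded.shape[2]
    assert ph % bh == 0 and pw % bw == 0, "padded size must divide by block size"
    # Step 2: split each spatial dimension into (blocks, offset within block).
    reshaped = padded.reshape(n, ph // bh, bh, pw // bw, bw, c)
    # Step 3: move the within-block offsets in front of the batch dimension.
    permuted = reshaped.transpose(2, 4, 0, 1, 3, 5)
    # Step 4: fold the offsets into the batch dimension.
    return permuted.reshape(n * bh * bw, ph // bh, pw // bw, c)

# Same input as the test in this post: all ones, so the sum counts the
# surviving input values and every padded position contributes zero.
x = np.ones((1, 65, 65, 543), dtype=np.float32)
out = space_to_batch_nd_ref(x, (2, 2), ((2, 3), (2, 3)))
print(out.shape)                  # (4, 35, 35, 543)
print(out.sum(dtype=np.float64))  # 2294175.0
```

This agrees with the CPU run above, both in shape and in sum, so the CPU result is the one to trust.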

I have noticed that a last-dimension size of 543 seems to be the tipping point. At 543 and above I always get corruption; below 543 the output seems OK. If I use more than 543 but swap the dimension ordering, I also seem to get sane results. On a first run I get a sum of zero, but if I've been doing a lot of computation beforehand (for example, if I load my protobuf file first), I get seemingly random numbers.
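In case it helps anyone reproduce the threshold, this is roughly how the sweep could be scripted. It is a sketch for the TF 1.x API, with some assumptions of mine: the helper names (`expected_sum`, `run_on`, `sweep`) and the list of sizes are made up for illustration, and unlike my test above it feeds the input through a placeholder so that the op has to run on the requested device at session time rather than being folded as a constant. Both devices must be visible, so do not hide the GPU with `CUDA_VISIBLE_DEVICES`.

```python
import numpy as np

def expected_sum(channels, size=65):
    # An all-ones input of shape (1, size, size, channels) sums to this,
    # and zero-padding must not change the sum.
    return float(size * size * channels)

def run_on(device, channels, size=65):
    # Imported lazily so the helpers above can be used without TensorFlow.
    import tensorflow as tf  # assumes the TF 1.x graph/session API
    tf.reset_default_graph()
    block_shape = tf.constant([2, 2], tf.int32)
    paddings = tf.constant([[2, 3], [2, 3]], tf.int32)
    with tf.device(device):
        in1 = tf.placeholder(tf.float32, shape=(1, size, size, channels))
        op = tf.space_to_batch_nd(in1, block_shape, paddings)
    data = np.ones((1, size, size, channels), dtype=np.float32)
    with tf.Session() as sess:
        out = sess.run(op, feed_dict={in1: data})
    return float(np.sum(out, dtype=np.float64))

def sweep(channel_counts=(256, 512, 542, 543, 544, 600)):
    # Compare the same op on CPU and GPU for each last-dimension size.
    for channels in channel_counts:
        want = expected_sum(channels)
        cpu = run_on("/cpu:0", channels)
        gpu = run_on("/gpu:0", channels)
        status = "OK" if gpu == want else "MISMATCH"
        print(channels, "expected", want, "cpu", cpu, "gpu", gpu, status)
```

Calling `sweep()` on the TX2 prints one line per size. I would treat any `MISMATCH` line as a lead rather than proof, since I have not run this exact script.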

I was originally using Pete Lee's TensorFlow build, but I have just reproduced the same results with the NVIDIA-released TensorFlow 1.8 from https://devtalk.nvidia.com/default/topic/1031300/jetson-tx2/tensorflow-1-9-rc-wheel-with-jetpack-3-2-/.