I am working on a TensorFlow model that uses tf.space_to_batch_nd(), the counterpart of tf.batch_to_space_nd(). The op works correctly on CPU and on an NVIDIA GeForce GPU, but it fails when run on a Jetson TX2.

Looking into the operation, I found that it produces correct output only up to a certain input size; beyond that, the result is a tensor of zeros.

https://www.tensorflow.org/api_docs/python/tf/batch_to_space_nd

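```
import numpy as np
import tensorflow as tf  # TF 1.x API (tf.Session)

mat = np.random.rand(1, 65, 65, 728)
# `in` is a reserved word in Python, so the input tensor is named `inp`
inp = tf.constant(mat, tf.float32)
block_shape = tf.constant([2, 2], tf.int32)
paddings = tf.constant([[2, 3], [2, 3]], tf.int32)
op = tf.space_to_batch_nd(inp, block_shape, paddings)
print('in', inp)
print('op', op)
with tf.Session() as sess:
    out = sess.run(op)
print('sum of elements in out:', np.sum(out))
```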

The output obtained from the above code is as follows:

('in', <tf.Tensor 'Const:0' shape=(1,65,65,728) dtype=float32>)

('op', <tf.Tensor 'SpaceToBatchND:0' shape=(4,35,35,728) dtype=float32>)

('sum of elements in out:', 0.0)
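
For reference, the reported output shape (4,35,35,728) is what the padding and block size should give: each spatial dimension is padded from 65 to 65+2+3=70 and then divided by the block size 2, giving 35, while the batch dimension is multiplied by 2*2=4. A minimal sketch of that arithmetic in plain Python follows. The helper name `space_to_batch_shape` is made up for illustration and is not part of TensorFlow, and it assumes the spatial dimensions come right after the batch dimension:

```python
def space_to_batch_shape(input_shape, block_shape, paddings):
    """Static output shape of a space-to-batch op (hypothetical helper)."""
    batch = input_shape[0]
    num_spatial = len(block_shape)
    spatial = input_shape[1:1 + num_spatial]
    remaining = input_shape[1 + num_spatial:]

    out_spatial = []
    for size, block, (pad_before, pad_after) in zip(spatial, block_shape, paddings):
        padded = size + pad_before + pad_after
        if padded % block != 0:
            raise ValueError(
                "padded size %d is not divisible by block size %d" % (padded, block))
        out_spatial.append(padded // block)  # each block becomes one output cell
        batch *= block                       # block offsets move into the batch dim

    return (batch,) + tuple(out_spatial) + tuple(remaining)


print(space_to_batch_shape((1, 65, 65, 728), (2, 2), [(2, 3), (2, 3)]))
```

So the shape inference is fine on the TX2; only the values computed at run time are wrong.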

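```
import numpy as np
import tensorflow as tf  # TF 1.x API (tf.Session)

c = 542  # number of channels
mat = np.random.rand(1, 65, 65, c)
# `in` is a reserved word in Python, so the input tensor is named `inp`
inp = tf.constant(mat, tf.float32)
block_shape = tf.constant([2, 2], tf.int32)
paddings = tf.constant([[2, 3], [2, 3]], tf.int32)
op = tf.space_to_batch_nd(inp, block_shape, paddings)
print('in', inp)
print('op', op)
with tf.Session() as sess:
    out = sess.run(op)
print('sum of elements in out:', np.sum(out))
```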

The output obtained from the above code is as follows:

('in', <tf.Tensor 'Const:0' shape=(1,65,65,542) dtype=float32>)

('op', <tf.Tensor 'SpaceToBatchND:0' shape=(4,35,35,542) dtype=float32>)

('sum of elements in out:', 326.3739)

When running on the Jetson TX2 GPU, I observed that the operation works correctly and produces a non-zero output as long as the number of channels c <= 542. For c > 542, it returns an all-zero tensor of shape (4,35,35,c).

Running the same code on CPU gives the correct output regardless of the number of channels.
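
One way to check the TX2 result without depending on a CPU build of TensorFlow would be to compare it against a plain NumPy reference. The sketch below is my own reading of the documented pad, reshape, permute, reshape steps for a 4-D input with two spatial dimensions. The function name `space_to_batch_nd_ref` is hypothetical, and the code is not taken from the TensorFlow sources:

```python
import numpy as np


def space_to_batch_nd_ref(x, block_shape, paddings):
    """NumPy reference for space-to-batch on a [batch, H, W, C] array (sketch)."""
    n, h, w, c = x.shape
    bh, bw = block_shape
    (pad_top, pad_bottom), (pad_left, pad_right) = paddings

    # Zero-pad only the two spatial dimensions.
    padded = np.pad(
        x,
        ((0, 0), (pad_top, pad_bottom), (pad_left, pad_right), (0, 0)),
        mode="constant",
    )
    ph, pw = padded.shape[1], padded.shape[2]
    if ph % bh != 0 or pw % bw != 0:
        raise ValueError("padded spatial dims must be divisible by block_shape")

    # Split each spatial dim into (cells, offset within block) ...
    reshaped = padded.reshape(n, ph // bh, bh, pw // bw, bw, c)
    # ... move the block offsets in front of the batch dim ...
    permuted = reshaped.transpose(2, 4, 0, 1, 3, 5)
    # ... and fold them into the batch dim.
    return permuted.reshape(n * bh * bw, ph // bh, pw // bw, c)


mat = np.random.rand(1, 65, 65, 728).astype(np.float32)
ref = space_to_batch_nd_ref(mat, (2, 2), ((2, 3), (2, 3)))
print(ref.shape)  # (4, 35, 35, 728)
```

With this, `np.allclose(out, ref)` should hold for the `out` returned by `sess.run(op)` on any device, provided the same `mat` is fed to both. Since the padding only adds zeros, `np.sum(out)` should also match `np.sum(mat)` up to float32 rounding.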

I want to get this op working on TX2 for input size (1,65,65,728).

Any input on what might be causing this issue, or a possible fix, would be a great help.