matmul with large matrices fails with float16, but succeeds with float32

I posted this bug on the Tensorflow github, but have not gotten any help, which is not unexpected, as it is probably not a Tensorflow bug, but a CUDA one:

I experienced this bug on many different combinations of Tensorflow/CUDA/cuDNN/Python, including the current Tensorflow 2 alpha with python3 and CUDA 10. I am using a 1070.

Essentially, there seems to be an upper limit on the single largest dimension of one of the operands of matrix multiplication, which isn’t unreasonable, but, strangely, this limit is higher for float32 than for float16. The following Tensorflow 2 code:

import numpy as np
import tensorflow as tf

@tf.function
def myMatMul (x,W):
    return tf.matmul (x,W)

W = np.ones ((2,2), dtype=np.float16)
x = np.zeros ((8500000,2), dtype=np.float16)

output = myMatMul (x,W)

crashes with the following error:

Blas GEMM launch failed : a.shape=(8500000, 2), b.shape=(2, 2), m=8500000, n=2, k=2

If you change the large dimension to 8e6, the code runs. But, strangely, if you keep the large dimension at 8.5e6, and change the data types to float32, the code runs as well, which seems like a bug.

Is there anywhere else I should report this bug, or is this forum the appropriate place for it?