1660 Ti slower than 1050 (not Ti) in tf.float16 matrix multiplication

CUDA 10.1, cuDNN 7.6.5, TensorFlow 2.2.0

Below is a simple script that compares matrix multiplication as performed on the CPU and on the GPU. I ran it on a 1050 4 GB (not Ti) and on a 1660 Ti, and got strange results. When data_type is set to tf.float32, the 1660 Ti is predictably much faster than the 1050. However, when it is set to tf.float16, the 1050 is more than twice as fast as the 1660 Ti. Why is this? Does it have something to do with the 1660 Ti reporting its compute capability as 7.5 when that is not really the case (it has dedicated FP16 cores instead of Tensor Cores)?
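
As a side note, this is how I check what compute capability TensorFlow reports for each card (device_lib is an internal TensorFlow module, so treat this as a quick check rather than a public API):

from tensorflow.python.client import device_lib

# Print every local device TensorFlow sees; for GPUs, physical_device_desc
# contains a "compute capability: X.Y" entry.
for dev in device_lib.list_local_devices():
    print(dev.name, dev.physical_device_desc)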

This is the code:

import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
import time

def get_times(maximum_time):
    # maximum_time is currently unused; kept from the original signature
    # device_times collects the measured time per device for each matrix size
    device_times = {
        "numpy": [],
        "/gpu:0": []
    }
    device_times_keys = ["/gpu:0"]

    matrix_sizes = [10, 100, 200, 500, 1000, 2000, 3000, 5000, 7000, 10000]

    for size in matrix_sizes:
        print(size)
        shape = (size, size)
        data_type = tf.float16                                             # <-- This is the one I am talking about!
        r1np = np.random.uniform(low=0, high=1, size=shape)
        r2np = np.random.uniform(low=0, high=1, size=shape)

        for device_name in device_times_keys:
            r1 = tf.convert_to_tensor(r1np, dtype=data_type)
            r2 = tf.convert_to_tensor(r2np, dtype=data_type)
            print("####### Calculating on the " + device_name + " #######")

            with tf.device(device_name):
                start_time = time.time()
                dot_operation = tf.matmul(r2, r1)

                time_taken = time.time() - start_time
                print(time_taken)
                device_times[device_name].append(time_taken)

        print("####### Calculating with numpy #######")
        start_time = time.time()
        dot_res = np.matmul(r1np, r2np)
        time_taken = time.time() - start_time
        print(time_taken)
        device_times["numpy"].append(time_taken)

    return device_times, matrix_sizes

def main():
    device_times, matrix_sizes = get_times(60)
    print(device_times)
    numpy_times = device_times["numpy"][1:]
    gpu_times = device_times["/gpu:0"][1:]

    plt.plot(matrix_sizes[1:], gpu_times, 's-.')
    plt.plot(matrix_sizes[1:], numpy_times, 'r-')
    plt.ylabel('Time, sec')
    plt.xlabel('Matrix size')
    plt.show()

    return


if __name__ == "__main__":
    main()
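
In case it helps to reproduce the effect without the plotting part, here is a minimal sketch (my own helper, not part of the script above; the name time_matmul and the size/repeat values are arbitrary) that times just the matmul for one dtype. It does a warm-up run and calls .numpy() on the result, so under eager execution the measurement waits for the GPU to actually finish:

def time_matmul(size, dtype, device="/gpu:0", repeats=10):
    # Build the operands on the target device in the requested dtype.
    with tf.device(device):
        a = tf.cast(tf.random.uniform((size, size)), dtype)
        b = tf.cast(tf.random.uniform((size, size)), dtype)
        tf.matmul(a, b).numpy()          # warm-up run (kernel selection, library init)
        start = time.time()
        for _ in range(repeats):
            c = tf.matmul(a, b)
        c.numpy()                        # copy the last result to the host, which waits for the GPU
        return (time.time() - start) / repeats

for dt in (tf.float32, tf.float16):
    print(dt, time_matmul(5000, dt))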