cublasGemmEx doesn't work with INT8 using the __dp4a instruction on an NVIDIA GTX 1080 Ti
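
For context, `__dp4a` is the CUDA intrinsic that multiplies four packed 8-bit integers pairwise and adds the four products to a 32-bit accumulator. It needs compute capability 6.1 or higher, which the 1080 Ti has. Below is a minimal host-side C++ sketch of the signed variant's arithmetic. It is only a reference for checking results on the CPU, not the code path cuBLAS takes. `pack4` and `dp4a_ref` are hypothetical helper names, not part of CUDA or cuBLAS, and the sketch assumes the lowest byte holds the first element.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: pack four signed 8-bit values into one 32-bit word,
// lowest byte first (assumed layout for this sketch).
inline std::int32_t pack4(std::int8_t b0, std::int8_t b1,
                          std::int8_t b2, std::int8_t b3) {
    std::uint32_t w =
        (static_cast<std::uint32_t>(static_cast<std::uint8_t>(b0))) |
        (static_cast<std::uint32_t>(static_cast<std::uint8_t>(b1)) << 8) |
        (static_cast<std::uint32_t>(static_cast<std::uint8_t>(b2)) << 16) |
        (static_cast<std::uint32_t>(static_cast<std::uint8_t>(b3)) << 24);
    std::int32_t out;
    std::memcpy(&out, &w, sizeof out);  // bit-exact reinterpretation
    return out;
}

// Hypothetical CPU reference for the signed __dp4a(a, b, c):
// returns c + sum over i of a_i * b_i, where a_i and b_i are the
// signed bytes of a and b. Accumulator overflow is not modelled here.
inline std::int32_t dp4a_ref(std::int32_t a, std::int32_t b, std::int32_t c) {
    const std::uint32_t ua = static_cast<std::uint32_t>(a);
    const std::uint32_t ub = static_cast<std::uint32_t>(b);
    for (int i = 0; i < 4; ++i) {
        int x = static_cast<int>((ua >> (8 * i)) & 0xFFu);
        int y = static_cast<int>((ub >> (8 * i)) & 0xFFu);
        if (x >= 128) x -= 256;  // sign-extend the byte
        if (y >= 128) y -= 256;
        c += x * y;
    }
    return c;
}
```

Comparing a small INT8 GEMM result against a loop built on `dp4a_ref` is one way to tell whether a failure comes from the arithmetic or from the arguments passed to `cublasGemmEx`.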

Large matrices can be multiplied in O(n^2.7) time.
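
As an illustration of sub-cubic multiplication, here is a minimal C++ sketch of Strassen's algorithm, the best-known example. Note that Strassen runs in O(n^log2(7)), roughly O(n^2.81), so it does not itself reach the exponent quoted above; asymptotically faster algorithms exist in theory but are rarely used in practice. The names `Matrix`, `add`, `block`, `naive_mul`, and `strassen` are hypothetical. The sketch assumes square matrices and falls back to the naive product whenever the size is odd, so the sub-cubic bound only holds for power-of-two sizes.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<long long>>;

// Element-wise a + sign * b for equally sized square matrices.
Matrix add(const Matrix& a, const Matrix& b, long long sign = 1) {
    const std::size_t n = a.size();
    Matrix c(n, std::vector<long long>(n, 0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            c[i][j] = a[i][j] + sign * b[i][j];
    return c;
}

// Copy the h x h block of m whose top-left corner is (r, c).
Matrix block(const Matrix& m, std::size_t r, std::size_t c, std::size_t h) {
    Matrix out(h, std::vector<long long>(h, 0));
    for (std::size_t i = 0; i < h; ++i)
        for (std::size_t j = 0; j < h; ++j)
            out[i][j] = m[r + i][c + j];
    return out;
}

// Schoolbook O(n^3) product, used as the base case and as a reference.
Matrix naive_mul(const Matrix& a, const Matrix& b) {
    const std::size_t n = a.size();
    Matrix c(n, std::vector<long long>(n, 0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Strassen's recursion: seven half-size products instead of eight.
Matrix strassen(const Matrix& a, const Matrix& b, std::size_t cutoff = 64) {
    const std::size_t n = a.size();
    if (n <= cutoff || n % 2 != 0) return naive_mul(a, b);
    const std::size_t h = n / 2;
    const Matrix a11 = block(a, 0, 0, h), a12 = block(a, 0, h, h);
    const Matrix a21 = block(a, h, 0, h), a22 = block(a, h, h, h);
    const Matrix b11 = block(b, 0, 0, h), b12 = block(b, 0, h, h);
    const Matrix b21 = block(b, h, 0, h), b22 = block(b, h, h, h);

    const Matrix m1 = strassen(add(a11, a22), add(b11, b22), cutoff);
    const Matrix m2 = strassen(add(a21, a22), b11, cutoff);
    const Matrix m3 = strassen(a11, add(b12, b22, -1), cutoff);
    const Matrix m4 = strassen(a22, add(b21, b11, -1), cutoff);
    const Matrix m5 = strassen(add(a11, a12), b22, cutoff);
    const Matrix m6 = strassen(add(a21, a11, -1), add(b11, b12), cutoff);
    const Matrix m7 = strassen(add(a12, a22, -1), add(b21, b22), cutoff);

    const Matrix c11 = add(add(add(m1, m4), m5, -1), m7);
    const Matrix c12 = add(m3, m5);
    const Matrix c21 = add(m2, m4);
    const Matrix c22 = add(add(add(m1, m2, -1), m3), m6);

    // Stitch the four quadrants back into one n x n matrix.
    Matrix c(n, std::vector<long long>(n, 0));
    for (std::size_t i = 0; i < h; ++i)
        for (std::size_t j = 0; j < h; ++j) {
            c[i][j] = c11[i][j];
            c[i][j + h] = c12[i][j];
            c[i + h][j] = c21[i][j];
            c[i + h][j + h] = c22[i][j];
        }
    return c;
}
```

The `cutoff` parameter exists because the extra additions and copies make the recursion slower than the naive loop on small blocks; the default of 64 is an arbitrary placeholder, not a tuned value.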