I have a question about concurrent execution in CUDA.
I'm evaluating the performance of INT8 GEMM on a Volta V100.
As I understand the programming model, CUDA cores and tensor cores cannot be used at the same time.
So I created two threads on the CPU and used cuBLAS to run a GEMM on the CUDA cores from one thread and on the tensor cores from the other.
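For reference, a two-thread setup like the one described above might look roughly like the sketch below. It is not the exact code from my test. It assumes CUDA 11 or newer (for `CUBLAS_COMPUTE_32I` and `CUBLAS_PEDANTIC_MATH`), a single GPU, and a smaller matrix (n = 4096) so it fits comfortably in memory. The name `run_gemm`, the choice to give each thread its own full GEMM, and the use of the cuBLAS math mode to ask for or against tensor-core kernels are all assumptions for illustration. Whether cuBLAS really dispatches an int8 GEMM to tensor cores depends on the GPU architecture and the cuBLAS version, so that part should be checked in a profiler.

```cuda
// Sketch only. Assumed build line: nvcc -std=c++14 two_gemm.cu -lcublas -o two_gemm
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <thread>

#define CHECK_CUDA(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    std::fprintf(stderr, "CUDA error: %s (line %d)\n", cudaGetErrorString(e_), __LINE__); \
    std::exit(1); } } while (0)
#define CHECK_CUBLAS(call) do { cublasStatus_t s_ = (call); if (s_ != CUBLAS_STATUS_SUCCESS) { \
    std::fprintf(stderr, "cuBLAS error: %d (line %d)\n", (int)s_, __LINE__); \
    std::exit(1); } } while (0)

// Hypothetical helper: each host thread owns its own stream and cuBLAS handle
// and issues one n x n int8 GEMM with int32 accumulation.
void run_gemm(int n, cublasMath_t math_mode, const char* label) {
    CHECK_CUDA(cudaSetDevice(0));

    cudaStream_t stream;
    CHECK_CUDA(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));
    cublasHandle_t handle;
    CHECK_CUBLAS(cublasCreate(&handle));
    CHECK_CUBLAS(cublasSetStream(handle, stream));
    // Math mode is only a request to cuBLAS; kernel selection is up to the library.
    CHECK_CUBLAS(cublasSetMathMode(handle, math_mode));

    const size_t elems = (size_t)n * n;
    int8_t *A = nullptr, *B = nullptr;
    int32_t *C = nullptr;
    CHECK_CUDA(cudaMalloc(&A, elems));
    CHECK_CUDA(cudaMalloc(&B, elems));
    CHECK_CUDA(cudaMalloc(&C, elems * sizeof(int32_t)));
    CHECK_CUDA(cudaMemset(A, 1, elems));  // every int8 element becomes 1
    CHECK_CUDA(cudaMemset(B, 1, elems));

    cudaEvent_t start, stop;
    CHECK_CUDA(cudaEventCreate(&start));
    CHECK_CUDA(cudaEventCreate(&stop));

    const int32_t alpha = 1, beta = 0;  // int32 scalars for the 32I compute type
    CHECK_CUDA(cudaEventRecord(start, stream));
    // C = A^T * B; the transposed-A layout is chosen here as an assumption,
    // since int8 paths can have layout and alignment restrictions.
    CHECK_CUBLAS(cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N, n, n, n,
                              &alpha, A, CUDA_R_8I, n, B, CUDA_R_8I, n,
                              &beta, C, CUDA_R_32I, n,
                              CUBLAS_COMPUTE_32I, CUBLAS_GEMM_DEFAULT));
    CHECK_CUDA(cudaEventRecord(stop, stream));
    CHECK_CUDA(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK_CUDA(cudaEventElapsedTime(&ms, start, stop));
    std::printf("%s: %.3f ms\n", label, ms);

    CHECK_CUDA(cudaEventDestroy(start));
    CHECK_CUDA(cudaEventDestroy(stop));
    CHECK_CUDA(cudaFree(A));
    CHECK_CUDA(cudaFree(B));
    CHECK_CUDA(cudaFree(C));
    CHECK_CUBLAS(cublasDestroy(handle));
    CHECK_CUDA(cudaStreamDestroy(stream));
}

int main() {
    const int n = 4096;  // assumed size for the sketch, not the 32k case
    // One thread asks cuBLAS to avoid tensor-core kernels, the other allows them.
    std::thread cuda_core_thread(run_gemm, n, CUBLAS_PEDANTIC_MATH, "pedantic math");
    std::thread tensor_thread(run_gemm, n, CUBLAS_DEFAULT_MATH, "default math");
    cuda_core_thread.join();
    tensor_thread.join();
    return 0;
}
```

The per-thread times printed here do not show by themselves whether the two GEMMs overlapped on the GPU. A timeline from a profiler such as Nsight Systems would be needed to see whether the kernels from the two streams ran side by side or one after the other.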
Although the run-to-run variation in performance is large, on average the combination of CUDA cores and tensor cores gives the fastest results.
It is about 15% faster than using the tensor cores alone at matrix size 32k (int8).
Is it actually possible to use both kinds of cores at the same time?