If the Xavier has 64 Tensor cores in addition to the 512 CUDA cores, can we use the Tensor cores to supplement the processing power of the CUDA cores? Like launching a kernel to use both sets of cores to maximize processing power. So far I have only seen support for the Tensor cores with cuDNN and GEMMs in cuBLAS, but it seems wasteful to have another set of cores which go unused most of the time.
CUDA and Tensor cores have different capabilities and specializations - see https://www.nvidia.com/en-us/data-center/tensorcore/ for more details, and/or google “cuda vs tensor core”.
Thanks for the reply dkreutz,
What I take from this is that the Tensor cores are used exclusively to compute matrix multiplications, which explains their limited use in cuBLAS/cuDNN. Would this assumption be correct?
Yes, Tensor cores do matrix operations only, but in a highly efficient way.
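As an illustration, here is a minimal sketch of how Tensor cores get used through cuBLAS: an FP16 GEMM with FP32 accumulation via cublasGemmEx, with the math mode set so cuBLAS may pick a Tensor Core (HMMA) path. This assumes the device buffers are already allocated and filled; error checking is omitted for brevity.

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch: C = A * B with FP16 inputs and FP32 accumulation,
// routed to Tensor Cores where cuBLAS supports it.
// d_A, d_B, d_C are assumed to be device buffers of size n*n.
void tensor_core_gemm(const __half* d_A, const __half* d_B, float* d_C, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Allow cuBLAS to select Tensor Core kernels for this handle.
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 n, n, n,
                 &alpha,
                 d_A, CUDA_R_16F, n,   // A: FP16
                 d_B, CUDA_R_16F, n,   // B: FP16
                 &beta,
                 d_C, CUDA_R_32F, n,   // C: FP32 accumulate/output
                 CUDA_R_32F,
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);

    cublasDestroy(handle);
}
```

Note that cuBLAS only guarantees Tensor Core usage when the data types, dimensions, and leading dimensions meet the alignment requirements; otherwise it silently falls back to CUDA-core kernels.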
I guess the OP’s question still stands, no? If we have to do matrix multiplication for cuDNN, and it runs fastest and most efficiently on the Tensor cores, why not give, say, 20% of the work to the regular CUDA cores so they run in parallel with the Tensor cores?
May I know your use case?
In most of our SDKs, the CUDA cores are used together with the Tensor cores.
For example, TensorRT will choose which resources to use for the best performance.
To check this, you can profile the application with nvprof:
tensor_precision_fu_utilization: The utilization level of the multiprocessor function units that execute Tensor Core instructions, on a scale of 0 to 10 (HMMA).
tensor_int_fu_utilization: The utilization level of the multiprocessor function units that execute Tensor Core int8 instructions, on a scale of 0 to 10. This metric is only available for devices with compute capability 7.2 (IMMA).
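A hedged example of how those metrics can be collected with nvprof (the application name my_app is a placeholder for your own binary):

```shell
# Report Tensor Core utilization per kernel (0 = unused, 10 = fully utilized)
nvprof --metrics tensor_precision_fu_utilization,tensor_int_fu_utilization ./my_app
```

A kernel showing 0 on both metrics ran entirely on the CUDA cores; nonzero values indicate the Tensor Core function units were exercised.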
I guess the question is: do they run at the same time, splitting the work between them (Tensor and CUDA cores) for maximum performance, or is it only one or the other at any one time?
I’ll check the tensor_xxx params as you’ve suggested, thanks!
This depends on the implementation.
If you trigger both function calls in cuDNN, they can run at the same time.
There is no limitation on launching Tensor Core and CUDA core work concurrently.
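To make the concurrency point concrete, here is a hedged sketch that enqueues an ordinary CUDA-core kernel and a Tensor-Core GEMM (via cuBLAS) on separate streams, so the hardware is free to overlap them. Buffer allocation and error checking are omitted; whether the two actually overlap in practice depends on occupancy and should be verified with a profiler.

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Plain CUDA-core kernel: elementwise scaling.
__global__ void scale(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

// Sketch: CUDA-core work on stream s0, Tensor-Core GEMM on stream s1.
// All device buffers are assumed allocated and initialized.
void run_concurrently(float* d_x, int nx,
                      const __half* d_A, const __half* d_B, float* d_C, int n) {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Stream 0: ordinary CUDA-core kernel.
    scale<<<(nx + 255) / 256, 256, 0, s0>>>(d_x, 2.0f, nx);

    // Stream 1: FP16 GEMM routed to Tensor Cores via cuBLAS.
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetStream(handle, s1);
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha, d_A, CUDA_R_16F, n, d_B, CUDA_R_16F, n,
                 &beta, d_C, CUDA_R_32F, n, CUDA_R_32F,
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);

    // Both streams may execute concurrently; wait for both to finish.
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cublasDestroy(handle);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```

Since the Tensor Cores and the regular FP/INT units live in the same SMs, overlap here means interleaved issue within the multiprocessors rather than two fully independent pools of compute.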