How to achieve peak tensor core utilization

I am trying to use the tensor cores in this Titan V to speed up my signal / image processing kernels, which are finely tuned around register arithmetic. The CUDA access to the MMA operation seems straightforward enough, and MMA processing seems well suited to many signal processing routines such as convolution. With this generation of Tensor Core I am limited to fp16 inputs for my application, and only a fixed set of matrix dimensions is available for the multiply. What really limits my ability to use the tensor cores is global memory latency when fetching the data for new matrices. If shared memory were larger, I could get more work out of the tensor cores before having to return to global memory.
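For reference, this is the shape I am talking about: one warp computing a single fp16 16x16x16 WMMA tile with fp32 accumulation. It is a bare-bones sketch, not my actual kernel; the pointers and leading dimensions are placeholders.

```
// Minimal per-warp WMMA tile: fp16 inputs, fp32 accumulator.
// Launch with one warp (32 threads); a, b, c point at packed 16x16 tiles.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_tile(const __half* a, const __half* b, float* c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension 16 for a packed tile
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```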
My current register-arithmetic versions of these kernels hide the latency somewhat because they can operate on the data as it streams in. Still, the best I can do with the tensor cores is approach the execution time of the older, register-based code; there is no speedup because of the slow global memory. The profiler spots no inefficiencies in my global memory accesses. Is there a smart way to keep the tensor cores well fed with data? Otherwise I don't see how they are helpful, at least for me on the Volta generation.
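For concreteness, this is the kind of overlap I am trying to get: stage the next tile into shared memory while the current one is consumed by mma_sync. The sketch below is simplified, and the tile sizes, names, and one-warp-per-block layout are placeholder assumptions, not my actual kernel. On Volta I can only stage through shared memory with ordinary loads (there is no asynchronous copy instruction on this architecture), so the double buffer only helps if there is enough compute to cover the global load latency.

```
// Double-buffered staging feeding WMMA: prefetch tile i+1 into shared memory
// while the tensor cores work on tile i. Simplified sketch, one warp per block.
// Launch: wmma_double_buffered<<<1, 32>>>(A, B, C, k_tiles, lda, ldb, ldc);
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

constexpr int M = 16, N = 16, K = 16;   // the fp16 WMMA shape on Volta

__global__ void wmma_double_buffered(const __half* __restrict__ A,
                                     const __half* __restrict__ B,
                                     float* __restrict__ C,
                                     int k_tiles, int lda, int ldb, int ldc)
{
    // Two shared-memory buffers per operand so loads for the next K-tile
    // can be issued while mma_sync consumes the current one.
    __shared__ __half a_buf[2][M * K];
    __shared__ __half b_buf[2][K * N];

    wmma::fragment<wmma::matrix_a, M, N, K, __half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, M, N, K, __half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, M, N, K, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    const int lane = threadIdx.x;        // one warp (32 threads) per block

    // Copy one 16x16 A tile and one 16x16 B tile for K-tile index t
    // into shared buffer 'buf', strided across the warp.
    auto stage = [&](int buf, int t) {
        for (int i = lane; i < M * K; i += 32) {
            int r = i / K, c = i % K;
            a_buf[buf][i] = A[r * lda + t * K + c];
        }
        for (int i = lane; i < K * N; i += 32) {
            int r = i / N, c = i % N;
            b_buf[buf][i] = B[(t * K + r) * ldb + c];
        }
    };

    stage(0, 0);                         // prefetch the first tile
    __syncthreads();

    for (int kt = 0; kt < k_tiles; ++kt) {
        const int cur = kt & 1;
        if (kt + 1 < k_tiles)            // start staging the next tile early
            stage(cur ^ 1, kt + 1);

        wmma::load_matrix_sync(a_frag, a_buf[cur], K);
        wmma::load_matrix_sync(b_frag, b_buf[cur], N);
        wmma::mma_sync(acc, a_frag, b_frag, acc);

        __syncthreads();                 // next buffer must be fully staged
    }

    wmma::store_matrix_sync(C, acc, ldc, wmma::mem_row_major);
}
```

Even with this structure, each fp16 16x16 operand tile is only 512 bytes, so a single mma_sync consumes data far faster than global memory can supply it, which is the crux of my problem.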

Hi,

We recommend that you refer to the following docs, which may help you.

Thank you.