How to achieve peak tensor core utilization

I am trying to use the tensor cores in this Titan V to speed up my signal / image processing kernels, which are finely tuned around register arithmetic. The CUDA access to the MMA operation seems straightforward enough, and MMA processing seems well suited to many signal processing routines such as convolution. With this generation of Tensor Core I am limited to fp16 inputs for my application, and only a fixed set of matrix dimensions is available for the multiply. What really limits my ability to use the tensor cores is global memory latency when fetching the data for new matrices. If shared memory were larger, I could get more work out of the tensor cores before having to return to global memory.
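For reference, this is the shape I am talking about: one warp computing a single fp16 16x16x16 WMMA tile with fp32 accumulation. It is a bare-bones sketch, not my actual kernel; the pointers and leading dimensions are placeholders.

```
// Minimal per-warp WMMA tile: fp16 inputs, fp32 accumulator.
// Launch with one warp (32 threads); a, b, c point at packed 16x16 tiles.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_tile(const __half* a, const __half* b, float* c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension 16 for a packed tile
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```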
My current register-arithmetic versions of these kernels hide the latency somewhat because they can operate on the data as it streams in. Still, the best I can do with the tensor cores is approach the execution time of the older, register-based code; there is no speedup because of the slow global memory. The profiler spots no inefficiencies in my global memory accesses. Is there a smart way to keep the tensor cores well fed with data? Otherwise I don't see how they are helpful, at least for me on the Volta generation.
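For concreteness, this is the kind of overlap I am trying to get: stage the next tile into shared memory while the current one is consumed by mma_sync. The sketch below is simplified, and the tile sizes, names, and one-warp-per-block layout are placeholder assumptions, not my actual kernel. On Volta I can only stage through shared memory with ordinary loads (there is no asynchronous copy instruction on this architecture), so the double buffer only helps if there is enough compute to cover the global load latency.

```
// Double-buffered staging feeding WMMA: prefetch tile i+1 into shared memory
// while the tensor cores work on tile i. Simplified sketch, one warp per block.
// Launch: wmma_double_buffered<<<1, 32>>>(A, B, C, k_tiles, lda, ldb, ldc);
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

constexpr int M = 16, N = 16, K = 16;   // the fp16 WMMA shape on Volta

__global__ void wmma_double_buffered(const __half* __restrict__ A,
                                     const __half* __restrict__ B,
                                     float* __restrict__ C,
                                     int k_tiles, int lda, int ldb, int ldc)
{
    // Two shared-memory buffers per operand so loads for the next K-tile
    // can be issued while mma_sync consumes the current one.
    __shared__ __half a_buf[2][M * K];
    __shared__ __half b_buf[2][K * N];

    wmma::fragment<wmma::matrix_a, M, N, K, __half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, M, N, K, __half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, M, N, K, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    const int lane = threadIdx.x;        // one warp (32 threads) per block

    // Copy one 16x16 A tile and one 16x16 B tile for K-tile index t
    // into shared buffer 'buf', strided across the warp.
    auto stage = [&](int buf, int t) {
        for (int i = lane; i < M * K; i += 32) {
            int r = i / K, c = i % K;
            a_buf[buf][i] = A[r * lda + t * K + c];
        }
        for (int i = lane; i < K * N; i += 32) {
            int r = i / N, c = i % N;
            b_buf[buf][i] = B[(t * K + r) * ldb + c];
        }
    };

    stage(0, 0);                         // prefetch the first tile
    __syncthreads();

    for (int kt = 0; kt < k_tiles; ++kt) {
        const int cur = kt & 1;
        if (kt + 1 < k_tiles)            // start staging the next tile early
            stage(cur ^ 1, kt + 1);

        wmma::load_matrix_sync(a_frag, a_buf[cur], K);
        wmma::load_matrix_sync(b_frag, b_buf[cur], N);
        wmma::mma_sync(acc, a_frag, b_frag, acc);

        __syncthreads();                 // next buffer must be fully staged
    }

    wmma::store_matrix_sync(C, acc, ldc, wmma::mem_row_major);
}
```

Even with this structure, each fp16 16x16 operand tile is only 512 bytes, so a single mma_sync consumes data far faster than global memory can supply it, which is the crux of my problem.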

Hi,

We recommend that you refer to the following docs, which may help you.

Thank you.