Does anyone have any pointers on how to implement 2D convolution using tensor cores (i.e. WMMA ops)?

I know I can use CUDA's libraries, but I want to learn. Is there something similar to, say, the matrix multiplication example in the SDK? (I guess I could figure out caching sub-blocks in shared memory ;)

I do get how to do convolution via matrix multiplication (Toeplitz/im2col), but since tensor cores work on a fairly big tile per warp (16x16x16, or 8x32x16?), how would that work for, say, a 3x3, stride-1 2D convolution? The filter size would be k=3, so one matrix dimension would be only 9. Do you simply zero-pad that dimension up to the tile size and take the overhead of the wasted operations?
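
To make the padding question concrete, here is a rough host-side C++ sketch of what I mean. It is not tensor-core code, just the bookkeeping. All the names (`round_up`, `gemm_k`, `k_utilization`, `im2col_3x3_padded`) are ones I made up for illustration, and it assumes a 16-wide K tile, a single-channel image of at least 3x3, stride 1, and no border padding. As far as I understand it, with C input channels the reduction dimension becomes k*k*C rather than k*k, so the waste would only be this bad for a single channel.

```cpp
#include <cstddef>
#include <vector>

// Round n up to the next multiple of tile (e.g. an assumed 16-wide K tile).
constexpr std::size_t round_up(std::size_t n, std::size_t tile) {
    return (n + tile - 1) / tile * tile;
}

// Reduction (K) dimension of the GEMM for a k x k filter over `channels` input channels.
constexpr std::size_t gemm_k(std::size_t k, std::size_t channels) {
    return k * k * channels;
}

// Fraction of the zero-padded K dimension that carries real data.
constexpr double k_utilization(std::size_t k, std::size_t channels, std::size_t tile) {
    return static_cast<double>(gemm_k(k, channels)) /
           static_cast<double>(round_up(gemm_k(k, channels), tile));
}

// im2col for a single-channel h x w image and a 3x3 filter, stride 1, "valid" output.
// Each output pixel becomes one row of length k_pad (k_pad >= 9 assumed);
// columns 9 .. k_pad-1 are left at zero, which is the wasted work in question.
std::vector<float> im2col_3x3_padded(const std::vector<float>& img,
                                     std::size_t h, std::size_t w,
                                     std::size_t k_pad) {
    const std::size_t out_h = h - 2;  // assumes h >= 3
    const std::size_t out_w = w - 2;  // assumes w >= 3
    std::vector<float> cols(out_h * out_w * k_pad, 0.0f);
    for (std::size_t y = 0; y < out_h; ++y) {
        for (std::size_t x = 0; x < out_w; ++x) {
            const std::size_t row = y * out_w + x;
            for (std::size_t fy = 0; fy < 3; ++fy) {
                for (std::size_t fx = 0; fx < 3; ++fx) {
                    cols[row * k_pad + fy * 3 + fx] = img[(y + fy) * w + (x + fx)];
                }
            }
        }
    }
    return cols;
}
```

With these definitions, `k_utilization(3, 1, 16)` is 9/16 (about 56% useful work), whereas with 16 input channels `gemm_k(3, 16)` is 144, already a multiple of 16, so nothing would be wasted along that dimension.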

Or is it done with Winograd transformations instead? I just cannot figure out how to use tensor cores for small-filter convolution.

Many thanks,

Adrian