Tensor Convolution example using WMMA?

Does anyone have any pointers on how to implement 2D convolution using tensor cores (thus WMMA ops)?

I know I can use CUDA’s libraries, but I want to learn; something similar to, say, the matrix multiplication example in the SDK? (I guess I could figure out the caching of sub-blocks to shared memory myself. ;)

I do get how to do convolution via matrix multiplication/Toeplitz, but since tensor cores work on pretty big blocks (16x16 or 8x32 per warp?), how would that work for, say, a 3x3 (stride 1) 2D convolution? The filter size would be k=3, so one matrix dimension would be 9. Do you simply zero-pad and take the overhead hit of the wasted operations?
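
The waste from zero-padding alone is easy to quantify. A quick sketch, assuming the standard 16x16x16 WMMA fragment shape and a k x k single-channel filter (so the im2col reduction dimension is k*k); the function name is just for illustration:

```python
# Back-of-the-envelope check of the zero-padding overhead when a k x k
# filter's im2col reduction dimension (k*k rows) is rounded up to the
# WMMA K dimension. Nothing here is specific to any particular kernel.

def padded_utilization(k_filter, wmma_k=16):
    """Fraction of multiply-accumulate slots doing useful work after padding."""
    reduction = k_filter * k_filter                          # im2col rows
    padded = ((reduction + wmma_k - 1) // wmma_k) * wmma_k   # round up to tile
    return reduction / padded

print(padded_utilization(3))   # 9/16  -> 0.5625
print(padded_utilization(5))   # 25/32 -> 0.78125
```

So for k=3 only about 56% of the multiply-accumulates would do useful work.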

Or is it done with Winograd transformations instead? I just cannot figure out how to use tensor cores for small-filter convolutions.

Many thanks,

Adrian

I’ve tried some other unconventional uses of tensor cores myself, with limited success: big-number multiplication. For a 1024x1024-bit multiplication with the full 2048-bit result, I need 5 WMMA 32x8x16 int8 operations. I am able to beat the public XMP library ( https://github.com/NVlabs/xmp ) when it is compiled to use 32-bit IMAD for the Turing architecture, although just barely (maybe 10-15% faster).

And in order to get to this final result, I had to work around the official nvcuda::wmma API, use PTX assembly directly, and rely on undocumented register mappings that may change in the next hardware generation.

What kills the speed benefit of the tensor cores is:

  • the overhead of setting up the matrices in a way suitable for performing the multiplication (plenty of warp shuffles and byte permutes required)
  • the final overlapping summations and carry propagations required to get from lots of int32 partial products to the final result.
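
For the curious, the limb decomposition and carry propagation can be sketched in plain Python. The 8-bit limb width mirrors the int8 WMMA operands; the GPU data layout and the WMMA calls themselves are not modeled here, and the function names are just illustrative:

```python
# Pure-Python model of the arithmetic behind the tensor-core big-number
# multiply: split each 1024-bit operand into 8-bit limbs, form all the
# partial products (what the int8 WMMA ops would compute), then do the
# overlapping column summation and carry propagation described above.

def to_limbs(x, n_limbs=128, bits=8):
    mask = (1 << bits) - 1
    return [(x >> (bits * i)) & mask for i in range(n_limbs)]

def bignum_mul(a, b, n_limbs=128, bits=8):
    la, lb = to_limbs(a, n_limbs, bits), to_limbs(b, n_limbs, bits)
    # Column sums of partial products: acc[k] = sum of la[i]*lb[j] with i+j == k.
    acc = [0] * (2 * n_limbs)
    for i in range(n_limbs):
        for j in range(n_limbs):
            acc[i + j] += la[i] * lb[j]
    # Final carry propagation from per-column int32-style sums to 2048 bits.
    result, carry = 0, 0
    for k, col in enumerate(acc):
        col += carry
        result |= (col & ((1 << bits) - 1)) << (bits * k)
        carry = col >> bits
    return result

import random
random.seed(1)
a, b = random.getrandbits(1024), random.getrandbits(1024)
assert bignum_mul(a, b) == a * b   # check against Python's big integers
```

This only validates the math; the hard part on the GPU is laying the limbs out so the WMMA fragments see them in the right order.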

I would think that even if you manage to find a way to use the tensor cores, the overhead of rearranging the 2D image data into something suitable for the tensor cores might negate the speed benefit you are expecting.

Thank you for taking the time to reply. Let’s say you were to try it anyway, for “academic” purposes?

I just don’t see how to apply them to odd-sized matrices.

To take a (simple) example: a 2D 256x256x3 (RGB) image with a 3x3x3 filter applied at stride 1 (say a blur/sharpen, so technically it could even be done as 1D passes). Using standard im2col you’d get:

col: 3x3x3 = 27 rows; (256-3)/1+1 = 254, thus an output matrix of [27 x 64,516]
for 3 filters of 3x3x3 we get a 2nd matrix of [3 x 27]
multiply, then reshape each row of 64,516 (sqrt(64,516) = 254) back to [254 x 254], x3 channels

(OK, we’d use padding here to keep the image the same size, but let’s keep the example simple.)
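
The bookkeeping above can be double-checked with a few lines of arithmetic (shapes only, no actual image data; variable names are just for illustration):

```python
# Shape arithmetic for im2col on a 256x256x3 image, 3x3x3 filter, stride 1.
H = W = 256
C = 3
k = 3
stride = 1

out = (H - k) // stride + 1    # output side: 254
col_rows = k * k * C           # im2col rows: 27
col_cols = out * out           # im2col columns: 64,516

filters = 3
# filter matrix:  [filters x col_rows] = [3 x 27]
# product:        [filters x col_cols] = [3 x 64,516]
# each product row reshapes back to one [254 x 254] channel
print(out, col_rows, col_cols)   # -> 254 27 64516
```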

Now how do you use tensor cores to multiply matrices of these sizes? Especially the [1 x …] one: do you just pad it up to [8 x 32]? But then you have to pad the other one to [32 x 64,516], meaning a huge number of wasted calculations?

Is 2D convolution done some other way, not with im2col? I just don’t get it.

Thanks,

Adrian

I’ve looked at the example given here: “2-D convolution as a matrix-matrix multiplication” on Stack Overflow.

It seems with a tensor core you could do simultaneous convolutions of multiple images with the same filter kernel. So what the first answer to the above question calls “vectorized I” is then actually multiple images stored in a matrix, with each image taking up one column.

Similarly, the result vector would then receive multiple convolved images.

And then maybe you could work on the source image in a blockwise manner, so you do not have to zero-pad your filter kernel all the way out to the dimensions of a 258x258 image. Work on 14x14-pixel chunks of the source image instead (so the filter only gets zero-padded to a 16x16 matrix).

Then maybe those source image chunks could be interpreted as “multiple images” that one could process in parallel.

It could all be prototyped in Matlab or Octave before attempting a CUDA version.
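
As one stand-in for such a prototype, here is a small NumPy sketch of the im2col + matrix-multiply route, checked against a direct nested-loop convolution (single channel, no chunking or tensor-core padding; all function names are illustrative):

```python
# Prototype of convolution (cross-correlation) via im2col + matmul,
# validated against a direct sliding-window loop. This only checks the
# data reshuffle, not any tensor-core tiling.

import numpy as np

def im2col(img, k):
    h, w = img.shape
    oh, ow = h - k + 1, w - k + 1
    cols = np.empty((k * k, oh * ow), dtype=img.dtype)
    for i in range(k):
        for j in range(k):
            # Row (i, j) holds pixel (y+i, x+j) for every output position.
            cols[i * k + j] = img[i:i + oh, j:j + ow].ravel()
    return cols

def conv_im2col(img, kernel):
    k = kernel.shape[0]
    oh, ow = img.shape[0] - k + 1, img.shape[1] - k + 1
    # [1 x k*k] @ [k*k x oh*ow] -> [1 x oh*ow], reshaped to the output image.
    return (kernel.ravel()[None, :] @ im2col(img, k)).reshape(oh, ow)

def conv_direct(img, kernel):
    k = kernel.shape[0]
    oh, ow = img.shape[0] - k + 1, img.shape[1] - k + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(img[y:y + k, x:x + k] * kernel)
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((16, 16))
kernel = rng.standard_normal((3, 3))
assert np.allclose(conv_im2col(img, kernel), conv_direct(img, kernel))
```

From here, the blockwise idea would be to run `im2col` per 14x14 chunk and stack the chunks as extra columns, but that part is left out above.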

Christian

Yes, that’s exactly what I am trying to do.

Just processing a really big 2D image rather than many small ones, and with just one filter.

I guess with a “normal convolution” implementation the input gets broken into (thread) blocks anyway, so it’s a matter of how to do it properly for tensor cores.

I was hoping it was just a matter of running im2col and then passing the result to the tensor core matrix multiplication example, and that’s it, but it seems somewhat less trivial to do.

PS: The tensor core examples I see use bigger (sub-)tiles of 64x64 (16 warps?). Is it OK/valid to use smaller sub-tiles like 16x16?

Adrian