Why is addmm fast? How can I use it in my kernel?

Hi! I am implementing addmm for my linear-layer kernel in PyTorch using C++. I have heard that addmm is really fast because of its CUDA implementation and the GPU's structure? I am wondering whether it is possible to combine it somehow into my matmul kernel. As shown in many places on Google, I am currently using a simple shared-memory version of matmul.
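To be concrete, the kernel I have now looks roughly like this (a minimal sketch of the usual shared-memory tiled matmul, not my exact code; the tile size and row-major layout here are just assumptions):

```cpp
#define TILE 16

// C = A * B for row-major A (M x K) and B (K x N);
// each block computes one TILE x TILE tile of C.
__global__ void matmul_tiled(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        // Stage one tile of A and one tile of B into shared memory.
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        // Accumulate the partial dot product for this tile.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}
```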

Thank you!!!

I guess by “addmm” you are referring to something in PyTorch. I doubt you could use it directly from kernel code, although the implementation is open source, so you could probably adapt something from it.

You might get better answers to a question like this in a pytorch forum.
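If it is the PyTorch function, then as far as I know addmm(input, mat1, mat2) just computes beta * input + alpha * (mat1 @ mat2) on the host side, which is exactly a linear layer when input is the bias. A small libtorch sketch (the shapes here are made up for illustration):

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
    auto x = torch::randn({8, 32});   // batch of 8 inputs, 32 features
    auto W = torch::randn({16, 32});  // linear-layer weight: out_features x in_features
    auto b = torch::randn({16});      // bias

    // addmm fuses the bias add with the matrix multiply: b + x @ W^T
    auto out_addmm  = torch::addmm(b, x, W.t());
    // Same result as an ordinary linear layer.
    auto out_linear = torch::nn::functional::linear(x, W, b);

    std::cout << torch::allclose(out_addmm, out_linear) << "\n";  // prints 1
    return 0;
}
```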


Haha, actually no… see here: cuBLAS :: CUDA Toolkit Documentation, section 3.3.2. It has a “with bias” option.

“addmm” doesn’t appear anywhere in the cublas documentation you linked.

Your reference to section 3.3.2 doesn’t clear anything up for me.

Like you’re speaking in code. I can’t decipher it. Maybe it will be obvious to someone else.

Anyway, you cannot call cuBLAS from kernel/device code; it is a host-side library.
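To illustrate what I mean, cuBLAS is driven from host code. A sketch (single precision, column-major data, error checks omitted) where beta = 1 keeps the contents of C, so pre-filling C with the broadcast bias gives the same fused multiply-plus-add that addmm does:

```cpp
#include <cublas_v2.h>

// Computes C = alpha * A * B + beta * C on the GPU, called from the host.
// A is M x K, B is K x N, C is M x N, all column-major device pointers.
void gemm_add(cublasHandle_t handle,
              const float* d_A, const float* d_B, float* d_C,
              int M, int N, int K) {
    const float alpha = 1.0f;
    const float beta  = 1.0f;  // beta = 1: add the product onto whatever is already in C
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                M, N, K,
                &alpha, d_A, M,   // leading dimension of A
                        d_B, K,   // leading dimension of B
                &beta,  d_C, M);  // leading dimension of C
}
```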

Good luck!


Emm… yes… actually I guess something related should exist in CUTLASS, which is open source. I will check it later…
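For anyone who lands here later, this is roughly what I am planning to try, sketched from CUTLASS's basic device-level GEMM example (the defaults picked by the template may vary between CUTLASS versions). With beta = 1 and C pre-loaded with the bias, it computes D = alpha * A * B + C, i.e. the addmm-style fusion:

```cpp
#include <cutlass/gemm/device/gemm.h>

// Single-precision, column-major GEMM using CUTLASS's device-level API.
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor>;  // C and D

cutlass::Status run_gemm(int M, int N, int K,
                         const float* d_A, int lda,
                         const float* d_B, int ldb,
                         float* d_C, int ldc,
                         float alpha, float beta) {
    Gemm gemm_op;
    Gemm::Arguments args({M, N, K},      // problem size
                         {d_A, lda},     // A
                         {d_B, ldb},     // B
                         {d_C, ldc},     // C (source, scaled by beta)
                         {d_C, ldc},     // D (destination)
                         {alpha, beta}); // epilogue scalars
    return gemm_op(args);                // launch on the default stream
}
```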