Where is cute's gemm code?

Hi! I am learning CuTe from CUTLASS. I know it is an open-source library, but I am completely confused by its GEMM implementation… I am stuck here:

gemm(tCsA, tCsB, tCrC);

So where, exactly, is this gemm function implemented? I don't mean something called gemm that forwards somewhere else, but the code that actually takes A and B and produces C. Is it really published? And how do I find it?

I see that in sgemm_nt_1.cu there is a line: #include <cute/tensor.hpp>, and I guess gemm is defined inside (I am not sure! How can I verify this?)

Looking inside cute/tensor.hpp, I find: #include <cute/algorithm/gemm.hpp>

So I open gemm.hpp, and I find many, many gemm functions! But every one of them does a lot of setup and then calls:

gemm(thr_mma, tCrA(,,k_block), tCrB(,,k_block), tCrC);

Yet another gemm?

At every step I am unsure, so right now I really cannot move forward… Could someone tell me, exactly, where gemm is?

CUTLASS builds upon the ordinary CUDA API mma_sync() and the exposed MMA PTX instructions.
What exactly are you looking for? Calls to mma_sync()? PTX constructs?

An IDE can tell you which overloads of gemm in <cute/algorithm/gemm.hpp> are used. You could also step through the code with a debugger such as cuda-gdb.

I think the core implementations are located in include/cutlass/arch/

1 Like

Firstly, I use VS Code, and when I Ctrl+click on gemm nothing happens (normally that should jump to the definition).
Secondly, what I am looking for is the definition of gemm: exactly how mma_sync works.

Thank you!! How can I do this???

mma_sync() / the respective PTX instruction is the lowest level exposed by NVIDIA. You won't find out how it works by looking at CUTLASS.

For general tensor core programming, I would recommend the blog post “Programming Tensor Cores in CUDA 9” on the NVIDIA Technical Blog.
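To give a feel for that lowest exposed level, here is a minimal sketch of the CUDA WMMA API that wraps those tensor-core instructions. This is one valid configuration (half inputs, float accumulator, 16x16x16 tile, sm_70 or newer), not CUTLASS's code:

```cuda
#include <mma.h>

using namespace nvcuda;

// One warp computes a 16x16x16 tile: C = A * B (half inputs, float accumulate).
// 16x16x16 is one of the supported WMMA shapes; requires sm_70 or newer.
__global__ void wmma_tile(const half* A, const half* B, float* C) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

  wmma::fill_fragment(c_frag, 0.0f);
  wmma::load_matrix_sync(a_frag, A, 16);          // leading dimension 16
  wmma::load_matrix_sync(b_frag, B, 16);
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // the tensor-core MMA
  wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

mma_sync here compiles down to the MMA PTX instructions; what the hardware does below that level is not published.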

1 Like

No… I am learning to modify cutlass…


See here, how can I step into this gemm?

The code is: https://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/sgemm_nt_1.cu

Thank you!!!

The first entry point is https://github.com/NVIDIA/cutlass/blob/56fc3df03b57c5e1a825ec747799bc0f0df4b860/include/cute/algorithm/gemm.hpp#L70

1 Like

Thank you very much! But how did you find it? And further, inside this gemm there is another gemm! Where does that one come from?

template <class TA, class ALayout,
          class TB, class BLayout,
          class TC, class CLayout>
CUTE_HOST_DEVICE
void
gemm(Tensor<TA, ALayout> const& A,
     Tensor<TB, BLayout> const& B,
     Tensor<TC, CLayout>      & C)
{
  return gemm(C, A, B, C);
}

The gemm() functions are all defined in this file. You can look at the number of arguments and their types to identify which overload is called.
The next one is a function with four arguments of type Tensor: https://github.com/NVIDIA/cutlass/blob/56fc3df03b57c5e1a825ec747799bc0f0df4b860/include/cute/algorithm/gemm.hpp#L113 or https://github.com/NVIDIA/cutlass/blob/56fc3df03b57c5e1a825ec747799bc0f0df4b860/include/cute/algorithm/gemm.hpp#L161

Eventually, you will reach the only gemm() function that does not contain another gemm call: https://github.com/NVIDIA/cutlass/blob/56fc3df03b57c5e1a825ec747799bc0f0df4b860/include/cute/algorithm/gemm.hpp#L197

1 Like

Wow, that’s cool! How clever you are!!!

Wait, there is still one question. So now you are here:

How do you know that gemm comes from #include <cute/tensor.hpp>?

Maybe it comes from one of the other headers, like #include <thrust/host_vector.h> or #include <thrust/device_vector.h>?

This is very hard to verify! Each file pulls in even more includes, and you cannot go through every one of them to rule it out!

<thrust/host_vector.h> and <thrust/device_vector.h> provide container types and have nothing to do with matrix multiplication. That leaves only <cute/tensor.hpp>.

I already suggested using a debugger to step through the code. That way you can verify that you actually reach the file with the gemm calls we have discussed so far.

Thank you! I actually did try… but…

nvcc -o sgemm_nt_1 sgemm_nt_1.cu -arch=sm_80 -std=c++17 -I ../cutlass/include -I ../cutlass/tools/util/include --expt-relaxed-constexpr -O0 -g

(cuda-gdb) break sgemm_nt_1.cu:210
Breakpoint 1 at 0xd907: file sgemm_nt_1.cu, line 222.

You see, I want to break at line 210 (which is the gemm call), but it sets the breakpoint at line 222. Why?

Your compile command does not generate debug symbols for device code: -g only covers host code. You need to pass -G too.

For the line numbers, just try running it. If it does not work, you can instruct cuda-gdb to always break on the first instruction of a kernel (set cuda break_on_launch application). Then the device code will be loaded and breakpoints on kernel lines may work. See the CUDA-GDB documentation.

1 Like

Thank you! It works! Mostly…

cuda-gdb steps through the functions one by one, but fails at the last step, like below:

(cuda-gdb) break sgemm_nt_1.cu:210
Breakpoint 1 at 0xd907: file sgemm_nt_1.cu, line 222.
(cuda-gdb) run
Starting program: /home/zyhuang/temp_can/sgemm_nt_1 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
cuda-gdb failed to grab the lock file /tmp/cuda-dbg/cuda-gdb.lock.
Another CUDA debug session (pid 97601) could be in progress.
Are you sure you want to continue? (y or [n]) y
[Detaching after fork from child process 98556]
[New Thread 0x7fffdffff000 (LWP 98560)]
[New Thread 0x7fffdf7fe000 (LWP 98561)]
Using device 0: NVIDIA A100 80GB PCIe  (SM80, 108 SMs)
M = 5120
N = 5120
K = 4096
Verification by comparison with cuBLAS is disabled, either because the CMake option CUTLASS_ENABLE_CUBLAS was explicitly set to OFF, or because CMake could not find cuBLAS.  If you would like to enable verification with cuBLAS, please set the CMake option CUTLASS_ENABLE_CUBLAS to ON, rerun CMake, and recompile this example.
ahahahaahahhahahahhhhhhhhhhhhhhh
[Switching focus to CUDA kernel 0, grid 1, block (2,7,0), thread (0,0,0), device 0, sm 66, warp 18, lane 0]

Thread 1 "sgemm_nt_1" hit Breakpoint 1, gemm_device<int, int, int, float, cute::tuple<cute::C<1>, int>, cute::Layout<cute::tuple<cute::C<128>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<128> > >, cute::Layout<cute::tuple<cute::C<32>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<32> > >, float, cute::tuple<cute::C<1>, int>, cute::Layout<cute::tuple<cute::C<128>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<128> > >, cute::Layout<cute::tuple<cute::C<32>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<32> > >, float, cute::tuple<cute::C<1>, int>, cute::Layout<cute::tuple<cute::C<128>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<128> > >, cute::Layout<cute::tuple<cute::C<16>, cute::C<16> >, cute::tuple<cute::C<1>, cute::C<16> > >, float, float>
   <<<(40,40,1),(256,1,1)>>> (M=-218129807, N=32767, K=0, A=0x7fff84000000, dA=..., blockA=..., tA=..., 
    B=0x7fff96000000, dB=..., blockB=..., tB=..., C=0x7fff8e000000, dC=..., tC=..., alpha=0, 
    beta=-1.0125765e+31) at sgemm_nt_1.cu:210
210         gemm(tCsA, tCsB, tCrC);
(cuda-gdb) s
cute::gemm<cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > > ><<<(40,40,1),(256,1,1)>>> (
    A=..., B=..., C=...) at /home/zyhuang/temp_can/../cutlass/include/cute/algorithm/gemm.hpp:74
74        return gemm(C, A, B, C);
(cuda-gdb) s
cute::gemm<cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > >, cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > > > (D=..., A=..., 
    B=..., C=...) at /home/zyhuang/temp_can/../cutlass/include/cute/algorithm/gemm.hpp:171
171       return gemm(MMA{}, D, A, B, C);
(cuda-gdb) s
cute::gemm<cute::UniversalFMA<float, float, float, float>, cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > >, cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > >, (void*)0> (mma=..., D=..., A=..., B=..., C=...)
    at /home/zyhuang/temp_can/../cutlass/include/cute/algorithm/gemm.hpp:454
454       gemm(mma,
(cuda-gdb) s
455            make_tensor(D.data(), prepend<3>(D.layout())),      // (1,M,N)
(cuda-gdb)

You see, the last frame is the gemm(mma, … at line 454. From there it should step into that gemm, like you told me. Why doesn't it? (Maybe a bug in cuda-gdb?)

The function call spans multiple lines. All lines will be processed by the debugger before entering the function.

1 Like

Oh! Good answer!! Thank you!!!

(cuda-gdb) step
cute::gemm<cute::UniversalFMA<float, float, float, float>, cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > >, cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > >, (void*)0> (mma=..., D=..., A=..., B=..., C=...)
    at /home/zyhuang/temp_can/../cutlass/include/cute/algorithm/gemm.hpp:454
454       gemm(mma, make_tensor(D.data(), prepend<3>(D.layout())), make_tensor(A.data(), prepend<3>(A.layout())), make_tensor(B.data(), prepend<3>(B.layout())), make_tensor(C.data(), prepend<3>(C.layout())));     // (1,M,N)
(cuda-gdb) step
cute::Tensor<cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > > >::data (this=0x7ffff2fff1a8)
    at /home/zyhuang/temp_can/../cutlass/include/cute/tensor.hpp:166
166         return engine().begin();
(cuda-gdb) step
cute::Tensor<cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > > >::engine (this=0x4) at /home/zyhuang/temp_can/../cutlass/include/cute/tensor.hpp:154
154         return get<1>(rep_);
(cuda-gdb) step
cute::get<1ul, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > >, cute::ArrayEngine<float, 64> > (t=...) at /home/zyhuang/temp_can/../cutlass/include/cute/container/tuple.hpp:213
213       return detail::getv<I>(t);

Well, I changed the gemm call into one line, like:

gemm(mma, make_tensor(D.data(), prepend<3>(D.layout())), make_tensor(A.data(), prepend<3>(A.layout())), make_tensor(B.data(), prepend<3>(B.layout())), make_tensor(C.data(), prepend<3>(C.layout())));     // (1,M,N)

But it still does not step into this gemm. Do you happen to know why?

If I counted correctly, this one line contains 16 function calls that are evaluated before gemm itself is entered.

1 Like

Hmm… So after I reach line 454, gemm(mma, …

I want to step directly into this gemm, not into something unrelated like engine().begin().

Do you happen to know how?

Thank you!!!

It's not strange. When you step through code, the debugger follows exactly the order in which the code executes. You can simply continue stepping until all arguments of gemm have been computed and you reach gemm itself.
Or you can set a breakpoint directly at the start of the gemm function (e.g. break gemm.hpp:454) and continue to it.

This is no different than ordinary host debugging.

1 Like