Where is cute's gemm code?

Hi! I am learning CuTe from CUTLASS. I know it is an open-source library, but I am completely confused by its GEMM implementation… I am stuck here:

gemm(tCsA, tCsB, tCrC);

So where, exactly, is this gemm function implemented? I don't mean something called gemm that forwards somewhere else, but the code that actually takes A and B and produces C. Is it really published? And how do I find it?

I see that in sgemm_nt_1.cu there is a line: #include <cute/tensor.hpp>, and I guess gemm is defined inside (I am not sure! How can I verify this?)

Looking inside cute/tensor.hpp, I find: #include <cute/algorithm/gemm.hpp>

So I open gemm.hpp, and I find many, many gemm functions! But every one of them does a lot of setup and then calls:

gemm(thr_mma, tCrA(,,k_block), tCrB(,,k_block), tCrC);

Yet another gemm?

At every step I am unsure, so right now I really cannot move forward… Could someone tell me, exactly, where gemm is?

CUTLASS builds upon the ordinary CUDA API mma_sync() and the exposed MMA PTX instructions.
What exactly are you looking for? Calls to mma_sync()? PTX constructs?

An IDE can tell you which overloads of gemm in <cute/algorithm/gemm.hpp> are used. You could also step through the code with a debugger such as cuda-gdb.

I think the core implementations are located in include/cutlass/arch/

1 Like

Firstly, I use VS Code, and when I Ctrl+click on gemm nothing happens (normally that should jump to the definition).
Secondly, what I am looking for is the definition of gemm: exactly how mma_sync works.

Thank you!! How can I do this???

mma_sync() / the respective PTX instruction is the lowest level exposed by NVIDIA. You won't find out how it works by looking at CUTLASS.

For general tensor core programming, I would recommend the blog post “Programming Tensor Cores in CUDA 9” on the NVIDIA Technical Blog.
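To give a feel for that lowest exposed level, here is a minimal sketch of the CUDA WMMA API that wraps those tensor-core instructions. This is one valid configuration (half inputs, float accumulator, 16x16x16 tile, sm_70 or newer), not CUTLASS's code:

```cuda
#include <mma.h>

using namespace nvcuda;

// One warp computes a 16x16x16 tile: C = A * B (half inputs, float accumulate).
// 16x16x16 is one of the supported WMMA shapes; requires sm_70 or newer.
__global__ void wmma_tile(const half* A, const half* B, float* C) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

  wmma::fill_fragment(c_frag, 0.0f);
  wmma::load_matrix_sync(a_frag, A, 16);          // leading dimension 16
  wmma::load_matrix_sync(b_frag, B, 16);
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // the tensor-core MMA
  wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

mma_sync here compiles down to the MMA PTX instructions; what the hardware does below that level is not published.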

1 Like

No… I am learning to modify cutlass…


See here, how can I step into this gemm?

The code is: https://github.com/NVIDIA/cutlass/blob/main/examples/cute/tutorial/sgemm_nt_1.cu

Thank you!!!

The first entry point is https://github.com/NVIDIA/cutlass/blob/56fc3df03b57c5e1a825ec747799bc0f0df4b860/include/cute/algorithm/gemm.hpp#L70

1 Like

Thank you very much! But how did you find it? And further, inside this gemm there is another gemm! Where does that one come from?

template <class TA, class ALayout,
          class TB, class BLayout,
          class TC, class CLayout>
CUTE_HOST_DEVICE
void
gemm(Tensor<TA, ALayout> const& A,
     Tensor<TB, BLayout> const& B,
     Tensor<TC, CLayout>      & C)
{
  return gemm(C, A, B, C);
}

The gemm() functions are all defined in this file. You can look at the number of arguments and their types to identify which overload is called.
The next one is a function with four arguments of type Tensor: https://github.com/NVIDIA/cutlass/blob/56fc3df03b57c5e1a825ec747799bc0f0df4b860/include/cute/algorithm/gemm.hpp#L113 or https://github.com/NVIDIA/cutlass/blob/56fc3df03b57c5e1a825ec747799bc0f0df4b860/include/cute/algorithm/gemm.hpp#L161

Eventually, you will reach the only gemm() function that does not contain another gemm call: https://github.com/NVIDIA/cutlass/blob/56fc3df03b57c5e1a825ec747799bc0f0df4b860/include/cute/algorithm/gemm.hpp#L197

1 Like

Wow, that’s cool! How clever you are!!!

Wait, there is still one question. So now you are here:

How do you know that gemm comes from #include <cute/tensor.hpp>?

Maybe it comes from one of the other headers, like #include <thrust/host_vector.h> or #include <thrust/device_vector.h>?

This is very hard to verify! Each file pulls in even more includes, and you cannot go through every one of them to rule it out!

<thrust/host_vector.h> and <thrust/device_vector.h> provide container types and have nothing to do with matrix multiplication. That leaves only <cute/tensor.hpp>.

I already suggested using a debugger to step through the code. That way you can verify that you actually reach the file with the gemm calls we have discussed so far.

Thank you! I actually did try… but…

nvcc -o sgemm_nt_1 sgemm_nt_1.cu -arch=sm_80 -std=c++17 -I ../cutlass/include -I ../cutlass/tools/util/include --expt-relaxed-constexpr -O0 -g

(cuda-gdb) break sgemm_nt_1.cu:210
Breakpoint 1 at 0xd907: file sgemm_nt_1.cu, line 222.

You see, I want to break at line 210 (which is the gemm call), but it sets the breakpoint at line 222. Why?

Your compile command does not generate debug symbols for device code: -g only covers host code. You need to pass -G too.

For the line numbers, just try running it. If it does not work, you can instruct cuda-gdb to always break on the first instruction of a kernel (set cuda break_on_launch application). Then the device code will be loaded and breakpoints on kernel lines may work. See the CUDA-GDB documentation.

1 Like

Thank you! It works! Mostly…

cuda-gdb steps through the functions one by one, but fails at the last step, like below:

(cuda-gdb) break sgemm_nt_1.cu:210
Breakpoint 1 at 0xd907: file sgemm_nt_1.cu, line 222.
(cuda-gdb) run
Starting program: /home/zyhuang/temp_can/sgemm_nt_1 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
cuda-gdb failed to grab the lock file /tmp/cuda-dbg/cuda-gdb.lock.
Another CUDA debug session (pid 97601) could be in progress.
Are you sure you want to continue? (y or [n]) y
[Detaching after fork from child process 98556]
[New Thread 0x7fffdffff000 (LWP 98560)]
[New Thread 0x7fffdf7fe000 (LWP 98561)]
Using device 0: NVIDIA A100 80GB PCIe  (SM80, 108 SMs)
M = 5120
N = 5120
K = 4096
Verification by comparison with cuBLAS is disabled, either because the CMake option CUTLASS_ENABLE_CUBLAS was explicitly set to OFF, or because CMake could not find cuBLAS.  If you would like to enable verification with cuBLAS, please set the CMake option CUTLASS_ENABLE_CUBLAS to ON, rerun CMake, and recompile this example.
ahahahaahahhahahahhhhhhhhhhhhhhh
[Switching focus to CUDA kernel 0, grid 1, block (2,7,0), thread (0,0,0), device 0, sm 66, warp 18, lane 0]

Thread 1 "sgemm_nt_1" hit Breakpoint 1, gemm_device<int, int, int, float, cute::tuple<cute::C<1>, int>, cute::Layout<cute::tuple<cute::C<128>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<128> > >, cute::Layout<cute::tuple<cute::C<32>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<32> > >, float, cute::tuple<cute::C<1>, int>, cute::Layout<cute::tuple<cute::C<128>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<128> > >, cute::Layout<cute::tuple<cute::C<32>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<32> > >, float, cute::tuple<cute::C<1>, int>, cute::Layout<cute::tuple<cute::C<128>, cute::C<128> >, cute::tuple<cute::C<1>, cute::C<128> > >, cute::Layout<cute::tuple<cute::C<16>, cute::C<16> >, cute::tuple<cute::C<1>, cute::C<16> > >, float, float>
   <<<(40,40,1),(256,1,1)>>> (M=-218129807, N=32767, K=0, A=0x7fff84000000, dA=..., blockA=..., tA=..., 
    B=0x7fff96000000, dB=..., blockB=..., tB=..., C=0x7fff8e000000, dC=..., tC=..., alpha=0, 
    beta=-1.0125765e+31) at sgemm_nt_1.cu:210
210         gemm(tCsA, tCsB, tCrC);
(cuda-gdb) s
cute::gemm<cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > > ><<<(40,40,1),(256,1,1)>>> (
    A=..., B=..., C=...) at /home/zyhuang/temp_can/../cutlass/include/cute/algorithm/gemm.hpp:74
74        return gemm(C, A, B, C);
(cuda-gdb) s
cute::gemm<cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > >, cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > > > (D=..., A=..., 
    B=..., C=...) at /home/zyhuang/temp_can/../cutlass/include/cute/algorithm/gemm.hpp:171
171       return gemm(MMA{}, D, A, B, C);
(cuda-gdb) s
cute::gemm<cute::UniversalFMA<float, float, float, float>, cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > >, cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > >, (void*)0> (mma=..., D=..., A=..., B=..., C=...)
    at /home/zyhuang/temp_can/../cutlass/include/cute/algorithm/gemm.hpp:454
454       gemm(mma,
(cuda-gdb) s
455            make_tensor(D.data(), prepend<3>(D.layout())),      // (1,M,N)
(cuda-gdb)

You see, the last frame is the gemm(mma, … at line 454. From there it should step into that gemm, like you told me. Why doesn't it? (Maybe a bug in cuda-gdb?)

The function call spans multiple lines. All lines will be processed by the debugger before entering the function.

1 Like

Oh! Good answer!! Thank you!!!

(cuda-gdb) step
cute::gemm<cute::UniversalFMA<float, float, float, float>, cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > >, cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ViewEngine<cute::smem_ptr<float*> >, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<16>, cute::C<128> > >, cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > >, (void*)0> (mma=..., D=..., A=..., B=..., C=...)
    at /home/zyhuang/temp_can/../cutlass/include/cute/algorithm/gemm.hpp:454
454       gemm(mma, make_tensor(D.data(), prepend<3>(D.layout())), make_tensor(A.data(), prepend<3>(A.layout())), make_tensor(B.data(), prepend<3>(B.layout())), make_tensor(C.data(), prepend<3>(C.layout())));     // (1,M,N)
(cuda-gdb) step
cute::Tensor<cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > > >::data (this=0x7ffff2fff1a8)
    at /home/zyhuang/temp_can/../cutlass/include/cute/tensor.hpp:166
166         return engine().begin();
(cuda-gdb) step
cute::Tensor<cute::ArrayEngine<float, 64>, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > > >::engine (this=0x4) at /home/zyhuang/temp_can/../cutlass/include/cute/tensor.hpp:154
154         return get<1>(rep_);
(cuda-gdb) step
cute::get<1ul, cute::Layout<cute::tuple<cute::C<8>, cute::C<8> >, cute::tuple<cute::C<1>, cute::C<8> > >, cute::ArrayEngine<float, 64> > (t=...) at /home/zyhuang/temp_can/../cutlass/include/cute/container/tuple.hpp:213
213       return detail::getv<I>(t);

Well, I changed the gemm call into one line, like:

gemm(mma, make_tensor(D.data(), prepend<3>(D.layout())), make_tensor(A.data(), prepend<3>(A.layout())), make_tensor(B.data(), prepend<3>(B.layout())), make_tensor(C.data(), prepend<3>(C.layout())));     // (1,M,N)

But it still does not step into this gemm. Do you happen to know why?

If I counted correctly, this one line contains 16 function calls that are evaluated before gemm itself is entered.

1 Like

Hmm… So after I reach line 454, gemm(mma, …

I want to step directly into this gemm, not into something unrelated like engine().begin().

Do you happen to know how?

Thank you!!!

It's not strange. When you step through code, the debugger follows exactly the order in which the code executes. You can simply continue stepping until all arguments of gemm have been computed and you reach gemm itself.
Or you can set a breakpoint directly at the start of the gemm function (e.g. break gemm.hpp:454) and continue to it.

This is no different than ordinary host debugging.

1 Like