Originally published at: https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/

Update May 21, 2018: CUTLASS 1.0 is now available as Open Source software at the CUTLASS repository. CUTLASS 1.0 has changed substantially from our preview release described in the blog post below. We have de…

Thanks for the great tutorial! I am trying to understand better what "fusing element-wise operation" means. I implement lots of custom LSTMs with PyTorch (and fusing is a big problem if I understand things correctly). I don't write CUDA code, so the explanation in the tutorial about the gemm::epilo…

Hi Alain, That's correct. Although, ReLU(C) wouldn't need to stage through shared memory since each element is only accessed once. But it saves a load and store of C. Niall
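
For readers who do not write CUDA, the saving described above can be sketched on the CPU. The C++ below is an illustration only, not CUTLASS code: `gemm_then_relu`, `gemm_fused_relu`, and the `Traffic` counters are hypothetical names introduced here, and the counters simply tally accesses to C's storage so the saved load and store become visible.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical counters for illustration; not part of CUTLASS.
struct Traffic {
    std::size_t c_loads = 0;   // reads of C from its backing storage
    std::size_t c_stores = 0;  // writes of C to its backing storage
};

// Unfused: the GEMM writes C, then a separate pass reloads C to apply ReLU.
std::vector<float> gemm_then_relu(const std::vector<float>& A, const std::vector<float>& B,
                                  int M, int N, int K, Traffic& t) {
    std::vector<float> C(static_cast<std::size_t>(M) * N);
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
            ++t.c_stores;
        }
    }
    for (std::size_t idx = 0; idx < C.size(); ++idx) {
        const float v = C[idx];          // extra load of C
        ++t.c_loads;
        C[idx] = std::max(v, 0.0f);      // extra store of C
        ++t.c_stores;
    }
    return C;
}

// Fused: ReLU is applied to the accumulator while it is still in a local
// variable (a register on the GPU), so C is stored exactly once.
std::vector<float> gemm_fused_relu(const std::vector<float>& A, const std::vector<float>& B,
                                   int M, int N, int K, Traffic& t) {
    std::vector<float> C(static_cast<std::size_t>(M) * N);
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = std::max(acc, 0.0f);  // the "epilogue" step
            ++t.c_stores;
        }
    }
    return C;
}
```

Both functions return the same matrix; the fused version records one store per element of C and no loads, while the unfused version records one load and two stores per element.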

Thanks for the write-up! But I don't quite get the essence of the thread tile. In figure 5, it seems that one thread is responsible for calculating the outer product for 4 locations in the warp accumulator, so I don't understand where the 8x8 matrix (on the right of fig 5) comes from. Also, to my under…

Hello Andrew. This is Isaac from GTC, who had the good fortune of talking to you about CUTLASS. Your explanation of CUTLASS was extremely helpful. Thanks so much. In the "Complete GEMM" block code, this line: accumulator[thread_x][thread_y] += frag_a[y]*frag_b[x]; seems to contain a typo. Should y be repl…

Hello Andrew, I'm somewhat confused as to how you're getting simultaneous global load and computation in the same CTA (Software Pipelining) when those sections are separated by a __syncthreads() in your GEMM pseudo-code. My understanding was that in the following setup, all threads in a CTA must either…

Replying for Andrew: Two buffers are allocated in shared memory. One is actively being written with values fetched from global memory (the threadblock tile), while the other SMEM buffer is being read into registers (the warp-scoped tile). At the appropriate point in the mainloop body, all…
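
The two-buffer ordering described above can be sketched in plain C++. This is a rough illustration under stated assumptions, not the CUTLASS mainloop: `dot_double_buffered` and `kTile` are hypothetical names, the workload is a simple dot product rather than a GEMM, and sequential CPU code cannot actually overlap the copy with the math. It only shows which buffer is written and which is read in each iteration.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kTile = 4;  // hypothetical tile length

float dot_double_buffered(const std::vector<float>& a, const std::vector<float>& b) {
    const std::size_t n = std::min(a.size(), b.size());
    const std::size_t num_tiles = (n + kTile - 1) / kTile;

    // Two staging buffers per operand stand in for the two shared-memory buffers.
    std::array<std::array<float, kTile>, 2> buf_a{}, buf_b{};

    // Copies tile `t` into staging buffer `slot`, zero-padding past the end.
    auto load_tile = [&](std::size_t t, std::size_t slot) {
        for (std::size_t i = 0; i < kTile; ++i) {
            const std::size_t idx = t * kTile + i;
            buf_a[slot][i] = idx < n ? a[idx] : 0.0f;
            buf_b[slot][i] = idx < n ? b[idx] : 0.0f;
        }
    };

    float acc = 0.0f;
    if (num_tiles == 0) return acc;

    load_tile(0, 0);  // prologue: fill the first buffer before the loop starts
    for (std::size_t t = 0; t < num_tiles; ++t) {
        const std::size_t cur = t % 2;
        const std::size_t nxt = 1 - cur;

        // Write the *other* buffer with the next tile. On a GPU these loads
        // would be issued here and stay in flight during the math below.
        if (t + 1 < num_tiles) load_tile(t + 1, nxt);

        // Compute from the buffer that was filled in the previous iteration.
        for (std::size_t i = 0; i < kTile; ++i) acc += buf_a[cur][i] * buf_b[cur][i];

        // Assumption of this sketch: a GPU kernel would need a barrier around
        // here, before the two buffers swap roles.
    }
    return acc;
}
```

Because each iteration reads one buffer and writes the other, the load for tile t+1 never touches the data being consumed for tile t, which is what lets a GPU keep the loads in flight while the math proceeds.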

While this doesn't directly answer my question, I like the solution you have presented here. It addresses the main issue we have with using shared memory. All of the simple examples that I've seen that use shared memory seem to throw away the nice latency hiding feature of the GPU with naive usage…

Thanks for the feedback! I'll pass this along to Andrew.

Sorry for digging this up, but I am really confused by Figure 5. [image] In particular, I don’t understand how we got the “8-by-8 overall thread tile”. There are a total of 4*8=32 threads in the warp, each computing a 2-by-2 block (there are 4 green cells on the left). How do we get 8-by-8 from t…

Hi! I am deeply impressed by the “different policy for different type of SGEMM”. But I cannot find how it is decided which problem size falls into which type, such as “tall”, “large”, and so on. Could you provide a link to the code that decides the SGEMM type? Thank you!!!

CUTLASS: Fast Linear Algebra in CUDA C++


igor_furoa September 9, 2024, 2:04pm 15

The state-of-the-art architecture at the time of that post was Volta, which had Tensor Cores, each capable of doing 64 fused multiply-add (FMA) operations per clock. That’s why the thread tile was organized in an 8 x 8 grid. Ampere did 256 FMA operations per clock. Here’s a good post explaining more.
