| Topic | Replies | Views | Activity |
|---|---|---|---|
| cuBLAS severe underperformance on cublasSgemm for RTX 3060 Laptop GPU | 1 | 56 | May 14, 2026 |
| Clangd intellisense error for cutlass/Cute (Windows/VSCode) | 0 | 56 | January 7, 2026 |
| Switch from "sm90_xmma_gemm.._cublas"/ "void cutlass::Kernel<cutlass_80_tensorop_.." kernels with CUDA-12.1 to "nvjet_tst..." kernels with CUDA-12.8 | 0 | 248 | October 26, 2025 |
| NVSHMEM issue with warpgroup all reduce | 2 | 125 | October 9, 2025 |
| Why is cuBLAS cublasDgemm slower than my naive GEMM kernel? | 1 | 121 | September 15, 2025 |
| Adding ThreadblockSwizzle | 0 | 70 | September 4, 2025 |
| How to Map CUTLASS AND CuTe Layouts to Linear Indexes (Hierarchical) | 1 | 335 | May 24, 2025 |
| Error compiling cuFFTDx code: ‘cudafe++’ died with status 0xC0000409 | 2 | 190 | October 8, 2024 |
| H100 PCIe hgemm cannot reach peak performance | 3 | 670 | May 6, 2024 |
| GEMM stage on ampere | 0 | 399 | March 12, 2024 |
| [cuBLASDx] no instance of overloaded function "__half::__half" matches the specified type | 2 | 778 | January 30, 2024 |
| How to enable Tensor core for cublasSgemmBatched on H100? | 4 | 1123 | November 3, 2023 |
| Cutlasss Functionality for SIMT | 1 | 409 | October 30, 2023 |
| Is there any official benchmark tool to test a GPU's FLOPS? | 3 | 7160 | October 24, 2023 |
| Cutlass not working in ARM-based machine | 1 | 508 | April 12, 2023 |
| What does "sliced1x4_nn" mean in matmul? | 0 | 685 | June 17, 2022 |
| What is "custom" "custom-back" size for SGEMM in cutlass? | 0 | 577 | June 16, 2022 |
| Where does cutlass' detailed GEMM kernel? | 4 | 1125 | June 16, 2022 |
| How many threads and blocks does cutlass use? (When C is tall in official post) | 1 | 716 | June 14, 2022 |
| How to compile cutlass app using JIT | 1 | 1006 | May 23, 2022 |
| Using CUTLASS to get inverse of a matrix | 1 | 1351 | December 7, 2021 |
| Understanding cutlass GEMM hierarchy | 1 | 3737 | October 14, 2021 |