How many threads and blocks does CUTLASS use? (for the tall-C case in the official post)

Hi! I am learning CUTLASS, and I have read this post: CUTLASS: Fast Linear Algebra in CUDA C++ | NVIDIA Technical Blog
However, I cannot find the official “dispatch_policies.h”; the only copy I can find is in Hugging Face’s GitHub repository:
pytorch_block_sparse/dispatch_policies.h at master · huggingface/pytorch_block_sparse · GitHub

Actually, I am developing a kernel for a “tall” matmul, where the resulting C (m×n) has large M and small N. So I am quite interested in the parameters shown in the post:

// GEMM task policy specialization for tall SGEMM
template <>
struct gemm_policy<float, float, problem_size_t::Tall> :
    block_task_policy<
        128,       // BlockItemsY - Height in rows of a tile
        32,        // BlockItemsX - Width in columns of a tile
        8,         // ThreadItemsY - Height in rows of a thread-tile
        4,         // ThreadItemsX - Width in columns of a thread-tile
        8,         // BlockItemsK - Depth of a tile
        true,      // UseDoubleScratchTiles - whether to double-buffer SMEM
        grid_raster_strategy::Default>   // Grid rasterization strategy
{};

My question is: how many threads and blocks are allocated here? I really cannot find any information about this nearby…

Thank you!!!

You can use a profiler to discover this information, if you’re not able to parse the source code.
