How many threads and blocks does CUTLASS use? (for the tall-C case in the official post)

Hi! I am learning CUTLASS, and I have read this post: CUTLASS: Fast Linear Algebra in CUDA C++ | NVIDIA Technical Blog
However, I cannot find the official “dispatch_policies.h”; the only copy I can find is in Hugging Face’s GitHub repository:
pytorch_block_sparse/dispatch_policies.h at master · huggingface/pytorch_block_sparse · GitHub

Actually, I am developing a kernel for a “tall” matmul, where the resulting C (m×n) has large M and small N. So I am quite interested in the parameters shown in the post:

// GEMM task policy specialization for tall SGEMM
template <>
struct gemm_policy<float, float, problem_size_t::Tall> :
    block_task_policy<
        128,       // BlockItemsY - Height in rows of a tile
        32,        // BlockItemsX - Width in columns of a tile
        8,         // ThreadItemsY - Height in rows of a thread-tile
        4,         // ThreadItemsX - Width in columns of a thread-tile
        8,         // BlockItemsK - Depth of a tile
        true,      // UseDoubleScratchTiles - whether to double-buffer SMEM
        grid_raster_strategy::Default>   // Grid rasterization strategy
{};

My question is: how many threads and blocks are allocated here? I really cannot find any information about this nearby…

Thank you!!!

You can use a profiler to discover this information, if you’re not able to parse the source code.
