Hi! I am learning CUTLASS, and I read this post: CUTLASS: Fast Linear Algebra in CUDA C++ | NVIDIA Technical Blog
But I cannot find an official “dispatch_policies.h”; I only found one in Hugging Face’s GitHub:
pytorch_block_sparse/dispatch_policies.h at master · huggingface/pytorch_block_sparse · GitHub
Actually, I am developing a kernel for a “tall” matmul, where the result C (m × n) has small N and large M. So I am quite interested in these parameters (as shown in the post):
// GEMM task policy specialization for tall SGEMM
template <>
struct gemm_policy<float, float, problem_size_t::Tall> :
    block_task_policy<
        128,   // BlockItemsY - Height in rows of a tile
        32,    // BlockItemsX - Width in columns of a tile
        8,     // ThreadItemsY - Height in rows of a thread-tile
        4,     // ThreadItemsX - Width in columns of a thread-tile
        8,     // BlockItemsK - Depth of a tile
        true,  // UseDoubleScratchTiles - whether to double-buffer SMEM
        grid_raster_strategy::Default>  // Grid rasterization strategy
{};
My question is: how many threads and blocks are allocated here? I really cannot find any info about this nearby…
Thank you!!!