Where is CUTLASS's detailed GEMM kernel?

202476410arsmart · June 14, 2022, 3:39pm

Hi! I am learning CUTLASS, and I saw something like this (from the official post):

/// CUTLASS SGEMM example
__global__ void gemm_kernel(
    float *C,
    float const *A,
    float const *B,
    int M,
    int N,
    int K) {

    // Define the GEMM tile sizes - discussed in next section
    typedef block_task_policy <
        128,                        // BlockItemsY - Height in rows of a tile
        32,                         // BlockItemsX - Width in columns of a tile
        8,                          // ThreadItemsY - Height in rows of a thread-tile
        4,                          // ThreadItemsX - Width in columns of a thread-tile
        8,                          // BlockItemsK - Depth of a tile
        true,                       // UseDoubleScratchTiles - whether to double-buffer SMEM
        block_raster_enum::Default  // Block rasterization strategy
    > block_task_policy_t;

    // Define the epilogue functor
    typedef gemm::blas_scaled_epilogue<float, float, float> epilogue_op_t;

    // Define the block_task type.
    typedef block_task <
        block_task_policy_t,
        float,
        float,
        matrix_transform_t::NonTranspose,
        4,
        matrix_transform_t::NonTranspose,
        4,
        epilogue_op_t,
        4,
        true
    > block_task_t;

    // Declare statically-allocated shared storage
    __shared__ block_task_t::scratch_storage_t smem;

    // Construct and run the task
    block_task_t(
        reinterpret_cast(&smem),
        &smem,
        A,
        B,
        C,
        epilogue_op_t(1, 0),
        M,
        N,
        K).run();
}

This shows how to use it, but from it I cannot see the base-level implementation of the GEMM. I guess it must exist somewhere! But the GitHub page of CUTLASS is kind of… messy… I tried hard myself, but I really cannot find it…

Could anyone kindly provide me a link? Thank you!!!

================

By the way, I saw a “naive gemm” in the CUTLASS GitHub repository. Sorry, that is not what I want! Haha!

mnicely · June 14, 2022, 4:13pm

What exactly are you looking for?
Also, you may want to direct your questions to the CUTLASS GitHub, as it is monitored by the engineering team.

202476410arsmart · June 15, 2022, 3:36am

I am looking for the GEMM implementation. The example here only shows how to use the GEMM wrapper, but I want the real GEMM kernel. Thanks!

Robert_Crovella · June 15, 2022, 8:56pm

So, follow the path given to you, the one you have already shown: locate the .run() method.
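To illustrate why the .run() method is the place to look: the kernel in the first post only defines types, constructs a block_task_t, and calls .run() on it, so the loops must live inside that method. Below is a minimal host-only C++ sketch of that pattern. It is an analogy, not CUTLASS code: toy_policy, toy_block_task, and toy_gemm are hypothetical names, the matrices are assumed to be row-major, and there is no shared memory, threading, or double buffering.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical tile-size policy (NOT the CUTLASS block_task_policy).
template <int BlockItemsY, int BlockItemsX, int BlockItemsK>
struct toy_policy {
    static constexpr int kBlockItemsY = BlockItemsY;  // rows of one C tile
    static constexpr int kBlockItemsX = BlockItemsX;  // columns of one C tile
    static constexpr int kBlockItemsK = BlockItemsK;  // depth of one K step
};

// Hypothetical stand-in for block_task: the constructor only stores its
// arguments, and run() contains the actual GEMM loops.
template <typename Policy>
class toy_block_task {
public:
    toy_block_task(const float* A, const float* B, float* C,
                   float alpha, float beta, int M, int N, int K)
        : A_(A), B_(B), C_(C), alpha_(alpha), beta_(beta),
          M_(M), N_(N), K_(K) {}

    // Computes C = alpha * A * B + beta * C, one output tile at a time.
    // Assumed layout: A is M x K, B is K x N, C is M x N, all row-major.
    void run() {
        for (int m0 = 0; m0 < M_; m0 += Policy::kBlockItemsY) {
            for (int n0 = 0; n0 < N_; n0 += Policy::kBlockItemsX) {
                // Clamp the tile at the matrix edges.
                const int m1 = std::min(m0 + Policy::kBlockItemsY, M_);
                const int n1 = std::min(n0 + Policy::kBlockItemsX, N_);
                const int cols = n1 - n0;
                std::vector<float> acc(
                    static_cast<std::size_t>((m1 - m0) * cols), 0.0f);

                // Walk the K dimension in steps of BlockItemsK.
                for (int k0 = 0; k0 < K_; k0 += Policy::kBlockItemsK) {
                    const int k1 = std::min(k0 + Policy::kBlockItemsK, K_);
                    for (int m = m0; m < m1; ++m)
                        for (int n = n0; n < n1; ++n)
                            for (int k = k0; k < k1; ++k)
                                acc[(m - m0) * cols + (n - n0)] +=
                                    A_[m * K_ + k] * B_[k * N_ + n];
                }

                // Epilogue: scale the accumulators and blend with C.
                for (int m = m0; m < m1; ++m)
                    for (int n = n0; n < n1; ++n)
                        C_[m * N_ + n] =
                            alpha_ * acc[(m - m0) * cols + (n - n0)] +
                            beta_ * C_[m * N_ + n];
            }
        }
    }

private:
    const float* A_;
    const float* B_;
    float* C_;
    float alpha_, beta_;
    int M_, N_, K_;
};

// Mirrors the shape of the kernel in the first post: pick a policy,
// construct the task, call run(). Tile sizes here are arbitrary.
inline void toy_gemm(const float* A, const float* B, float* C,
                     int M, int N, int K) {
    typedef toy_policy<2, 2, 2> policy_t;
    toy_block_task<policy_t>(A, B, C, 1.0f, 0.0f, M, N, K).run();
}
```

The wrapper toy_gemm is as uninformative as the kernel in the first post; everything of interest is in run(). The same holds for the real block_task, whose definition can be found by following the type names used in the kernel.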

202476410arsmart · June 16, 2022, 5:34am

Well, I was actually looking for the whole code to run, and also the method… The good news is, I have found them! You just need to include the correct downloaded CUTLASS library and then compile the correct code, and you will get an exe file. That’s it!

I am replying because I think my answer can help future users and improve the community. Your reply was also excellent! I also got help from the GitHub page. Thank you!!!
