Calling cgemm functions

I would like to know if it is possible to write a sample code that calls cgemm functions directly. For example, I want to analyze cgemm_64_32_tn with a sample input.

I haven’t seen a guide for that. Any help?

Hi,

In this case you need to use cuBLAS.
Please refer to the link below for more details:
https://docs.nvidia.com/cuda/cublas/index.html
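
As a rough illustration, a minimal host-side sketch of a single-precision GEMM through the cuBLAS v2 API could look like the following (the matrix size, fill values, and omitted error checking are placeholders, not taken from the docs):

#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 1000;
    const size_t bytes = (size_t)n * n * sizeof(float);

    // Host inputs with arbitrary values (cuBLAS expects column-major storage).
    float *hA = (float*)malloc(bytes);
    float *hB = (float*)malloc(bytes);
    for (int i = 0; i < n * n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Device buffers for A, B and the result C.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // C = alpha*A*B + beta*C with no transposes; the internal kernel
    // (e.g. an sgemm *_nn variant) is selected by the library itself.
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB);
    return 0;
}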

Thanks

Thank you. You are right. I tried cublasSgemm and played with some inputs. For example, profiling the multiplication of two single-precision 1000x1000 matrices, with the result written into a third 1000x1000 matrix, gives the following statistics:

Kernel: volta_sgemm_128x32_nn
Invocations   Metric Name               Metric Description                             Min          Max          Avg
          1   dram_read_transactions    Device Memory Read Transactions              586589       586589       586589
          1   dram_write_transactions   Device Memory Write Transactions             483562       483562       483562
          1   flop_count_sp             Floating Point Operations(Single Precision)   2157969408   2157969408   2157969408
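
For reference, numbers of this kind can be gathered with an nvprof metrics run along these lines (the executable name here is just a placeholder):

nvprof --metrics dram_read_transactions,dram_write_transactions,flop_count_sp ./sgemm_test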

In total, (586589+483562)*32 bytes, or 34,244,832 bytes, are read from or written to DRAM (each DRAM transaction is 32 bytes).

With pencil and paper, we know each matrix contains 1000 x 1000 x 4 = 4,000,000 bytes. Two DRAM reads (A and B) and one DRAM write (C) would therefore give 12,000,000 bytes.
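
Putting the two figures side by side (using the 32-byte DRAM transaction size from above), the measured traffic is almost three times the ideal:

(586589 + 483562) * 32 / (3 * 1000 * 1000 * 4) = 34,244,832 / 12,000,000 ≈ 2.85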

Some difference is acceptable since the exact implementation of volta_sgemm_128x32_nn is unknown.
However, one could say that this amount of data movement points to an inefficient algorithm in volta_sgemm_128x32_nn.

Any comment on that?