I was trying to build a simple linear regression application. It uses only two SGEMM calls from cuBLAS.
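For concreteness, here is a minimal sketch of the kind of two-GEMM formulation I mean, using the normal-equations form of least squares (the shapes and function name here are just illustrative, not my actual code):

#include <cublas_v2.h>

/* Normal-equations pieces of least squares: G = X^T X and R = X^T Y,
 * each a single SGEMM call. X is n x d, Y is n x t, both column-major
 * (plain cuBLAS convention; my real code goes through the row-major
 * wrappers described below). All pointers are device pointers. */
void normal_equation_gemms(cublasHandle_t handle, int n, int d, int t,
                           const float *X, const float *Y,
                           float *G,  /* d x d output */
                           float *R)  /* d x t output */
{
    const float one = 1.0f, zero = 0.0f;

    /* G = X^T * X  (d x d) */
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                d, d, n, &one, X, n, X, n, &zero, G, d);

    /* R = X^T * Y  (d x t) */
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                d, t, n, &one, X, n, Y, n, &zero, R, d);
}

Note that in regression X is typically tall and skinny (n >> d); I suspect that shape matters for the achieved FLOPS compared to large square matrices, but I'm not sure how much.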
I noticed in nvvp that the performance is not great: only around 40% of peak FLOPS is achieved. When I benchmarked cuBLAS separately (with some other data), the results were much better. What can I do to improve the performance?
FWIW, I wrote wrappers to keep the data in row-major layout, i.e., if I copy the data back from the device after the calculation, it is already in row-major order (done by exchanging the order of the matrices when calling gemm, as sketched below).
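Here is a minimal sketch of that trick (the wrapper name and signature are mine, just for illustration). cuBLAS assumes column-major storage, and the column-major view of a row-major matrix is its transpose, so computing C^T = B^T * A^T by swapping the operand order yields a row-major C with no explicit transposes or copies:

#include <cublas_v2.h>

/* C = A * B with A (MxK), B (KxN), C (MxN) all ROW-major device arrays.
 * Implemented as the column-major product C^T = B^T * A^T: note the
 * swapped operand order (B before A) and the swapped m/n dimensions. */
cublasStatus_t sgemm_row_major(cublasHandle_t handle,
                               int M, int N, int K,
                               const float *alpha,
                               const float *A, const float *B,
                               const float *beta, float *C)
{
    return cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       N, M, K,
                       alpha,
                       B, N,   /* ld of row-major KxN is N */
                       A, K,   /* ld of row-major MxK is K */
                       beta,
                       C, N);  /* ld of row-major MxN is N */
}

Since both operands stay CUBLAS_OP_N, I wouldn't expect this reordering itself to cost anything, but I mention it in case it matters.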
I also noticed that when I change the matrix size, the kernel name changes from magma_lds128_… to sgemm_sm35_ldg_nn_64x16x64x16x16. Should this concern me?
Thanks a lot for your answers!