Hi, I’m trying to multiply a dense 16000x8000 matrix by an 8000x1 vector, based on the matmulCUBLAS example, and I’m getting fairly poor performance compared to the example code. All I changed was the height and width of the matrices, to this:

matrix_size.uiWA = 8000;

matrix_size.uiHA = 16000;

matrix_size.uiWB = 1;

matrix_size.uiHB = 8000;

matrix_size.uiWC = 1;

matrix_size.uiHC = 16000;
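As a sanity check on the sizes above, here is a minimal sketch in C that verifies the six dimensions describe a valid product C = A * B. The `MatrixSizes` struct and the `sizes_are_consistent` helper are hypothetical stand-ins written for this post; only the field names are taken from the code above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the sample's size struct; only the field
   names are copied from the post (W = width/columns, H = height/rows). */
typedef struct {
    unsigned int uiWA, uiHA;
    unsigned int uiWB, uiHB;
    unsigned int uiWC, uiHC;
} MatrixSizes;

/* C = A * B is well defined when the inner dimensions match and
   C has the rows of A and the columns of B. */
bool sizes_are_consistent(const MatrixSizes *s)
{
    return s->uiWA == s->uiHB   /* columns of A == rows of B    */
        && s->uiHC == s->uiHA   /* rows of C    == rows of A    */
        && s->uiWC == s->uiWB;  /* columns of C == columns of B */
}
```

With the values posted above (8000, 16000, 1, 8000, 1, 16000) the check passes, so the sizes themselves look consistent.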

and the cublas call to this:

cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, matrix_size.uiHA, matrix_size.uiWB, matrix_size.uiWA, &alpha, d_A, matrix_size.uiHA, d_B, matrix_size.uiHB, &beta, d_C, matrix_size.uiHC);

It’s only showing about 3 GFLOP/s, while the 320x640 example code was reaching about 1.4 TFLOP/s. Am I doing something wrong, or is a matrix-vector product of this size just inherently slower?
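To put the two numbers in context, here is a rough sketch in C that counts the floating-point operations and the minimum memory traffic for a single-precision product with the sizes above. The function names are made up for illustration, and the byte count assumes each matrix is read or written exactly once, which is an idealization.

```c
/* FLOPs in C = A * B with A (m x k) and B (k x n):
   one multiply and one add per inner-product term. */
double gemm_flops(double m, double n, double k)
{
    return 2.0 * m * n * k;
}

/* Idealized minimum traffic in bytes for 4-byte floats:
   read A (m*k), read B (k*n), write C (m*n), each exactly once. */
double gemm_min_bytes(double m, double n, double k)
{
    return 4.0 * (m * k + k * n + m * n);
}

/* FLOPs performed per byte moved. */
double arithmetic_intensity(double m, double n, double k)
{
    return gemm_flops(m, n, k) / gemm_min_bytes(m, n, k);
}
```

For m = 16000, n = 1, k = 8000 this gives 256,000,000 FLOPs against roughly 512 MB of traffic, or about 0.5 FLOP per byte. Each element of A is used only once when B is a single column, so the product may be limited by memory bandwidth rather than arithmetic throughput, whereas a matrix-matrix product reuses each loaded element many times.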