cublasSgemv performance question

I'm seeing a significant slowdown on a smaller-sized problem. Each case involved a single call to cublasSgemv, but notice that the gemvNSP kernel is used for the smaller problem. Any insight into what is going on, and how we can achieve better performance on the smaller problem? CUDA toolkit 8.0, GeForce GTX 550 Ti.

Large problem

X1 = 1000 x 100000 matrix
v1 = 100000 x 1 vector

Time(%)      Time     Calls       Avg       Min       Max  Name
 22.59%  22.772ms         1  22.772ms  22.772ms  22.772ms  void gemv2N_kernel_val<float, float, float, int=128, int=8, int=4, int=4, int=1>(float, float, cublasGemv2Params_v2<float, float, float>)

Small problem

X2 = 500 x 100000 matrix
v2 = 100000 x 1 vector

Time(%)      Time     Calls       Avg       Min       Max  Name
 41.71%  28.647ms         1  28.647ms  28.647ms  28.647ms  void gemvNSP_kernel_val<float, float, float, int=1, int=16, int=4, int=1024>(float, float, cublasGemv2Params_v2<float, float, float>)

I recommend filing a bug. Instructions are here:

https://devtalk.nvidia.com/default/topic/1044668/cuda-programming-and-performance/-how-to-report-a-bug/

If you do file a bug, please let me know what bug number the system assigns to it.

Also note that your device is a Fermi-class GPU, and the last CUDA toolkit to support Fermi is CUDA 8.0.

So even if the development team makes improvements, they would not be provided in CUDA toolkit 8.

Thanks Robert. Bug ID 2460283

Are the two SGEMV calls using the same transpose mode, and the same vector increments?

cublasStatus_t cublasSgemv(cublasHandle_t handle, 
                           cublasOperation_t trans,               // <<<<<<<<
                           int m, int n,
                           const float           *alpha,
                           const float           *A, 
                           int lda,
                           const float           *x, 
                           int incx,                              // <<<<<<<<<
                           const float           *beta,
                           float           *y, 
                           int incy)                              // <<<<<<<<<

GEMV is a memory-bound operation, and transpose mode and vector increments impact the resulting memory access patterns and thereby performance.

I was able to reproduce the observation in several different cases across CUDA 9 and CUDA 10, on a GTX 960 and a Tesla P100, using CUBLAS_OP_N throughout. It seems like a legitimate issue, which is why I suggested filing a bug.

Here is a GTX 960 / CUDA 9 example:

$ cat t377.cu
#include <cublas_v2.h>
#include <assert.h>

const int d1 = 100000;
#ifndef ROWS
#define ROWS 1000
#endif
const int d2 = ROWS;
int main(){

  float *X1, *v1, *y;
  cudaMalloc(&X1, d1*d2*sizeof(float));
  cudaMalloc(&v1, d1*sizeof(float));
  cudaMalloc(&y,  d2*sizeof(float));
  const float alpha = 1.0f;
  const float beta = 0.0f;
  cublasHandle_t h;
  cublasCreate(&h);
  cublasStatus_t stat = cublasSgemv(h, CUBLAS_OP_N, d2, d1, &alpha, X1, d2, v1, 1, &beta, y, 1);
  assert(stat == CUBLAS_STATUS_SUCCESS);
  cublasSgemv(h, CUBLAS_OP_N, d2, d1, &alpha, X1, d2, v1, 1, &beta, y, 1);
  cudaDeviceSynchronize();
}
$ nvcc -o t377 t377.cu -lcublas -DROWS=1000
$ nvprof ./t377
==8107== NVPROF is profiling process 8107, command: ./t377
==8107== Profiling application: ./t377
==8107== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.99%  11.817ms         2  5.9086ms  5.8425ms  5.9748ms  void gemv2N_kernel_val<float, float, float, int=128, int=8, int=4, int=4, int=1>(float, float, cublasGemv2Params_v2<float, float, float>)
                    0.01%  1.4720us         1  1.4720us  1.4720us  1.4720us  [CUDA memcpy HtoD]
      API calls:   73.75%  685.04ms         1  685.04ms  685.04ms  685.04ms  cudaFree
                   24.80%  230.39ms         6  38.398ms  23.754us  229.23ms  cudaMalloc
                    1.27%  11.790ms         1  11.790ms  11.790ms  11.790ms  cudaDeviceSynchronize
                    0.11%  1.0611ms       185  5.7350us     280ns  232.67us  cuDeviceGetAttribute
                    0.02%  166.03us         2  83.013us  78.396us  87.630us  cuDeviceTotalMem
                    0.02%  163.67us         2  81.835us  70.016us  93.655us  cuDeviceGetName
                    0.01%  106.89us         1  106.89us  106.89us  106.89us  cudaMemcpy
                    0.01%  100.25us         2  50.122us  21.209us  79.036us  cudaLaunch
                    0.00%  29.188us        16  1.8240us  1.1690us  6.4150us  cudaEventCreateWithFlags
                    0.00%  11.524us        11  1.0470us     545ns  5.0950us  cudaDeviceGetAttribute
                    0.00%  8.5100us         6  1.4180us     260ns  5.8550us  cudaSetupArgument
                    0.00%  7.3390us         4  1.8340us     350ns  5.1590us  cuDeviceGetCount
                    0.00%  3.6790us         1  3.6790us  3.6790us  3.6790us  cudaGetDevice
                    0.00%  3.6000us         3  1.2000us     540ns  2.4450us  cuDeviceGet
                    0.00%  2.4100us         2  1.2050us     505ns  1.9050us  cudaConfigureCall
                    0.00%  1.2840us         1  1.2840us  1.2840us  1.2840us  cuInit
                    0.00%  1.2800us         2     640ns     540ns     740ns  cudaGetLastError
                    0.00%     900ns         1     900ns     900ns     900ns  cuDriverGetVersion
$ nvcc -o t377 t377.cu -lcublas -DROWS=500
$ nvprof ./t377
==8156== NVPROF is profiling process 8156, command: ./t377
==8156== Profiling application: ./t377
==8156== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.99%  22.114ms         2  11.057ms  10.892ms  11.221ms  void gemv2N_kernel_val<float, float, float, int=128, int=32, int=4, int=4, int=1>(float, float, cublasGemv2Params_v2<float, float, float>)
                    0.01%  1.4080us         1  1.4080us  1.4080us  1.4080us  [CUDA memcpy HtoD]
      API calls:   72.18%  682.47ms         1  682.47ms  682.47ms  682.47ms  cudaFree
                   25.30%  239.23ms         6  39.871ms  23.219us  238.09ms  cudaMalloc
                    2.34%  22.089ms         1  22.089ms  22.089ms  22.089ms  cudaDeviceSynchronize
                    0.12%  1.1013ms       185  5.9520us     275ns  262.87us  cuDeviceGetAttribute
                    0.02%  182.88us         2  91.440us  70.462us  112.42us  cuDeviceGetName
                    0.02%  170.84us         2  85.420us  81.475us  89.365us  cuDeviceTotalMem
                    0.01%  100.56us         2  50.280us  21.879us  78.681us  cudaLaunch
                    0.01%  51.517us         1  51.517us  51.517us  51.517us  cudaMemcpy
                    0.00%  29.463us        16  1.8410us  1.0800us  7.1350us  cudaEventCreateWithFlags
                    0.00%  11.054us        11  1.0040us     540ns  4.7140us  cudaDeviceGetAttribute
                    0.00%  9.3050us         3  3.1010us     545ns  8.2000us  cuDeviceGet
                    0.00%  8.3950us         6  1.3990us     255ns  5.4200us  cudaSetupArgument
                    0.00%  7.5350us         4  1.8830us     325ns  5.0550us  cuDeviceGetCount
                    0.00%  4.1200us         1  4.1200us  4.1200us  4.1200us  cudaGetDevice
                    0.00%  3.2090us         2  1.6040us     485ns  2.7240us  cudaConfigureCall
                    0.00%  1.1050us         2     552ns     475ns     630ns  cudaGetLastError
                    0.00%  1.0350us         1  1.0350us  1.0350us  1.0350us  cuDriverGetVersion
                    0.00%  1.0050us         1  1.0050us  1.0050us  1.0050us  cuInit
$

In the above test case, the gemv kernel with 1000 rows takes ~6ms and the gemv kernel with 500 rows takes ~11ms.

The observation is essentially identical under CUDA 10.

Probably a bad or outdated heuristic for picking the kernel then.
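One thing that might be worth experimenting with while the bug is open (an untested sketch, not a confirmed fix): express the matrix-vector product as a GEMM with a single output column. cublasSgemm computes the same y = X1 * v1 when n = 1, and may be dispatched to a different kernel whose heuristics handle short, wide matrices better.

```cpp
// t377_gemm.cu -- hypothetical variant of t377.cu above that casts the
// matrix-vector product as a 1-column GEMM, to see whether a different
// (faster) kernel gets selected for the 500-row case. Untested sketch.
#include <cublas_v2.h>
#include <assert.h>

const int d1 = 100000;
#ifndef ROWS
#define ROWS 1000
#endif
const int d2 = ROWS;

int main(){
  float *X1, *v1, *y;
  cudaMalloc(&X1, d1*d2*sizeof(float));
  cudaMalloc(&v1, d1*sizeof(float));
  cudaMalloc(&y,  d2*sizeof(float));
  const float alpha = 1.0f;
  const float beta  = 0.0f;
  cublasHandle_t h;
  cublasCreate(&h);
  // y (d2 x 1) = X1 (d2 x d1) * v1 (d1 x 1), as a GEMM with n = 1
  cublasStatus_t stat = cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                                    d2, 1, d1,
                                    &alpha, X1, d2, v1, d1, &beta, y, d2);
  assert(stat == CUBLAS_STATUS_SUCCESS);
  cudaDeviceSynchronize();
  cublasDestroy(h);
  cudaFree(X1); cudaFree(v1); cudaFree(y);
  return 0;
}
```

Whether this actually picks a faster kernel would need to be confirmed with nvprof, the same way as above.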