I was able to reproduce the observation in several different cases, across CUDA 9 and CUDA 10, on a GTX 960 and a Tesla P100, using CUBLAS_OP_N throughout. It seems like a legitimate issue, which is why I suggested filing a bug.
Here is a GTX 960/CUDA 9 example:
$ cat t377.cu
#include <cublas_v2.h>
#include <assert.h>

const int d1 = 100000;   // number of columns
#ifndef ROWS
#define ROWS 1000
#endif
const int d2 = ROWS;     // number of rows, set with -DROWS=...

int main(){
  float *X1, *v1, *y;
  cudaMalloc(&X1, d1*d2*sizeof(float));  // d2 x d1 matrix (column-major)
  cudaMalloc(&v1, d1*sizeof(float));     // input vector, length d1
  cudaMalloc(&y,  d2*sizeof(float));     // output vector, length d2
  const float alpha = 1.0f;
  const float beta = 0.0f;
  cublasHandle_t h;
  cublasCreate(&h);
  // y = alpha * X1 * v1 + beta * y, no transpose, leading dimension d2
  cublasStatus_t stat = cublasSgemv(h, CUBLAS_OP_N, d2, d1, &alpha, X1, d2, v1, 1, &beta, y, 1);
  assert(stat == CUBLAS_STATUS_SUCCESS);
  // launch a second time so the profile shows two identical kernel calls
  cublasSgemv(h, CUBLAS_OP_N, d2, d1, &alpha, X1, d2, v1, 1, &beta, y, 1);
  cudaDeviceSynchronize();
}
$ nvcc -o t377 t377.cu -lcublas -DROWS=1000
$ nvprof ./t377
==8107== NVPROF is profiling process 8107, command: ./t377
==8107== Profiling application: ./t377
==8107== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.99%  11.817ms         2  5.9086ms  5.8425ms  5.9748ms  void gemv2N_kernel_val<float, float, float, int=128, int=8, int=4, int=4, int=1>(float, float, cublasGemv2Params_v2<float, float, float>)
                    0.01%  1.4720us         1  1.4720us  1.4720us  1.4720us  [CUDA memcpy HtoD]
      API calls:   73.75%  685.04ms         1  685.04ms  685.04ms  685.04ms  cudaFree
                   24.80%  230.39ms         6  38.398ms  23.754us  229.23ms  cudaMalloc
                    1.27%  11.790ms         1  11.790ms  11.790ms  11.790ms  cudaDeviceSynchronize
                    0.11%  1.0611ms       185  5.7350us     280ns  232.67us  cuDeviceGetAttribute
                    0.02%  166.03us         2  83.013us  78.396us  87.630us  cuDeviceTotalMem
                    0.02%  163.67us         2  81.835us  70.016us  93.655us  cuDeviceGetName
                    0.01%  106.89us         1  106.89us  106.89us  106.89us  cudaMemcpy
                    0.01%  100.25us         2  50.122us  21.209us  79.036us  cudaLaunch
                    0.00%  29.188us        16  1.8240us  1.1690us  6.4150us  cudaEventCreateWithFlags
                    0.00%  11.524us        11  1.0470us     545ns  5.0950us  cudaDeviceGetAttribute
                    0.00%  8.5100us         6  1.4180us     260ns  5.8550us  cudaSetupArgument
                    0.00%  7.3390us         4  1.8340us     350ns  5.1590us  cuDeviceGetCount
                    0.00%  3.6790us         1  3.6790us  3.6790us  3.6790us  cudaGetDevice
                    0.00%  3.6000us         3  1.2000us     540ns  2.4450us  cuDeviceGet
                    0.00%  2.4100us         2  1.2050us     505ns  1.9050us  cudaConfigureCall
                    0.00%  1.2840us         1  1.2840us  1.2840us  1.2840us  cuInit
                    0.00%  1.2800us         2     640ns     540ns     740ns  cudaGetLastError
                    0.00%     900ns         1     900ns     900ns     900ns  cuDriverGetVersion
$ nvcc -o t377 t377.cu -lcublas -DROWS=500
$ nvprof ./t377
==8156== NVPROF is profiling process 8156, command: ./t377
==8156== Profiling application: ./t377
==8156== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.99%  22.114ms         2  11.057ms  10.892ms  11.221ms  void gemv2N_kernel_val<float, float, float, int=128, int=32, int=4, int=4, int=1>(float, float, cublasGemv2Params_v2<float, float, float>)
                    0.01%  1.4080us         1  1.4080us  1.4080us  1.4080us  [CUDA memcpy HtoD]
      API calls:   72.18%  682.47ms         1  682.47ms  682.47ms  682.47ms  cudaFree
                   25.30%  239.23ms         6  39.871ms  23.219us  238.09ms  cudaMalloc
                    2.34%  22.089ms         1  22.089ms  22.089ms  22.089ms  cudaDeviceSynchronize
                    0.12%  1.1013ms       185  5.9520us     275ns  262.87us  cuDeviceGetAttribute
                    0.02%  182.88us         2  91.440us  70.462us  112.42us  cuDeviceGetName
                    0.02%  170.84us         2  85.420us  81.475us  89.365us  cuDeviceTotalMem
                    0.01%  100.56us         2  50.280us  21.879us  78.681us  cudaLaunch
                    0.01%  51.517us         1  51.517us  51.517us  51.517us  cudaMemcpy
                    0.00%  29.463us        16  1.8410us  1.0800us  7.1350us  cudaEventCreateWithFlags
                    0.00%  11.054us        11  1.0040us     540ns  4.7140us  cudaDeviceGetAttribute
                    0.00%  9.3050us         3  3.1010us     545ns  8.2000us  cuDeviceGet
                    0.00%  8.3950us         6  1.3990us     255ns  5.4200us  cudaSetupArgument
                    0.00%  7.5350us         4  1.8830us     325ns  5.0550us  cuDeviceGetCount
                    0.00%  4.1200us         1  4.1200us  4.1200us  4.1200us  cudaGetDevice
                    0.00%  3.2090us         2  1.6040us     485ns  2.7240us  cudaConfigureCall
                    0.00%  1.1050us         2     552ns     475ns     630ns  cudaGetLastError
                    0.00%  1.0350us         1  1.0350us  1.0350us  1.0350us  cuDriverGetVersion
                    0.00%  1.0050us         1  1.0050us  1.0050us  1.0050us  cuInit
$
In the above test case, the gemv kernel with 1000 rows takes ~6ms while the gemv kernel with 500 rows takes ~11ms: halving the row count roughly doubles the kernel time, even though the smaller case does half the work. Note also that CUBLAS picks a different template instantiation of gemv2N_kernel_val in each case (the second int template parameter is 8 vs. 32).
The observation is essentially identical under CUDA 10.
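For anyone who wants to confirm the numbers without nvprof, here is a minimal sketch that times the same Sgemv call with CUDA events. The event-based timing harness (and the t377_time.cu name) is my addition for illustration, not part of the original repro:

// t377_time.cu -- hypothetical event-timing variant of the repro above
#include <cublas_v2.h>
#include <cstdio>

#ifndef ROWS
#define ROWS 1000
#endif

int main(){
    const int d1 = 100000;
    const int d2 = ROWS;                     // pass -DROWS=1000 or -DROWS=500
    float *X1, *v1, *y;
    cudaMalloc(&X1, d1*d2*sizeof(float));
    cudaMalloc(&v1, d1*sizeof(float));
    cudaMalloc(&y,  d2*sizeof(float));
    const float alpha = 1.0f, beta = 0.0f;
    cublasHandle_t h;
    cublasCreate(&h);
    // warm-up launch so one-time library setup is excluded from the timing
    cublasSgemv(h, CUBLAS_OP_N, d2, d1, &alpha, X1, d2, v1, 1, &beta, y, 1);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cublasSgemv(h, CUBLAS_OP_N, d2, d1, &alpha, X1, d2, v1, 1, &beta, y, 1);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // wait for the kernel to finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time between events, in ms
    printf("rows = %d, Sgemv kernel time = %.3f ms\n", d2, ms);
    cublasDestroy(h);
    cudaFree(X1); cudaFree(v1); cudaFree(y);
}

Compiled the same way as the repro (nvcc -o t377_time t377_time.cu -lcublas -DROWS=...), this should print a kernel time close to what nvprof reports above for each ROWS value.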