Large Inner Dimension Matmul kernel help required

Hey all,
I’m having trouble reaching anywhere near peak for the following problem

let X = matrix(10^6, 100)
let v = matrix(10^6, 2)
u = X’ * v
or in other words, a matmul kernel with a dominant inner dimension. My current approach is to assign one block per column of X and have each block compute the corresponding dot product. This gets me to about 50% of peak throughput on a K20x and 65% on a 2090. That already far surpasses the stock BLAS gemv code (around 5% of peak), but I'm hoping some of you can help me squeeze out a bit more performance, in particular using the new K20x functionality.
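To pin down the semantics: everything is stored column-major (each column contiguous), so this is the plain CPU reference I check the GPU result against. It's just a sketch; the function name is mine, not from my real host code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference for u = X' * v (hypothetical helper, for validation only).
// X: dimx x dimy, column-major (each column of length dimx contiguous);
// v: dimx x dimz, same layout; u: dimy x dimz, column-major.
void matTvecRef(const std::vector<double>& X,
                const std::vector<double>& v,
                std::vector<double>& u,
                std::size_t dimx, std::size_t dimy, std::size_t dimz)
{
    for (std::size_t k = 0; k < dimz; ++k)
        for (std::size_t j = 0; j < dimy; ++j) {
            double sum = 0.0;
            // dot product of column j of X with column k of v
            for (std::size_t i = 0; i < dimx; ++i)
                sum += X[j * dimx + i] * v[k * dimx + i];
            u[j + dimy * k] = sum;
        }
}
```

So u comes out 100 x 2, and each of the 200 entries is a length-10^6 dot product.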

Here’s the current code (which, as you may have noticed, is adapted from the threadFenceReduction kernel in the CUDA toolkit samples):

template <unsigned int blockSize>
__global__ void
reduceSinglePass(const double *g_idata, double *g_odata, const double * __restrict g_vecdata, unsigned int dimx, unsigned int dimy, unsigned int dimz)
{
    extern __shared__ double sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int read = 0;
    unsigned int off1  = blockIdx.x * dimx + threadIdx.x;
    unsigned int off2  = blockIdx.y * dimx + threadIdx.x;
    unsigned int grid = 2 * blockSize; // stride: each loop iteration consumes two blockSize-sized chunks
    double mySum = 0;
    while (read + tid < dimx)
    {
        mySum += g_idata[read + off1] * g_vecdata[read + off2];
        if (read + blockSize + tid < dimx)
        {
            mySum += g_idata[read + off1 + blockSize] * g_vecdata[read + off2 + blockSize];
        }
        read += grid;
    }
    sdata[tid] = mySum;
    __syncthreads();

    if (blockSize >= 512)
    {
        if (tid < 256)
        {
            sdata[tid] = mySum = mySum + sdata[tid + 256];
        }
        __syncthreads();
    }
    if (blockSize >= 256)
    {
        if (tid < 128)
        {
            sdata[tid] = mySum = mySum + sdata[tid + 128];
        }
        __syncthreads();
    }
    if (blockSize >= 128)
    {
        if (tid <  64)
        {
            sdata[tid] = mySum = mySum + sdata[tid +  64];
        }
        __syncthreads();
    }
    volatile double *vdata = sdata;
    if (tid < 32)
    {
        vdata[tid] = mySum = mySum + vdata[tid + 32];
        vdata[tid] = mySum = mySum + vdata[tid + 16];
        vdata[tid] = mySum = mySum + vdata[tid +  8];
        vdata[tid] = mySum = mySum + vdata[tid +  4];
        vdata[tid] = mySum = mySum + vdata[tid +  2];
        vdata[tid] = mySum = mySum + vdata[tid +  1];
    }

    if (tid == 0) g_odata[blockIdx.x + dimy * blockIdx.y] = sdata[0];
}
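Since I mention the new K20x functionality: one thing I'm considering is replacing the volatile shared-memory tail with Kepler's warp shuffle instructions. A sketch of what I have in mind (assumes compute capability 3.x; a double is shuffled as two 32-bit halves, since __shfl_down only takes int/float):

```cuda
// Sketch only: warp-level sum using Kepler shuffle, as a possible
// replacement for the volatile shared-memory reduction above.
__device__ double shflDownDouble(double x, int offset)
{
    // split the double into two 32-bit halves, shuffle each, reassemble
    int hi = __double2hiint(x);
    int lo = __double2loint(x);
    hi = __shfl_down(hi, offset);
    lo = __shfl_down(lo, offset);
    return __hiloint2double(hi, lo);
}

__device__ double warpReduceSum(double x)
{
    // butterfly-style reduction within one warp of 32 threads
    for (int offset = 16; offset > 0; offset >>= 1)
        x += shflDownDouble(x, offset);
    return x;
}
```

No idea yet whether this actually beats the shared-memory version here, since the kernel is bandwidth-bound; happy to hear from anyone who has measured it.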

here dimx=10^6, dimy=100, dimz=2
g_idata = X (not the transpose)
g_vecdata = v
g_odata = u
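For completeness, a minimal sketch of how I launch it (assuming blockSize is the usual compile-time template parameter from the SDK example, set to 512 here; the d_* names are illustrative device pointers):

```cuda
// One block per (column of X, column of v) pair: a 100 x 2 grid.
const unsigned int dimx = 1000000, dimy = 100, dimz = 2;
const unsigned int blockSize = 512;
dim3 grid(dimy, dimz);
size_t smem = blockSize * sizeof(double); // dynamic shared memory for the reduction
reduceSinglePass<blockSize><<<grid, blockSize, smem>>>(d_X, d_u, d_v, dimx, dimy, dimz);
```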

If anyone’s looking to make some spare change (not necessarily just spare change :P), I’m happy to pay for any consultation services offered, and I refer you to my job posting:
https://devtalk.nvidia.com/default/topic/532890/gpu-computing-jobs/custom-built-matmul-kernels-for-k20x/