Question on CUBLAS::sgemv implementation

Hello,

I am just learning CUDA and I am looking through the code for CUBLAS sgemv.cu for reference. When the code calculates the final dot product of a row, it does so in chunks of 6 elements at a time and then calculates the remaining elements:

while (jj < (jjLimit - 5)) {

    sdot += parms.A[idx + 0*incr] * XX[jj+ 0];

    sdot += parms.A[idx + 1*incr] * XX[jj+ 1];

    sdot += parms.A[idx + 2*incr] * XX[jj+ 2];

    sdot += parms.A[idx + 3*incr] * XX[jj+ 3];

    sdot += parms.A[idx + 4*incr] * XX[jj+ 4];

    sdot += parms.A[idx + 5*incr] * XX[jj+ 5];

    jj   += 6;

    idx  += 6 * incr;

}

while (jj < jjLimit) {

    sdot += parms.A[idx + 0*incr] * XX[jj+ 0];

    jj   += 1;

    idx  += 1 * incr;

}

Why does the implementation work in the chunks of 6? Why not just iterate through the second while loop through all the values?

Thanks!

Chad

I think that this is equivalent to register blocking

float regA[6];

float regx[6];

while (jj < (jjLimit - 5)) {

    #pragma unroll

    for(int j = 0 ; j < 6 ; j++){

        regA[j] = parms.A[idx + j*incr];

        regx[j] = XX[jj + j];     

    } 

    #pragma unroll

    for(int j = 0 ; j < 6 ; j++){

        sdot += regA[j] * regx[j] ; 

    }      

    jj   += 6;    

    idx  += 6 * incr;

}

register blocking is a useful technique to hide memory latency.

you can refer to Volkov’s talk on GTC2010, Better Performance at Lower Occupancy

Ah I see. Thank you for the resources!