# Question on CUBLAS::sgemv implementation

Hello,

I am just learning CUDA and I am looking through the code for CUBLAS sgemv.cu for reference. When the code calculates the final dot product of a row, it does so in chunks of 6 elements at a time and then calculates the remaining elements:

``````while (jj < (jjLimit - 5)) {

sdot += parms.A[idx + 0*incr] * XX[jj+ 0];

sdot += parms.A[idx + 1*incr] * XX[jj+ 1];

sdot += parms.A[idx + 2*incr] * XX[jj+ 2];

sdot += parms.A[idx + 3*incr] * XX[jj+ 3];

sdot += parms.A[idx + 4*incr] * XX[jj+ 4];

sdot += parms.A[idx + 5*incr] * XX[jj+ 5];

jj   += 6;

idx  += 6 * incr;

}

while (jj < jjLimit) {

sdot += parms.A[idx + 0*incr] * XX[jj+ 0];

jj   += 1;

idx  += 1 * incr;

}
``````

Why does the implementation work in the chunks of 6? Why not just iterate through the second while loop through all the values?

Thanks!

I think that this is equivalent to register blocking

``````float regA[6];

float regx[6];

while (jj < (jjLimit - 5)) {

#pragma unroll

for(int j = 0 ; j < 6 ; j++){

regA[j] = parms.A[idx + j*incr];

regx[j] = XX[jj + j];

}

#pragma unroll

for(int j = 0 ; j < 6 ; j++){

sdot += regA[j] * regx[j] ;

}

jj   += 6;

idx  += 6 * incr;

}
``````

register blocking is a useful technique to hide memory latency.

you can refer to Volkov’s talk on GTC2010, Better Performance at Lower Occupancy

http://www.nvidia.com/object/gtc2010-presentation-archive.html

Ah I see. Thank you for the resources!