Hello,
I am just learning CUDA and am reading through the CUBLAS source file sgemv.cu for reference. When the code computes the final dot product of a row, it processes 6 elements per iteration and then handles the remaining elements in a second loop:
    while (jj < (jjLimit - 5)) {
        sdot += parms.A[idx + 0*incr] * XX[jj+ 0];
        sdot += parms.A[idx + 1*incr] * XX[jj+ 1];
        sdot += parms.A[idx + 2*incr] * XX[jj+ 2];
        sdot += parms.A[idx + 3*incr] * XX[jj+ 3];
        sdot += parms.A[idx + 4*incr] * XX[jj+ 4];
        sdot += parms.A[idx + 5*incr] * XX[jj+ 5];
        jj += 6;
        idx += 6 * incr;
    }
    while (jj < jjLimit) {
        sdot += parms.A[idx + 0*incr] * XX[jj+ 0];
        jj += 1;
        idx += 1 * incr;
    }
Why does the implementation work in chunks of 6? Why not just use the second while loop to iterate over all of the values?
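To make the question concrete, here is a minimal standalone sketch in plain C of the two variants I am comparing. The function names dot_simple and dot_unrolled6 and their parameters are my own hypothetical stand-ins, not anything from CUBLAS; the second function only mirrors the structure of the loops quoted above.

```c
#include <assert.h>

/* Hypothetical simple version: one multiply-add per loop iteration. */
float dot_simple(const float *a, int incr, const float *x, int n)
{
    assert(n >= 0 && incr > 0);
    float sdot = 0.0f;
    int idx = 0;
    for (int jj = 0; jj < n; jj++) {
        sdot += a[idx] * x[jj];
        idx += incr;
    }
    return sdot;
}

/* Hypothetical unrolled version: 6 multiply-adds per iteration,
   followed by a remainder loop for the last (n % 6) elements. */
float dot_unrolled6(const float *a, int incr, const float *x, int n)
{
    assert(n >= 0 && incr > 0);
    float sdot = 0.0f;
    int idx = 0;
    int jj = 0;
    while (jj < n - 5) {
        sdot += a[idx + 0*incr] * x[jj + 0];
        sdot += a[idx + 1*incr] * x[jj + 1];
        sdot += a[idx + 2*incr] * x[jj + 2];
        sdot += a[idx + 3*incr] * x[jj + 3];
        sdot += a[idx + 4*incr] * x[jj + 4];
        sdot += a[idx + 5*incr] * x[jj + 5];
        jj += 6;
        idx += 6 * incr;
    }
    while (jj < n) {
        sdot += a[idx] * x[jj];
        jj += 1;
        idx += incr;
    }
    return sdot;
}
```

As far as I can tell, both functions add the same products in the same order, so the result should be the same. The only difference I can see is that the unrolled version evaluates the loop condition and updates jj and idx once per 6 multiply-adds instead of once per multiply-add.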
Thanks!
Chad