I have the following code, which computes d_s such that d_x = d_Q * d_s by iterating these two equations:
d_e = d_x - d_Q * d_s
d_s = d_s + alpha * d_W * d_e
where d_W is the transpose of d_Q.
for (i = 0; i < 6; ++i) {
    cutStartTimer(timer);
    cublasScopy(DIMX, d_x, 1, d_e, 1);                                       /* e = x          */
    cublasSgemv('N', DIMX, DIMS, -1.0f, d_Q, DIMX, d_s, 1, 1.0f, d_e, 1);    /* e = x - Q*s    */
    cublasSgemv('N', DIMS, DIMX, alpha, d_W, DIMS, d_e, 1, 1.0f, d_s, 1);    /* s += alpha*W*e */
    cutStopTimer(timer);
}
If the iteration count is below six, the average processing time for the three function calls is around 0.05 ms. When I increase the iteration count to 100, the average becomes 1.5 ms per iteration (DIMX = 16*1024, DIMS = 16*16).
If I omit the third CUBLAS call, the slowdown is much smaller.
I'm using a GeForce 8800 GTX with clock and memory speeds reduced to their minimum; the same slowdown occurs if I reset them to the default values. I also use the card for the display, but as far as I know that only imposes the 5-second kernel watchdog limit.
Do you have any idea why this slowdown happens when I increase the iteration count?