# Too slow...why?

Hello to everyone,

I wrote the code below and it seems to be slow… Can you spot any bottleneck? Thanks!

[codebox]while ( (n < n_max+1) && ((norm_rMR/norm_r0) > tol) ) {
    n++;

    //- Calculate Lanczos vectors
    // alpha = v'*A*v;  v = v_hat/beta
    cublasSsbmv('U', N, K, 1.0f/beta, Ab, (K+1), v_hat, 1, 0.0f, sup0, 1); // sup0 = A*v
    alpha = cublasSdot(N, v_hat, 1, sup0, 1) * (1.0f/beta);               // alpha = dot(v, sup0)
    beta_old = beta;
    lanczos<<<numBlocks, threadsPerBlock>>>(sup0, v_old, v_hat, v, alpha, beta, N);
    beta = cublasSnrm2(N, v_hat, 1);                                      // beta = norm(v_hat)

    //- Calculate QR factors
    c_oold = c_old;
    c_old  = c;
    s_oold = s_old;
    s_old  = s;
    r1_hat = c_old*alpha - c_oold*s_old*beta_old;
    r1     = sqrt( r1_hat*r1_hat + beta*beta );
    r2     = s_old*alpha + c_oold*c_old*beta_old;
    r3     = s_oold*beta_old;

    //- Calculate new Givens rotation
    c = r1_hat/r1;
    s = beta/r1;

    //- Update solution
    update<<<numBlocks, threadsPerBlock>>>(w_oold, w_old, w, v, xMR, r1, r2, r3, c, eta, N);
    eta = -s*eta;
}

__global__ void update(float* w_oold, float* w_old, float* w, float* v, float* xMR,
                       float r1, float r2, float r3, float c, float eta, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        w_oold[i] = w_old[i];
        w_old[i]  = w[i];
        w[i]      = (v[i] - r3*w_oold[i] - r2*w_old[i]) / r1;
        xMR[i]    = xMR[i] + c*eta*w[i];
    }
}

__global__ void lanczos(float* Av, float* v_old, float* v_hat, float* v,
                        float alpha, float beta, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        v_old[i] = v[i];
        v[i]     = v_hat[i] / beta;
        v_hat[i] = Av[i] - alpha*v[i] - beta*v_old[i];
    }
}[/codebox]

And what is your block size and number of blocks?

Sorry, the number of blocks and size are:

100 threads per block wastes 22% of the capacity of the GPU. The warp size on all GPUs is 32, so threads per block should be a multiple of 32. 96 or 128 should be noticeably faster without doing anything else.

The other obvious thing to do would be to get rid of the sdot call and fold that operation into your own kernel.
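
For what it's worth, here is one way that fusion could look. Since alpha is needed inside the lanczos kernel for the v_hat update, the dot product has to finish before that update, so the kernel splits in two. This is only a sketch under assumptions: the kernel name and the device scalar d_alpha are hypothetical, d_alpha must be zeroed before each launch, and float atomicAdd on global memory requires compute capability 2.0 or later.

```cuda
// Sketch: normalize v and accumulate alpha = dot(v, A*v) in one pass,
// replacing the separate cublasSdot call. The v_hat update, which needs
// the finished alpha, stays in a second small kernel.
__global__ void normalize_and_dot(const float* Av, float* v_old, float* v_hat,
                                  float* v, float beta, int N, float* d_alpha)
{
    __shared__ float partial[128];                 // assumes blockDim.x <= 128
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float contrib = 0.0f;
    if (i < N) {
        v_old[i] = v[i];
        v[i]     = v_hat[i] / beta;
        contrib  = v[i] * Av[i];                   // one term of dot(v, A*v)
    }
    partial[threadIdx.x] = contrib;
    __syncthreads();

    // Tree reduction inside the block, then one atomic per block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(d_alpha, partial[0]);            // needs compute capability >= 2.0
}
```

On older hardware without float atomics, each block could instead write partial[0] to a per-block output array, finished off by a second small reduction.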

Thank you very much! The block size was a mistake :( … but I had already planned to move sdot inside my kernel.

Let me know if you find any other possible bottlenecks.

Thanks

You need to know which function takes the most time. Use the profiler, or measure with timers.
I noticed a division that could be replaced by a multiplication, but I am not sure it is the bottleneck.

This is what the profiler shows using 128 threads per block…

[codebox]
Method           #calls   GPU us    %GPU time   throughput
ssbmvu_main        11     729.28     59.83      0.384969
snrm2_gld_main     11     189.568    15.55      0.601414
sdot_gld_main      10      67.52      5.54      0.819431
update             10      44.736     3.67      0.101733
lanczos            10      39.424     3.23      0.114989
scopy_gld_main      2       7.424     0.6       0.163793
memcpyHtoD          3      46.72      3.83
memcpyDtoH         21      94.048     7.71
[/codebox]

Someone with more experience with the Visual Profiler should say, but as far as I can see the first two library functions are the bottleneck: cublasSsbmv, and snrm2_gld_main, which I do not recognize. It looks like a CUBLAS issue. You need to check the performance of these functions, and check that you linked them the right way. I suspect snrm2_gld_main may be a "gold" (reference) implementation.

You can try to implement cublasSsbmv yourself; maybe you can do better for your particular task.

cublasSsbmv performs a matrix-vector product and it is a CUBLAS function, so I hope it is optimal!

snrm2_gld_main, on the other hand, is just a vector norm, and it takes too much time… but it is a CUBLAS function as well!

I thought CUBLAS was efficient :( !

Should I implement every function myself??

Maybe your CUBLAS is broken somehow. Try checking the performance of these functions stand-alone. And why do you think they are slow? Matrix-vector multiplication should be fast; the only thing I can suggest is that the matrix may have a bad column layout. By the way, are you sure you are using the float routine?
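
A stand-alone check could be as simple as timing the call with CUDA events. This is a sketch, assuming cublasInit() has already been called and the device buffers (d_Ab, d_x, d_y) are allocated and filled as in the original code; the function name and nIters parameter are made up for illustration.

```cuda
#include <cublas.h>   // legacy CUBLAS API, as used in the posted code

// Sketch: time one CUBLAS call in isolation with CUDA events.
// Averaging over nIters calls smooths out launch overhead.
float time_ssbmv(int N, int K, const float* d_Ab, const float* d_x,
                 float* d_y, int nIters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int it = 0; it < nIters; ++it)
        cublasSsbmv('U', N, K, 1.0f, d_Ab, K + 1, d_x, 1, 0.0f, d_y, 1);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);         // wait until the GPU has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / nIters;                 // average milliseconds per call
}
```

Comparing this number against the same N and K with your own kernel (or with a dense cublasSgemv) would show whether the banded routine itself is the problem.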