How can I improve this code? It currently only halves the run time of the same code in MATLAB. Thanks!

I am using cuBLAS to do some computation. The vectors I tried are in the range of 1000-3000 elements, and the matrices 1000^2-3000^2 elements. The main program is a loop: when the difference between two vectors (which are updated during each iteration) drops below some tolerance, it exits the loop. I am using a GT9400 now, and for vector size 1024 the elapsed time of the following code is 0.0160 s, while the same code in MATLAB takes 0.031 s. I would hope the GPU gives more improvement than that.

The program starts by computing some matrices:

A*B = Cb
D*E*F*G = Cf
// A, B, D, E, F, G only depend on the size of the vector. After I get Cb and Cf, these are destroyed.
Since Cf and Cb are called in the loop many times, I was thinking of declaring them as __constant__ float*, but it turns out that the arguments of the cuBLAS functions cannot be constant variables, so I have to declare them as plain float*.
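A minimal sketch of keeping Cf and Cb resident on the device as plain float* buffers, using the legacy cuBLAS helpers cublasAlloc/cublasSetMatrix (the host-side names h_Cf and h_Cb are hypothetical):

float *d_Cf = 0, *d_Cb = 0;
// allocate the two (N+1)x(N+1) matrices on the device once, before the loop
cublasAlloc((N+1)*(N+1), sizeof(float), (void**)&d_Cf);
cublasAlloc((N+1)*(N+1), sizeof(float), (void**)&d_Cb);
// upload the precomputed host matrices; after this they stay on the GPU
cublasSetMatrix(N+1, N+1, sizeof(float), h_Cf, N+1, d_Cf, N+1);
cublasSetMatrix(N+1, N+1, sizeof(float), h_Cb, N+1, d_Cb, N+1);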

C*m=Yold;

// The following is the loop, with the kernel functions defined at the end. All variables whose names start with d_ are device variables, as cuBLAS requires. Keeping them on the device also reduces the time spent transferring data back to the CPU.

cudaThreadSynchronize();
start_time=clock();
for(iterations=1;iterations<maxIteration;iterations++){
FunctionEvaluation<<<Block_Num,Thread_Num,0>>>(d_Yold,d_f);
// This function evaluates f element-wise on the vector: d_f(i)=f(d_Yold(i))

    cublasSgemv('n', (N+1), (N+1), alpha, d_Cf, (N+1), d_f, 1, beta, d_B, 1);  //Cf*f=B
         
    ICcorrected<<<1,1,0>>>(d_B,d_f);  // correct for the initial conditions; d_f receives the corrected coefficients

// I need to add a constant to the first element of B (B[0]). By using one kernel to do this, I do not need to copy the data back to the CPU, but this does not sound like an optimal way.

    cublasSgemv('n', (N+1), (N+1), alpha, d_Cb, (N+1), d_f, 1, beta, d_Ynew, 1);  //Cb*f=Ynew

    yDifference<<<Block_Num,Thread_Num,0>>>(d_Ynew,d_Yold,d_e);

//calculate the difference between two vectors

    h_e2=cublasSdot (N+1, d_e, 1,d_e, 1); 

// calculate the norm of the vector; if it is less than the tolerance, exit the for loop. h_e2 is a variable on the CPU, since cublasSdot is defined to return its result there. Will this degrade the performance? I would like some way for the kernel code to run the loop itself and only exit on the condition, but I do not know how to do this…
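// An aside (not part of the original post): with the legacy cuBLAS API used
// here, cublasSnrm2 would compute the 2-norm in a single call, e.g.
//     h_e = cublasSnrm2(N+1, d_e, 1);
// though the scalar still has to come back to the host for the test below.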

   if(abs(h_e2)<deltaLimit){
        break; // exit the loop
    }
    else{
          EvaluateArray<<<1,1,0>>>(d_B,d_f);       // copy f into B, element by element
          EvaluateArray<<<1,1,0>>>(d_Yold,d_Ynew); // copy Ynew into Yold, element by element. I tried using one kernel function instead of these two, but it turned out to require more time. There should be a good way to improve this part, but I am not sure…
    }
}

status = cublasGetVector(N+1, sizeof(h_Ynew[0]), d_Ynew, 1, h_Ynew, 1);
cudaThreadSynchronize();
end_time=clock();
timeelapsed=double(end_time- start_time)/CLK_TCK;
// I think this is the time used for the overall loop.
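As a possible improvement to the two EvaluateArray copies above (a sketch, not from the original post): when the old contents of d_Yold are no longer needed, swapping the device pointers avoids moving any data at all.

float *tmp = d_Yold;  // swap the roles of the two buffers
d_Yold = d_Ynew;      // the next iteration reads what was just written
d_Ynew = tmp;         // ...and overwrites the stale buffer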

__global__ static void FunctionEvaluation(float *y, float *f)
{
    const int tid=threadIdx.x;
    const int bid=blockIdx.x;
    float time;
    time=cos((bid*Thread_Num+tid)*PI/N);
    f[bid*Thread_Num+tid]=H/2*cos(H/2*time+H/2+epsilon*y[bid*Thread_Num+tid]);
}

__global__ static void ICcorrected(float *Bold, float *Bnew)
{
    int i;
    for(i=0;i<=N;i++)
        Bnew[i]=Bold[i];
    Bnew[0]=Bold[0]+2*y0;
}

__global__ static void yDifference(float *Y1, float *Y2, float *e)
{
const int tid=threadIdx.x;
const int bid=blockIdx.x;

e[bid*Thread_Num +tid]=Y1[bid*Thread_Num +tid]-Y2[bid*Thread_Num +tid];

}

__global__ static void EvaluateArray(float *Y1, float *Y2)
{
int i;
for(i=0;i<=N;i++){
Y1[i]=Y2[i]; }
}

You have many BLAS1 (vector-vector) operations, and such operations are hard to make perform well: their upper bound is the memory bandwidth of your card, and the gain factor is the ratio between your card's bandwidth and the bandwidth of the FSB.

You say "I am using GT9400 now, and for vector size 1024". According to the spec of the GT9400, it has:

  1. two multiprocessors (16 cores)

  2. 12.8 GB/s memory bandwidth

In fact, the bandwidth of the GT9400 is not good; you can run bandwidthTest in the SDK to find out the practical bandwidth.

Suppose you reach the maximum bandwidth of 12.8 GB/s and the FSB has 2 GB/s of bandwidth; then you have a 6.4x speedup.

I believe that cuBLAS is good, but it depends on the data size.

You can calibrate the performance of cuBLAS for BLAS1 on your GT9400 and then profile your kernel functions.

I think that you need to optimize your kernel functions.

For example:

(1) FunctionEvaluation<<<Block_Num,Thread_Num,0>>>(d_Yold,d_f);

[codebox]// H_over_2  = H/2  is pre-computed in host code
// pi_over_N = PI/N is pre-computed in host code
__global__ static void FunctionEvaluation(float *y, float *f,
                                          float H_over_2, float pi_over_N)
{
    int index = blockIdx.x*Thread_Num + threadIdx.x;
    float time = cos( index * pi_over_N );
    f[index] = H_over_2 * cos( H_over_2*time + H_over_2 + epsilon*y[index] );
}[/codebox]
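The corresponding host-side setup might look like this sketch (Block_Num and Thread_Num as in the original code; the variable names are illustrative):

[codebox]float H_over_2  = H / 2.0f;        // computed once, outside the loop
float pi_over_N = PI / (float)N;   // computed once, outside the loop
FunctionEvaluation<<<Block_Num, Thread_Num>>>(d_Yold, d_f, H_over_2, pi_over_N);[/codebox]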

(2) ICcorrected<<<Block_Num,Thread_Num,0>>>(d_B,d_f); // correct for the initial conditions; d_f receives the corrected coefficients

[codebox]__global__ static void ICcorrected(float *Bold, float *Bnew)
{
    int i = blockIdx.x*Thread_Num + threadIdx.x;
    if ( i <= N ){
        if ( 0 == i ){
            Bnew[0] = Bold[0] + 2*y0;
        }else{
            Bnew[i] = Bold[i];
        }
    }
}[/codebox]

(3) yDifference<<<Block_Num,Thread_Num,0>>>(d_Ynew,d_Yold,d_e);

[codebox]__global__ static void yDifference(float *Y1, float *Y2, float *e)
{
    int index = blockIdx.x*Thread_Num + threadIdx.x;
    if ( index <= N ){   // same boundary check as the other kernels
        e[index] = Y1[index] - Y2[index];
    }
}[/codebox]

(4) EvaluateArray<<<Block_Num,Thread_Num,0>>>(d_B,d_f);

EvaluateArray<<<Block_Num,Thread_Num,0>>>(d_Yold,d_Ynew);

[codebox]__global__ static void EvaluateArray(float *Y1, float *Y2)
{
    int i = blockIdx.x*Thread_Num + threadIdx.x;
    if ( i <= N ){
        Y1[i] = Y2[i];
    }
}[/codebox]
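As a side note (not in the original replies), a plain device-to-device cudaMemcpy does the same copy without a custom kernel; a sketch using the buffers from the original code:

[codebox]cudaMemcpy(d_Yold, d_Ynew, (N+1)*sizeof(float), cudaMemcpyDeviceToDevice);[/codebox]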

Hello Lung Sheng Chien,

Thank you so much for your generous reply! This is my first real CUDA code, and it looks like I need to study more. One piece of good news: after I optimized my code as you recommended, the time of my GPU code is 28.44% of the MATLAB code's. A big improvement, although I hope I can do more :-) I have some questions and would like to hear your suggestions! Thanks!

  1. After I ran the bandwidthTest code, this is what I got: host-to-device: 1496.7 MB/s; device-to-host: 1496.7 MB/s; device-to-device: 8132.5 MB/s. I am not sure how to get the FSB bandwidth for my computer. Also, which number should I use to calculate the gain factor you suggested? Is it strange that my device-to-device bandwidth is only 62.05% of what the specification says?

  2. I am not sure whether the way I calculate the time is right, given that I want to compare the overall time of the loop. I would think the time should be fixed for a fixed N, but it turned out not to be:

For N=1023, I got times of zero and 0.016.

For N=2047, I got 0.031, 0.032, 0.015, and once zero…

For N=3071, I got 0.062 and 0.078, but once it reported a "memory free error" for one of the matrices.

  3. What is the reason that you added "if (i<=N)" in the kernel code? Could some exception happen otherwise?

Thanks a lot!

Best regards,

Question 1: how do I get the FSB bandwidth for my computer?

ans: you can write a host program that copies a large array, say A → B; then

bandwidth of FSB = (number of bytes transferred, read and write) / (elapsed time)

For example, I have a quad-core Q6600 and an ASUS P5Q PRO motherboard; then, under a single thread:

if I use the Intel C++ compiler, the bandwidth of the FSB is 2.5~3 GB/s;

if I use cl.exe of VC2005, the bandwidth of the FSB is less than 2 GB/s;

however, the theoretical upper bound of the bandwidth is 10~12 GB/s.
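A minimal host-side sketch of such a measurement (the array size and clock()-based timing are illustrative choices, not from the original post):

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    size_t n = 32 * 1024 * 1024;              /* 32M floats */
    float *a = (float*)malloc(n * sizeof(float));
    float *b = (float*)malloc(n * sizeof(float));
    memset(a, 1, n * sizeof(float));          /* touch the source once */

    clock_t t0 = clock();
    memcpy(b, a, n * sizeof(float));          /* read a, write b */
    clock_t t1 = clock();

    double sec   = (double)(t1 - t0) / CLOCKS_PER_SEC;
    double bytes = 2.0 * n * sizeof(float);   /* read + write */
    printf("host bandwidth: %.2f GB/s\n", bytes / sec / 1e9);

    free(a); free(b);
    return 0;
}[/codebox]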

Question 2: why is my device-to-device bandwidth only 62.05% of what the specification says?

ans: it is OK. For example, I have a GTX295 card whose maximum bandwidth is 111.888 GB/s; however, under bandwidthTest, the bandwidth is about 91 GB/s, only 81% of the maximum bandwidth.

Question 3: I would think the time should be fixed for a fixed N, but it turned out not to be.

ans: to measure time accurately, you formally need to remove the warmup time and average over many iterations:

[codebox]// remove warmup time
kernel<<< grids, threads >>>(...);
cudaThreadSynchronize();

// do average
unsigned int timer;
cutCreateTimer(&timer);
cutStartTimer(timer);
for (i = 0; i < numIterations; ++i){
    kernel<<< grids, threads >>>(...);
}
cudaThreadSynchronize();
cutStopTimer(timer);
naiveTime = cutGetTimerValue(timer);
ave_naiveTime = naiveTime / numIterations;
[/codebox]

Question 4: what is the reason that you added "if (i<=N)" in the kernel code?

ans: in general, the total number of threads may exceed your data size, so you need a boundary check, the same as you would have in a for-loop.

For example, if you have a vector A(0:1000) of data size 1001 and your thread block has 256 threads, then you need ceil(1001/256) = 4 blocks, which creates 4 x 256 = 1024 threads.
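In code, the usual pattern looks like this sketch (the thread count and the kernel used here are illustrative):

[codebox]int threads = 256;
int blocks  = (N + 1 + threads - 1) / threads;   // ceil((N+1)/threads)
EvaluateArray<<<blocks, threads>>>(d_Yold, d_Ynew);
// the "if (i <= N)" guard inside the kernel makes the extra
// threads of the last block do nothing[/codebox]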

Hello LSChien,

Thanks again for your generous reply! I really appreciate it!

  1. I ran a test copying b[i]=a[i], with size(a)=32M. The average time is 0.1421 s, which corresponds to 900.7741 MB/s of bandwidth. I am using Visual C++ 2005.

  2. According to what you suggested before, since the device-to-device bandwidth I got is 8132.5 MB/s, is it right that the most gain I can obtain from a BLAS1-level cuBLAS function is 8132.5/900.7741 ≈ 9x? This is not promising. Can you tell me how you arrived at this? Do you have some thoughts on the possible gains I can achieve with level-2 and level-3 functions? That way I can get some idea of the possible overall gain, since one of my friends has a GTX 260 machine that I can use for a short time later this week.

  3. You said earlier that I can calibrate the performance of cuBLAS for BLAS1 on my GT9400 and then profile my kernel functions. Can you give me a little more of a hint on what you mean by calibrating and profiling?

  4. I will compute the time as an average, as you suggested.

Thanks!

  1. I ran a test copying b[i]=a[i], with size(a)=32M. The average time is 0.1421 s, which corresponds to 900.7741 MB/s of bandwidth.

If the type of the arrays a and b is "float", then you have 1.8 GB/s, not 900.7741 MB/s:

32M x (4 bytes per float) x 2 (read from a and write to b) / 0.1421 s = 1.8 GB/s

BLAS1 is vector operations, which is a memory-bound problem; if you test the same operation on the GTX260, you will get a bigger speedup.

Question: Do you have some thoughts on the possible gains I can achieve when I use level 2 and level 3 functions?

ans: BLAS3 involves matrix-matrix multiplication, which is not purely memory-bound. I can show you an experiment that I ran:

platform: WinXP Pro x64, VC2005 with icpc 11.1.035 -O2, Tesla C1060, driver 190.38, CUDA 2.3

experiment: compute C = A*B, where A, B, C are square matrices of dimension N x N

d2h: device-to-host transfer

kernel: call the kernel function to compute C = A*B

h2d: host-to-device transfer

[codebox]------+-------------+---------------+-----------------------+------------+
      |             | CPU           |        GPU (ms)       |            |
  N   | size(A+B+C) | block version +-------+--------+------+ CPU/kernel |
      |     MB      | (ms)          |  h2d  | kernel |  d2h |            |
------+-------------+---------------+-------+--------+------+------------+
 2048 |       96    |          9938 |    31 |    219 |   31 |       45.4 |
------+-------------+---------------+-------+--------+------+------------+
 4096 |      384    |         82016 |   109 |   1813 |   93 |       45.2 |
------+-------------+---------------+-------+--------+------+------------+
 5600 |      717.8  |        177532 |   203 |   5453 |  313 |       32.6 |
------+-------------+---------------+-------+--------+------+------------+[/codebox]

You can test cublasSgemm on your machine or on the GTX260 machine.
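A minimal sketch of such a test with the legacy cuBLAS API (the dimension and the host matrices h_A, h_B are placeholders):

[codebox]int n = 2048;
float *d_A, *d_B, *d_C;
cublasInit();
cublasAlloc(n*n, sizeof(float), (void**)&d_A);
cublasAlloc(n*n, sizeof(float), (void**)&d_B);
cublasAlloc(n*n, sizeof(float), (void**)&d_C);
cublasSetMatrix(n, n, sizeof(float), h_A, n, d_A, n);
cublasSetMatrix(n, n, sizeof(float), h_B, n, d_B, n);
// C = 1.0*A*B + 0.0*C, column-major, no transpose
cublasSgemm('n', 'n', n, n, n, 1.0f, d_A, n, d_B, n, 0.0f, d_C, n);
cudaThreadSynchronize();   // wait for the kernel before stopping any timer[/codebox]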

Question: Can you give me a little more hint on what you mean by calibrating and profiling?

ans: I mean timing profiling. You can write code to test the performance of cuBLAS and of your kernel functions; then you will see the difference between memory-bound and compute-bound.