I have a question.
I wrote an elementary program. :) Its flow is as follows:
- The program prepares two arrays (A and B) with random numbers on the host.
- It copies the two arrays from host to device.
- A single SP (stream processor) on the GPU computes C[i] = A[i] * B[i] + 1.0 for each element.
- The program copies the result C from device to host.
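The steps above correspond roughly to the following sketch. It is not the attached source: the array length, the names `multiplyAdd`, `h_a`, `h_b`, `h_c`, and the omission of error checks are assumptions made to keep the example short.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// A single thread computes every element, as in the description above.
__global__ void multiplyAdd(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] * b[i] + 1.0f;
    }
}

int main() {
    const int n = 1 << 20;                 // assumed size, for illustration only
    const size_t bytes = n * sizeof(float);

    // Step 1: prepare A and B with random numbers on the host.
    float* h_a = (float*)malloc(bytes);
    float* h_b = (float*)malloc(bytes);
    float* h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) {
        h_a[i] = rand() / (float)RAND_MAX;
        h_b[i] = rand() / (float)RAND_MAX;
    }

    // Step 2: copy both arrays from host to device.
    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Step 3: one block with one thread does all the work.
    multiplyAdd<<<1, 1>>>(d_a, d_b, d_c, n);

    // Step 4: copy the result back; this copy waits for the kernel to finish.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("device C[0] = %f, host reference = %f\n",
           h_c[0], h_a[0] * h_b[0] + 1.0f);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

The device result and the host reference may differ in the last digits, because the compiler is free to fuse the multiply and add on the GPU.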
For the first 16 steps the computation seems very fast, but from the 17th step on it seems very slow.
The following appears on the screen immediately:
|…|…
But each following character (.) appears only after a short delay.
Why does CUDA cause this slowdown? How can I improve this program?
Hardware : GeForce 8800GTS with 320 MB memory
OS : Linux (CentOS 4.5)
I attach the full source code with this post.
---- device kernel code ----
__global__ void kernelDevice( float* d_a, float* d_b, float* d_c, int nx, int loadLevel){
    // A single thread walks the whole array; the outer loop repeats the work loadLevel times.
    for(int itr=0; itr<loadLevel; itr++)
        for(int i=0; i<nx; i++){
            d_c[i] = d_a[i] * d_b[i] + 1.0f;
        }
    __syncthreads();
}
---- host side program (partial code) ----
int array_size = 100*100*100;
int max_itr=100;
int loadLevel=100;
dim3 grid( 1, 1, 1);     // one block
dim3 threads( 1, 1, 1);  // one thread per block
// execute the kernel
for(int i=0;i<max_itr;i++){
    if( !(i%10) ) putchar('|'); fflush(stdout);   // marker every 10 steps
    putchar('.'); fflush(stdout);                 // one dot per kernel launch
    kernelDevice<<< grid, threads, 0 >>>( d_a, d_b, d_c, array_size, loadLevel);
}
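
One way to check where the time actually goes is to time each step and wait for the kernel to finish before printing. A kernel launch returns control to the host before the kernel has completed, so the dots in the loop above show when launches were issued, not when the work was done. The following is a minimal sketch using CUDA events; the small `nx`, the shortened `max_itr`, and the zero-filled inputs are assumptions chosen so that it runs quickly, and error checks are left out.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same single-thread kernel as above, without the trailing __syncthreads().
__global__ void kernelDevice(float* d_a, float* d_b, float* d_c, int nx, int loadLevel) {
    for (int itr = 0; itr < loadLevel; itr++)
        for (int i = 0; i < nx; i++)
            d_c[i] = d_a[i] * d_b[i] + 1.0f;
}

int main() {
    const int nx = 10000;        // assumed small size so the sketch runs quickly
    const int loadLevel = 100;
    const int max_itr = 20;      // enough steps to go past the 16th
    const size_t bytes = nx * sizeof(float);

    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);
    cudaMemset(d_a, 0, bytes);   // the input values do not matter for timing
    cudaMemset(d_b, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < max_itr; i++) {
        cudaEventRecord(start, 0);
        kernelDevice<<<1, 1>>>(d_a, d_b, d_c, nx, loadLevel);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);   // block the host until this kernel has finished

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("step %2d: %.3f ms\n", i, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

If every step reports a similar time, the kernel itself is not getting slower after the 16th step; only the point at which the host has to wait is changing.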
sample.txt (5.28 KB)