Slowdown a little later

I have a question.

I wrote an elementary program. :) The program flow is as follows:

  1. The program prepares two arrays (A and B) of random numbers on the host.
  2. It copies the two arrays from host to device.
  3. A single SP on the GPU computes C[i] = A[i] * B[i] + 1.0 for each element.
  4. The program copies the result C from device to host.

For the first 16 steps the computation seems very fast, but from the 17th step on it seems very slow.
The following appears on the screen immediately:
|…|…
But the next character (.) appears only after a short delay.

Why does CUDA slow down like this? How can I improve this program?

Hardware: GeForce 8800GTS with 320 MB memory
OS: Linux CentOS 4.5

I have attached the full source code to this post.

----device’s kernel code ----------
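__global__ void kernelDevice( float* d_a, float* d_b, float* d_c, int nx, int loadLevel){
    // Repeat the whole pass loadLevel times to increase the load
    for(int itr=0; itr<loadLevel; itr++)
        // A single thread walks over every element
        for(int i=0; i<nx; i++){
            d_c[i] = d_a[i] * d_b[i] + 1.0f;
        }
    __syncthreads();
}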

---- host side program (partial code) -----
int array_size = 100*100*100;
int max_itr=100;
int loadLevel=100;

dim3 grid( 1, 1, 1);
dim3 threads( 1, 1, 1);

// execute the kernel
for(int i=0; i<max_itr; i++){
    if( !(i%10) ) putchar('|'); fflush(stdout);
    putchar('.'); fflush(stdout);
    kernelDevice<<< grid, threads, 0 >>>( d_a, d_b, d_c, array_size, loadLevel);
}

sample.txt (5.28 KB)

You are using a single processor out of the 96 available.

Take a look at this code to add 2 vectors:
http://forums.nvidia.com/index.php?showtopic=34309
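
For reference, here is a minimal sketch of what using more than one processor could look like for the C[i] = A[i] * B[i] + 1.0 computation from the question. It is not the code from the linked topic. The names kernelParallel and multiplyAddOnDevice and the block size of 256 are assumptions made for illustration only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical parallel variant of kernelDevice: one thread per element
// instead of one thread looping over the whole array.
__global__ void kernelParallel(const float* d_a, const float* d_b,
                               float* d_c, int nx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nx) {                       // guard the last, partly filled block
        d_c[i] = d_a[i] * d_b[i] + 1.0f;
    }
}

// Hypothetical helper: copies A and B to the device, runs the kernel,
// copies C back. Returns 0 on success, 1 on any CUDA error.
int multiplyAddOnDevice(const float* h_a, const float* h_b, float* h_c, int nx)
{
    float *d_a = NULL, *d_b = NULL, *d_c = NULL;
    size_t bytes = (size_t)nx * sizeof(float);

    cudaError_t err = cudaMalloc((void**)&d_a, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&d_b, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&d_c, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        int threadsPerBlock = 256;      // assumed block size, tune for your GPU
        int blocks = (nx + threadsPerBlock - 1) / threadsPerBlock;
        kernelParallel<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, nx);
        err = cudaGetLastError();       // catches launch configuration errors
    }

    // This copy waits for the kernel to finish before reading d_c.
    if (err == cudaSuccess) err = cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return err == cudaSuccess ? 0 : 1;
}
```

Calling multiplyAddOnDevice(h_a, h_b, h_c, array_size) from the host program would replace the single-thread launch. The grid here is one-dimensional, so very large arrays may need a loop inside the kernel or a 2D grid on older hardware.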

Thank you very much for your fast reply!

As you said, I am using a single processor, and I am not planning to use multiple processors in this sample program. :))

I am confused because the program runs smoothly through the 16th step, but the calculation suddenly becomes slow from the 17th step on. :(

Why does this happen?

I will check it.

Thanks

Note that kernel launches are asynchronous in CUDA 1.0, so when you print the “.”, the kernel has not necessarily completed. I expect the slowdown after 16 iterations is caused by some internal launch buffer being full.

If you add a call to cudaThreadSynchronize() after each kernel call you’ll get a better idea of how long it’s really taking.
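
A rough sketch of that suggestion, applied to the loop from the original post. Two things here are assumptions rather than part of the posted code: the helper name runSynchronized is made up, and the sketch calls cudaDeviceSynchronize(), which replaced the now deprecated cudaThreadSynchronize() in later CUDA releases. On CUDA 1.0 you would call cudaThreadSynchronize() at the same spot.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same single-thread kernel as in the question.
__global__ void kernelDevice(float* d_a, float* d_b, float* d_c,
                             int nx, int loadLevel)
{
    for (int itr = 0; itr < loadLevel; itr++)
        for (int i = 0; i < nx; i++) {
            d_c[i] = d_a[i] * d_b[i] + 1.0f;
        }
}

// Hypothetical helper: launches the kernel max_itr times and waits for each
// launch to finish before printing its dot. Returns the number of launches
// that completed without error.
int runSynchronized(float* d_a, float* d_b, float* d_c,
                    int nx, int loadLevel, int max_itr)
{
    dim3 grid(1, 1, 1);
    dim3 threads(1, 1, 1);
    int completed = 0;

    for (int i = 0; i < max_itr; i++) {
        if (!(i % 10)) putchar('|');

        kernelDevice<<<grid, threads>>>(d_a, d_b, d_c, nx, loadLevel);

        // Block the host until the kernel has really finished, so the dot
        // below reflects completed work and not just a queued launch.
        if (cudaDeviceSynchronize() != cudaSuccess) break;

        putchar('.');
        fflush(stdout);
        completed++;
    }
    putchar('\n');
    return completed;
}
```

With the synchronize call in place, each dot should take about the same time to appear, which shows the real cost of one kernel run instead of the cost of queueing it.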

Thank you for your reply.

I added cudaThreadSynchronize() inside the for loop in that code, and found that the dots (.) now appear one by one.

Thanks