Slow Down a little later

soa · July 26, 2007, 7:02pm

I have a question.

I wrote an elementary program. :) Its flow is as follows:

The program prepares two arrays (A and B) of random numbers on the host.
It copies the two arrays from host to device.
A single SP on the GPU computes C[i] = A[i] * B[i] + 1.0 for each element.
The program copies the result C from device to host.
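
For readers without the attachment, the four steps above can be sketched roughly as follows. This is a minimal illustration, not the attached program: the kernel name multiplyAddOne, the host buffer names h_a, h_b, h_c, and the small array size are assumptions chosen for a quick run.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One thread walks the whole array, matching the single-SP flow described above.
__global__ void multiplyAddOne(const float* d_a, const float* d_b, float* d_c, int nx)
{
    for (int i = 0; i < nx; i++) {
        d_c[i] = d_a[i] * d_b[i] + 1.0f;
    }
}

int main()
{
    const int nx = 1 << 16;                  // assumed size, kept small on purpose
    const size_t bytes = nx * sizeof(float);

    // Step 1: prepare A and B with random numbers on the host.
    float* h_a = (float*)malloc(bytes);
    float* h_b = (float*)malloc(bytes);
    float* h_c = (float*)malloc(bytes);
    for (int i = 0; i < nx; i++) {
        h_a[i] = rand() / (float)RAND_MAX;
        h_b[i] = rand() / (float)RAND_MAX;
    }

    // Step 2: copy A and B from host to device.
    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Step 3: a single thread computes C[i] = A[i] * B[i] + 1.0 for every element.
    multiplyAddOne<<<1, 1>>>(d_a, d_b, d_c, nx);

    // Step 4: copy C back to the host. This cudaMemcpy waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("C[0] = %f, host reference = %f\n", h_c[0], h_a[0] * h_b[0] + 1.0f);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

Checking the error code on the final copy is a cheap way to notice a failed allocation or launch, since earlier errors tend to surface there.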

For the first 16 steps, the computation seems very fast, but from the 17th step on it seems very slow.
The following appears on the screen immediately:
|…|…
but each subsequent character (.) appears only after a short delay.

Why does CUDA cause this slowdown? How can I improve this program?

Hardware : GeForce 8800GTS with 320 MB memory
OS : Linux Cent OS 4.5

I have attached the full source code to this post.


---- device kernel code ----------
// A single thread loops over the whole array; loadLevel repeats the work to lengthen the kernel.
__global__ void kernelDevice( float* d_a, float* d_b, float* d_c, int nx, int loadLevel){
    for(int itr=0; itr<loadLevel; itr++)
        for(int i=0; i<nx; i++){
            d_c[i] = d_a[i] * d_b[i] + 1.0f;
        }
    __syncthreads();
}

---- host side program (partial code) ----------
int array_size = 100*100*100;
int max_itr=100;
int loadLevel=100;

dim3 grid( 1, 1, 1);
dim3 threads( 1, 1, 1);

// execute the kernel
for(int i=0;i<max_itr;i++){
    if( !(i%10) ) putchar('|'); fflush(stdout);
    putchar('.'); fflush(stdout);
    kernelDevice<<< grid, threads, 0 >>>( d_a, d_b, d_c, array_size, loadLevel);
}

sample.txt (5.28 KB)

mfatica · July 26, 2007, 8:16pm

You are using a single processor out of the 96 available.

Take a look at this code to add 2 vectors:
http://forums.nvidia.com/index.php?showtopic=34309
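
To illustrate the point about using more than one processor, here is a minimal sketch of a parallel version in which each thread handles one element. It is not the code from the linked thread; the kernel name kernelParallel, the block size of 256, and the array size are assumptions made for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes exactly one element of C.
__global__ void kernelParallel(const float* d_a, const float* d_b, float* d_c, int nx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nx) {                       // guard: the last block may be partly unused
        d_c[i] = d_a[i] * d_b[i] + 1.0f;
    }
}

int main()
{
    const int nx = 1000000;             // assumed array size
    const size_t bytes = nx * sizeof(float);

    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);

    // Zero the inputs so that every output element is 0 * 0 + 1 = 1.
    cudaMemset(d_a, 0, bytes);
    cudaMemset(d_b, 0, bytes);

    // Enough blocks of 256 threads to cover all nx elements.
    const int threadsPerBlock = 256;
    const int blocks = (nx + threadsPerBlock - 1) / threadsPerBlock;
    kernelParallel<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, nx);

    // Read back the last element as a spot check.
    float last = 0.0f;
    cudaMemcpy(&last, d_c + (nx - 1), sizeof(float), cudaMemcpyDeviceToHost);
    printf("C[nx-1] = %f\n", last);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

Compared with a <<<1, 1>>> launch, the inner loop over i disappears: the grid of threads plays the role of the loop, and the hardware spreads the blocks across the available processors.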

soa · July 27, 2007, 7:25am

Thank you very much for your fast reply!

As you said, I am using a single processor, and I am not planning to use multiple processors in this sample program.

I am confused because the program runs smoothly through the 16th step, but the calculation suddenly becomes slow from the 17th step on. :(

Why does this happen?

I will check it.

Thanks

Simon_Green · July 27, 2007, 12:46pm

Note that kernel launches are asynchronous in CUDA 1.0, so when you print the “.”, the kernel has not necessarily completed. I expect the slowdown after 16 iterations is caused by some internal launch buffer being full.

If you add a call to cudaThreadSynchronize() after each kernel call you’ll get a better idea of how long it’s really taking.
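
As a rough sketch of that suggestion, the loop below times how long the launch call takes to return and how long the kernel takes to actually finish. It uses cudaDeviceSynchronize(), which in later CUDA releases replaced the now-deprecated cudaThreadSynchronize(). The array size, loadLevel, iteration count, and the use of std::chrono for host-side timing are assumptions for the example, not values from the attached program.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Same single-thread kernel shape as in the original post.
__global__ void kernelDevice(float* d_a, float* d_b, float* d_c, int nx, int loadLevel)
{
    for (int itr = 0; itr < loadLevel; itr++)
        for (int i = 0; i < nx; i++)
            d_c[i] = d_a[i] * d_b[i] + 1.0f;
}

int main()
{
    // Assumed sizes, much smaller than in the post so the demo finishes quickly.
    const int nx = 100000;
    const int loadLevel = 10;
    const int max_itr = 20;
    const size_t bytes = nx * sizeof(float);

    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);
    cudaMemset(d_a, 0, bytes);
    cudaMemset(d_b, 0, bytes);

    typedef std::chrono::steady_clock Clock;
    typedef std::chrono::duration<double, std::milli> Millis;

    for (int i = 0; i < max_itr; i++) {
        Clock::time_point start = Clock::now();
        kernelDevice<<<1, 1>>>(d_a, d_b, d_c, nx, loadLevel);
        Clock::time_point launched = Clock::now();  // the launch call has returned
        cudaDeviceSynchronize();                    // block until the kernel is done
        Clock::time_point finished = Clock::now();

        printf("iter %2d: launch returned after %.3f ms, kernel finished after %.3f ms\n",
               i, Millis(launched - start).count(), Millis(finished - start).count());
    }

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

With the synchronize call in place, every iteration should report a similar "kernel finished" time. Without it, the host only measures how quickly launches can be queued, which is why the first several dots appear at once.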

soa · July 30, 2007, 9:16am

Thank you for your reply.

I tried adding cudaThreadSynchronize() inside the for-loop in that code, and I found that the dots (.) appeared one by one.

Thanks
