Speed drops 17x to 20x once the kernel is called 9 times! T_T!

RLight · November 14, 2008, 6:26am

Hi everyone,

I have a problem when calling a CUDA kernel several times! I just want to measure the speed of my program. Running the kernel only once is very fast, so I put it in a loop and take the average time (like the alignedTypes sample in the SDK).

int nLoop = 8;
for (int i = 0; i < nLoop; ++i)
{
    // My kernel here
}

Everything is OK, but when I set nLoop to 9 or more, the speed drops from about 6000 fps to 347 fps (O_O)!!!

Here are all the values from my testing:

nLoop = 1 : ~ 41,000 fps
nLoop = 2 : ~ 21,000 fps
nLoop = 4 : ~ 10,500 fps
nLoop = 8 : ~ 6,100 fps
nLoop = 9 : ~ 347 fps

I don't know why the speed drops so much! Please help me!

Thanks!

Note: Oh, I have the same problem when I replace my kernel with cudaMemset() too!

int nLoop = 8;
for (int i = 0; i < nLoop; ++i)
{
    cudaMemset();
}

I'm so confused ???___???!!!

Quoc_Vinh · November 14, 2008, 9:12am

Hi RLight.
I don't know why, but I think you can add this statement after each call to confirm that it always completes without an error.
int nLoop = 8;
for (int i = 0; i < nLoop; ++i)
{
    // My kernel here
    printf("CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));
}

int nLoop = 8;
for (int i = 0; i < nLoop; ++i)
{
    cudaMemset();
    printf("CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));
}
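
The same check can be wrapped in a macro so that a failure stops the program instead of scrolling past in the output. This is only a sketch: `CUDA_CHECK` and `placeholderKernel` are hypothetical names, the buffer size is arbitrary, and the original kernel was never posted. It also uses `cudaDeviceSynchronize()`, which current toolkits provide in place of the older `cudaThreadSynchronize()`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the thread): abort with file/line on any CUDA error.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",               \
                    __FILE__, __LINE__, cudaGetErrorString(err_));     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Hypothetical stand-in for the poster's kernel.
__global__ void placeholderKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int n = 1 << 16;   // assumed buffer size
    const int nLoop = 8;
    int *d_data = NULL;

    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(int)));
    CUDA_CHECK(cudaMemset(d_data, 0, n * sizeof(int)));

    for (int i = 0; i < nLoop; ++i) {
        placeholderKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        // Reports launch and configuration errors for the launch just issued.
        CUDA_CHECK(cudaGetLastError());
    }

    // Errors that occur while the kernels run are only reported after a sync.
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaFree(d_data));
    printf("All %d launches completed without error\n", nLoop);
    return 0;
}
```

Checking `cudaGetLastError()` right after the launch catches bad launch configurations, while the synchronize call at the end surfaces faults raised during execution.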

SPWorley · November 14, 2008, 10:09am

It’s likely your timing loop. Kernel launches are asynchronous and get queued up.
You probably need a cudaThreadSynchronize() call to make sure all the launched kernels have completed before you close the timing.

The measured time jumps in your loop because the queue of pending kernel launches has a finite size. Once it is full, each further launch has to wait for an earlier kernel to finish, so you're probably hitting that limit and getting some implicit synchronization on the first few launches.
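
A minimal sketch of a timing loop that waits for the queued kernels before reading the clock, along the lines suggested above. The names `scaleKernel` and the problem size are assumptions, since the original kernel and timing code were not posted. It uses CUDA events, with `cudaEventSynchronize()` playing the role of the synchronization call, and `cudaDeviceSynchronize()` as the current replacement for `cudaThreadSynchronize()`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the poster's kernel.
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 20;        // assumed problem size
    const int nLoop = 9;          // the loop count that showed the slowdown
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_data = NULL;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time initialization cost is not timed.
    scaleKernel<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaEventRecord(start, 0);
    for (int i = 0; i < nLoop; ++i)
        scaleKernel<<<blocks, threads>>>(d_data, n);   // asynchronous: only enqueues work
    cudaEventRecord(stop, 0);

    // Block the host until every kernel queued before 'stop' has finished.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d launches: %.3f ms total, %.3f ms per launch\n",
           nLoop, ms, ms / nLoop);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

If the clock is read without waiting, a host-side timer would likely measure only how fast launches can be enqueued. That would be consistent with the first post's numbers, where fps times nLoop stays roughly constant (about 41,000 to 49,000) up to nLoop = 8.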

RLight · November 15, 2008, 5:16pm

Thanks SPWorley, I changed the code and got about 3120 fps this time! (Over 40,000 fps was impossible ^^)

But I still have a problem: with nLoop = 9, the speed is about 124 fps (I expected about 340 fps)! If I remove cudaThreadSynchronize(), the speed increases by about 10%, but it's still slow. I guess something is wrong here, so I'll check the code again to make sure.

Thanks for your help!

Sarnath · November 18, 2008, 5:33am

I suspect the code that does the timing… Can you post that code as well?
