Speed drops 17-20x once the kernel is called a 9th time! T_T

Hi everyone,

I have a problem when calling a CUDA kernel several times. I just want to measure the speed of my program. A single kernel run is very fast, so I put it in a loop and take the average time (like the alignedTypes sample in the SDK):

int nLoop = 8;
for (int i = 0; i < nLoop; ++i)
{
    // My kernel here
}

Everything is OK, but when I set nLoop to 9 or more, the speed drops from about 6,000 fps to 347 fps (O_O)!!!

Here are all the values from my tests:

nLoop = 1 : ~ 41,000 fps
nLoop = 2 : ~ 21,000 fps
nLoop = 4 : ~ 10,500 fps
nLoop = 8 : ~ 6,100 fps
nLoop = 9 : ~ 347 fps

I don't know why the speed drops so much! Please help me!

Thanks!

Note: I get the same problem when I replace my kernel with cudaMemset() too!

int nLoop = 8;
for (int i = 0; i < nLoop; ++i)
{
    cudaMemset();
}

I’m so confused ???___???!!!

Hi RLight.
I don’t know why, but you can use this statement to check that your function always runs correctly (no errors):
int nLoop = 8;
for (int i = 0; i < nLoop; ++i)
{
    // My kernel here
    printf("CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));
}

int nLoop = 8;
for (int i = 0; i < nLoop; ++i)
{
    cudaMemset();
    printf("CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));
}
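
One caveat with the check above: cudaGetLastError() called right after a launch only reports launch-time errors, while errors raised during kernel execution show up at the next synchronizing call. Here is a minimal sketch that checks both. The CUDA_CHECK macro and dummyKernel are hypothetical names made up for this example (they stand in for your own kernel), and the buffer size is arbitrary. Synchronizing on every iteration changes the timing, so this is meant for debugging only, not for measuring speed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not from this thread): abort on any CUDA error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,    \
                    __LINE__, cudaGetErrorString(err_));              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel standing in for the real one.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;   // arbitrary buffer size for the example
    const int nLoop = 9;
    float *d_data = nullptr;

    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_data, 0, n * sizeof(float)));

    for (int i = 0; i < nLoop; ++i) {
        dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
        CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran
    }

    CUDA_CHECK(cudaFree(d_data));
    printf("all %d launches completed without error\n", nLoop);
    return 0;
}
```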

It’s likely your timing loop. Kernel launches are asynchronous and get queued up.
You probably need a cudaThreadSynchronize() call to make sure all the launched kernels have completed before you close the timing.
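
A minimal sketch of a timing loop that waits for the queued kernels before stopping the clock. This is only one way to do it, using CUDA events rather than a host timer. dummyKernel and the buffer size are placeholders invented for the example, not your code. Note that cudaThreadSynchronize() is deprecated in newer toolkits in favor of cudaDeviceSynchronize(), which is what the sketch uses.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real one.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;   // arbitrary buffer size for the example
    const int nLoop = 9;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time initialization cost is not timed.
    dummyKernel<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaEventRecord(start, 0);
    for (int i = 0; i < nLoop; ++i) {
        dummyKernel<<<blocks, threads>>>(d_data, n);  // returns immediately
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // block until all queued kernels have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d launches: %.3f ms total, %.3f ms per launch\n",
           nLoop, ms, ms / nLoop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

The events are recorded in the same stream as the kernels, so the elapsed time covers actual GPU execution instead of just the cost of queuing the launches.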

The time jumps noticeably at a certain loop count because the queue of pending kernel launches has a finite size. You’re probably hitting that limit, at which point further launches block until some of the earlier ones have finished, so your timer suddenly starts including real execution time.
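
As a rough sanity check on the numbers posted earlier in the thread, here is a small host-only helper that converts each reported fps figure into an average time per launch. It assumes that "fps" means 1 / (time for the whole loop), which is a guess based on the way the figure halves when nLoop doubles. perLaunchMicroseconds is a made-up name for this example.

```cpp
#include <cstdio>

// Average time per launch in microseconds, ASSUMING the reported fps
// is 1 / (wall-clock time of the whole nLoop-iteration loop).
double perLaunchMicroseconds(double fps, int nLoop)
{
    return 1.0e6 / (fps * nLoop);
}

// Print the per-launch time for each (nLoop, fps) pair from the original post.
void printPerLaunchTable()
{
    const int    loops[] = {1, 2, 4, 8, 9};
    const double fps[]   = {41000.0, 21000.0, 10500.0, 6100.0, 347.0};

    for (int k = 0; k < 5; ++k) {
        printf("nLoop = %d : %.1f us per launch\n",
               loops[k], perLaunchMicroseconds(fps[k], loops[k]));
    }
}
```

Calling printPerLaunchTable() from main() gives roughly 20 to 24 microseconds per launch for nLoop up to 8, and about 320 microseconds per launch at nLoop = 9. A nearly constant cost followed by a sudden jump fits the idea that the short loops only measure launch overhead, though it does not prove it.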

Thanks SPWorley, I changed the code and got about 3,120 fps this time! (Over 40,000 fps was impossible ^^)

But I still have a problem: with nLoop = 9, the speed is about 124 fps (I expected ~340 fps)! If I remove cudaThreadSynchronize(), the speed increases by about 10%, but it’s still slow. I guess something is wrong here! I’ll check the code again to make sure.

Thanks for your help!

I suspect the code that does the timing… Could you post that code as well?