I have a problem when calling a CUDA kernel several times. I just want to test the speed of my program. Running the kernel only once is very fast, so I put it in a loop and take the average time (like the aligned types sample in the SDK):
int nLoop = 8;
for(int i = 0;i < nLoop;++i)
{
// My kernel here
}
Everything is OK, but when I set nLoop to 9 or more, the speed drops from about 6,000 fps to 347 fps (O_O)!!!
Here are all the values from my testing:
nLoop = 1 : ~ 41,000 fps
nLoop = 2 : ~ 21,000 fps
nLoop = 4 : ~ 10,500 fps
nLoop = 8 : ~ 6,100 fps
nLoop = 9 : ~ 347 fps
I don't know why the speed drops so much. Please help me!
Thanks!
Note: I have the same problem when I replace my kernel with cudaMemset() too:
int nLoop = 8;
for(int i = 0;i < nLoop;++i)
{
cudaMemset();
}
Hi RLight.
I don't know why, but I think you can add this statement after each call to confirm that it always completes without reporting an error:
int nLoop = 8;
for(int i = 0;i < nLoop;++i)
{
// My kernel here
printf("CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));
}
int nLoop = 8;
for(int i = 0;i < nLoop;++i)
{
cudaMemset();
printf("CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));
}
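A slightly fuller version of this check might look like the sketch below. It is only an illustration: dummyKernel, checkCuda, the buffer size and the launch configuration are made-up placeholders, not RLight's actual code. It also uses cudaDeviceSynchronize(), which in newer CUDA releases replaces the deprecated cudaThreadSynchronize(). The point is that cudaGetLastError() right after a launch only reports launch-time problems, while errors that happen during execution surface at the next synchronizing call.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel under test.
__global__ void dummyKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1.0f;
}

// Hypothetical helper: report a failed CUDA call and return false.
static bool checkCuda(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error in %s: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    const int n = 1 << 20;   // assumed problem size
    const int nLoop = 9;
    float *d_data = NULL;

    if (!checkCuda(cudaMalloc((void **)&d_data, n * sizeof(float)), "cudaMalloc")) return 1;
    if (!checkCuda(cudaMemset(d_data, 0, n * sizeof(float)), "cudaMemset")) return 1;

    for (int i = 0; i < nLoop; ++i) {
        dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        // Launch-time errors (for example a bad configuration) show up here.
        if (!checkCuda(cudaGetLastError(), "kernel launch")) return 1;
    }

    // Errors raised while the kernels run are only reported once the
    // device has finished, so synchronize before declaring success.
    if (!checkCuda(cudaDeviceSynchronize(), "kernel execution")) return 1;

    cudaFree(d_data);
    printf("all %d launches completed without error\n", nLoop);
    return 0;
}
```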
It’s likely your timing loop. Kernel launches are asynchronous and get queued up.
You probably need a cudaThreadSynchronize() call to make sure all the launched kernels have completed before you close the timing.
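The measured time jumps in your loop because the queue of pending kernel launches has a finite size. Up to that limit, each launch returns immediately and you only time the launch overhead. Once you hit the limit, further launches block until some of the earlier kernels have finished, so you get an implicit synchronization and start measuring real execution time.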
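One way to time the loop so that the queued kernels are actually included is with CUDA events, roughly as sketched below. Treat this as an assumption-laden example rather than the fix for your program: dummyKernel, the buffer size and the launch configuration are placeholders, and the warm-up launch is an extra step added here so that one-time initialization is not counted. cudaEventSynchronize(stop) plays the role of the synchronize call mentioned above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being benchmarked.
__global__ void dummyKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1.0f;
}

int main()
{
    const int n = 1 << 20;   // assumed problem size
    const int nLoop = 9;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_data = NULL;
    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so context setup is not part of the measurement.
    dummyKernel<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaEventRecord(start, 0);
    for (int i = 0; i < nLoop; ++i) {
        dummyKernel<<<blocks, threads>>>(d_data, n);  // asynchronous: returns at once
    }
    cudaEventRecord(stop, 0);
    // Block until the stop event, and thus every queued kernel, has completed.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    printf("total: %f ms, average per launch: %f ms\n", ms, ms / nLoop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

With the synchronization in place, the average time per launch should stay roughly constant as nLoop grows, instead of looking very fast for small nLoop and much slower beyond it.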
Thanks SPWorley, I changed the code and got about 3,120 fps this time! (Over 40,000 fps was impossible ^^)
But I still have a problem: with nLoop = 9, the speed is about 124 fps (I expected ~340 fps)! If I remove cudaThreadSynchronize(), the speed increases by about 10%, but it's still slow. I guess something is wrong here, so I'll check the code again to make sure.