cudaThreadSynchronize() vs. cudaStreamSynchronize

jirikraus · January 19, 2010, 12:50pm

Hello,

I have a question regarding cudaThreadSynchronize() and cudaStreamSynchronize(). At the moment I am basically doing the following:

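void mySolver()
{
  for (int i = 0; i < maxSteps; ++i)
  {
    // Launch configuration is <<<grid, block, shared memory bytes, stream>>>,
    // so the stream is the fourth argument.
    kernel1<<<dimGrid, dimBlock, 0, stream1>>>();
    kernel2<<<dimGrid, dimBlock, 0, stream2>>>();
    kernel3<<<dimGrid, dimBlock, 0, stream2>>>();
    kernel4<<<dimGrid, dimBlock, 0, stream2>>>();
    kernel5<<<dimGrid, dimBlock, 0, stream2>>>();

    // Wait for kernel1 before launching kernel6.
    cudaStreamSynchronize(stream1);
    kernel6<<<dimGrid, dimBlock, 0, stream2>>>();

    // Wait for everything queued in stream2 before the next iteration.
    cudaStreamSynchronize(stream2);
  }
}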

and I am timing mySolver() with wall-clock time:

startTime = wall_clock();
mySolver();
stopTime = wall_clock();
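For comparison, here is a minimal sketch of how such a measurement could be cross-checked, timing the same launches with both a host clock and CUDA events. The kernel `dummyKernel`, the function `timeLaunches`, and the problem size are hypothetical stand-ins and not part of the solver above. `cudaDeviceSynchronize()` is used here because it replaced `cudaThreadSynchronize()`, which is deprecated in later CUDA releases.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical stand-in for the solver's kernels: a trivial element-wise update.
__global__ void dummyKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1.0f;
}

// Launches dummyKernel maxSteps times in one stream and times the batch twice:
// with CUDA events (return value, in ms) and with a host clock (*hostMs, in ms).
// Returns a negative value if the device buffer cannot be allocated.
double timeLaunches(int maxSteps, double *hostMs)
{
    *hostMs = 0.0;
    const int n = 1 << 20;  // assumed problem size, for illustration only
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return -1.0;
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Drain any earlier work so it is not attributed to this measurement.
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    cudaEventRecord(start, stream);
    for (int i = 0; i < maxSteps; ++i)
        dummyKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    cudaEventRecord(stop, stream);

    // Block the host until the stop event (and all work before it) has completed.
    cudaEventSynchronize(stop);
    auto t1 = std::chrono::steady_clock::now();

    float eventMs = 0.0f;
    cudaEventElapsedTime(&eventMs, start, stop);
    *hostMs = std::chrono::duration<double, std::milli>(t1 - t0).count();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return eventMs;
}
```

Calling `timeLaunches(steps, &hostMs)` for several values of `steps` and comparing the two numbers should show whether the host timer and the GPU-side timer agree. If they diverge, the host clock is probably stopping before the queued GPU work has finished.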

When I vary the parameter maxSteps, I observe the following run times:

maxSteps | wall clock time
---------+----------------
10       | 0.121 ms
20       | 0.202 ms
30       | 0.290 ms
40       | 0.363 ms
50       | 0.442 ms
100      | 0.944 ms
150      | 2.381 ms
200      | 31.513 ms
238      | 53.640 ms

As the computational work per iteration does not change, I cannot really explain this behavior. If I add

cudaThreadSynchronize()

at the end of mySolver(), the whole thing gets much slower, but the run time scales with the number of iterations more as expected. So my questions are:

- What exactly does cudaThreadSynchronize() do beyond what cudaStreamSynchronize() does?
- Why does the run time of the cudaStreamSynchronize() version not scale with the number of iterations?
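
One way to probe the first question experimentally is sketched below. `cudaStreamSynchronize()` waits only for the work queued in the given stream, while a device-wide synchronization waits for all streams. The sketch uses `cudaDeviceSynchronize()`, the successor of the now-deprecated `cudaThreadSynchronize()`. The kernel `spinKernel`, the function `probeSyncScope`, and the cycle counts are hypothetical and chosen only for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel that busy-waits for roughly `cycles` GPU clock cycles.
__global__ void spinKernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Queues a short kernel in stream1 and a long one in stream2.
// *busyAfterStreamSync is set to true if stream2 was still running after
// only stream1 had been synchronized.
// Returns true if stream2 was idle after a device-wide synchronization.
bool probeSyncScope(bool *busyAfterStreamSync)
{
    *busyAfterStreamSync = false;
    cudaStream_t stream1, stream2;
    if (cudaStreamCreate(&stream1) != cudaSuccess) return false;
    if (cudaStreamCreate(&stream2) != cudaSuccess) {
        cudaStreamDestroy(stream1);
        return false;
    }

    spinKernel<<<1, 1, 0, stream1>>>(1000LL);        // short-running
    spinKernel<<<1, 1, 0, stream2>>>(500000000LL);   // long-running

    // Waits for stream1 only; stream2 may still be executing afterwards.
    cudaStreamSynchronize(stream1);
    *busyAfterStreamSync = (cudaStreamQuery(stream2) == cudaErrorNotReady);

    // Waits for all outstanding work on the device, in every stream.
    cudaDeviceSynchronize();
    bool idleAfterDeviceSync = (cudaStreamQuery(stream2) == cudaSuccess);

    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    return idleAfterDeviceSync;
}
```

On a device that runs the two streams concurrently, `busyAfterStreamSync` would typically come back true, although that depends on the hardware and on scheduling. The return value should be true in any case, because the device-wide call does not return until stream2 has drained.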

Best regards

Jiri Kraus