cudaThreadSynchronize() vs. cudaStreamSynchronize()

Hello,

I have a question regarding cudaThreadSynchronize() and cudaStreamSynchronize(). At the moment I am basically doing the following:

void mySolver()
{
  for (int i = 0; i < maxSteps; ++i)
  {
    // Launch configuration is <<<grid, block, sharedMemBytes, stream>>>
    kernel1<<<dimGrid, dimBlock, 0, stream1>>>();
    kernel2<<<dimGrid, dimBlock, 0, stream2>>>();
    kernel3<<<dimGrid, dimBlock, 0, stream2>>>();
    kernel4<<<dimGrid, dimBlock, 0, stream2>>>();
    kernel5<<<dimGrid, dimBlock, 0, stream2>>>();
    cudaStreamSynchronize(stream1);  // wait for kernel1
    kernel6<<<dimGrid, dimBlock, 0, stream2>>>();
    cudaStreamSynchronize(stream2);  // wait for kernel2 to kernel6
  }
}

and I am timing mySolver() with wall clock time:

startTime = wall_clock();
mySolver();
stopTime = wall_clock();

When I change the parameter maxSteps, I observe the following run times:

maxSteps | wall clock time
---------+----------------
      10 |   0.121 ms
      20 |   0.202 ms
      30 |   0.290 ms
      40 |   0.363 ms
      50 |   0.442 ms
     100 |   0.944 ms
     150 |   2.381 ms
     200 |  31.513 ms
     238 |  53.640 ms

As the computational work per iteration does not change, I cannot really explain that behavior. If I add cudaThreadSynchronize() at the end of mySolver(), the whole thing gets much slower, but the run time scales with the number of iterations more as expected. So my questions are:

  • What exactly does cudaThreadSynchronize() do beyond what cudaStreamSynchronize() does?

  • Why does the run time of the cudaStreamSynchronize() version not scale with the number of iterations?
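
The setup described above can be sketched as a small self-contained program, which may make the behavior easier to reproduce. This is only an illustration under assumptions, not the actual solver: dummyKernel stands in for kernel1 to kernel6, wallClockMs() is an assumed std::chrono replacement for wall_clock(), and timeSolver() and the problem size are hypothetical. It uses cudaDeviceSynchronize(), the current name for the deprecated cudaThreadSynchronize(). Error checking is omitted for brevity.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for kernel1..kernel6: a small, fixed amount of work.
__global__ void dummyKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] = data[idx] * 0.5f + 1.0f;
}

// Wall clock time in milliseconds (assumed equivalent of wall_clock()).
static double wallClockMs()
{
    using namespace std::chrono;
    return duration<double, std::milli>(
               steady_clock::now().time_since_epoch()).count();
}

// Same launch and synchronization pattern as mySolver() in the post.
static void mySolver(int maxSteps, float *d1, float *d2, int n,
                     cudaStream_t stream1, cudaStream_t stream2)
{
    dim3 dimBlock(256);
    dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x);
    for (int i = 0; i < maxSteps; ++i)
    {
        dummyKernel<<<dimGrid, dimBlock, 0, stream1>>>(d1, n);      // kernel1
        for (int k = 0; k < 4; ++k)                                 // kernel2..5
            dummyKernel<<<dimGrid, dimBlock, 0, stream2>>>(d2, n);
        cudaStreamSynchronize(stream1);
        dummyKernel<<<dimGrid, dimBlock, 0, stream2>>>(d2, n);      // kernel6
        cudaStreamSynchronize(stream2);
    }
}

// Times one mySolver() call in milliseconds. If deviceSync is true, the
// host waits for the whole device before the clock is stopped.
static double timeSolver(int maxSteps, bool deviceSync)
{
    const int n = 1 << 20;  // assumed problem size
    float *d1 = nullptr, *d2 = nullptr;
    cudaStream_t stream1, stream2;
    cudaMalloc(&d1, n * sizeof(float));
    cudaMalloc(&d2, n * sizeof(float));
    cudaMemset(d1, 0, n * sizeof(float));
    cudaMemset(d2, 0, n * sizeof(float));
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaDeviceSynchronize();  // start timing from an idle device

    double start = wallClockMs();
    mySolver(maxSteps, d1, d2, n, stream1, stream2);
    if (deviceSync)
        cudaDeviceSynchronize();
    double stop = wallClockMs();

    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    cudaFree(d1);
    cudaFree(d2);
    return stop - start;
}

int main()
{
    const int steps[] = {10, 50, 100, 200};
    std::printf("maxSteps | stream sync only | plus device sync\n");
    for (int s : steps)
        std::printf("%8d | %13.3f ms | %13.3f ms\n",
                    s, timeSolver(s, false), timeSolver(s, true));
    return 0;
}
```

Comparing the two columns for increasing maxSteps should show whether the device-wide synchronization changes the measured time in the same way as in the table above. The absolute numbers will depend on the GPU and on the placeholder kernel.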

Best regards

Jiri Kraus