Hello,
I have a question regarding cudaThreadSynchronize() and cudaStreamSynchronize(). At the moment I am basically doing the following:
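void mySolver()
{
    for (int i = 0; i < maxSteps; ++i)
    {
        // kernel1 is issued to stream1, kernel2..kernel5 to stream2
        kernel1<<<dimGrid, dimBlock, 0, stream1>>>();
        kernel2<<<dimGrid, dimBlock, 0, stream2>>>();
        kernel3<<<dimGrid, dimBlock, 0, stream2>>>();
        kernel4<<<dimGrid, dimBlock, 0, stream2>>>();
        kernel5<<<dimGrid, dimBlock, 0, stream2>>>();
        // wait for kernel1 before launching kernel6
        cudaStreamSynchronize(stream1);
        kernel6<<<dimGrid, dimBlock, 0, stream2>>>();
        // wait for all work in stream2 before the next iteration
        cudaStreamSynchronize(stream2);
    }
}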
and I am timing mySolver() with the wall-clock time:
    startTime = wall_clock();
    mySolver();
    stopTime = wall_clock();
When I change the parameter maxSteps, I observe the following run times:
maxSteps | wall clock time
---------+----------------
      10 |  0.121 ms
      20 |  0.202 ms
      30 |  0.290 ms
      40 |  0.363 ms
      50 |  0.442 ms
     100 |  0.944 ms
     150 |  2.381 ms
     200 | 31.513 ms
     238 | 53.640 ms
As the computational work per iteration does not change, I cannot really explain this behavior. If I add cudaThreadSynchronize() at the end of mySolver(), the whole thing gets much slower, but the run time scales with the number of iterations more as expected. So my questions are:
- What exactly does cudaThreadSynchronize() do beyond what cudaStreamSynchronize() does?
- Why does the run time of the cudaStreamSynchronize() version not scale with the number of iterations?
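For reference, the measurement described above can be reproduced with a small stand-alone sketch like the one below. It is only an approximation of my setup: dummyKernel, timeSolverMs, the buffer size and the launch configuration are hypothetical placeholders (the real kernel1 to kernel6 are not shown here), wall_clock() is replaced by std::chrono::steady_clock, and cudaDeviceSynchronize() is used in place of cudaThreadSynchronize(), which it supersedes in newer CUDA versions. The cudaGetLastError() check is an extra added for the sketch and is not part of the code above.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for kernel1..kernel6 (the real kernels are not shown).
__global__ void dummyKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] = data[idx] * 1.0001f + 1.0f;
}

// Runs the same launch/synchronize pattern as mySolver() and returns the
// elapsed wall-clock time in milliseconds. If deviceSync is true, the whole
// device is synchronized before the clock is stopped.
double timeSolverMs(int maxSteps, bool deviceSync)
{
    const int n = 1 << 20;                       // assumed problem size
    const dim3 dimBlock(256);                    // assumed block size
    const dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x);

    float *buf1 = nullptr, *buf2 = nullptr;
    cudaMalloc(&buf1, n * sizeof(float));
    cudaMalloc(&buf2, n * sizeof(float));
    cudaMemset(buf1, 0, n * sizeof(float));
    cudaMemset(buf2, 0, n * sizeof(float));

    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    // Make sure all setup work has finished before the clock starts.
    cudaDeviceSynchronize();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < maxSteps; ++i)
    {
        dummyKernel<<<dimGrid, dimBlock, 0, stream1>>>(buf1, n);   // "kernel1"
        for (int k = 0; k < 4; ++k)                                // "kernel2..5"
            dummyKernel<<<dimGrid, dimBlock, 0, stream2>>>(buf2, n);
        cudaStreamSynchronize(stream1);
        dummyKernel<<<dimGrid, dimBlock, 0, stream2>>>(buf2, n);   // "kernel6"
        cudaStreamSynchronize(stream2);
    }
    if (deviceSync)
        cudaDeviceSynchronize();
    auto stop = std::chrono::steady_clock::now();

    // Report any launch or execution error seen so far (added for the sketch).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    cudaFree(buf1);
    cudaFree(buf2);

    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main()
{
    const int steps[] = {10, 50, 100, 200};
    for (int maxSteps : steps)
    {
        double streamOnly = timeSolverMs(maxSteps, false);
        double withDeviceSync = timeSolverMs(maxSteps, true);
        printf("maxSteps=%d  stream sync only: %.3f ms  with device sync: %.3f ms\n",
               maxSteps, streamOnly, withDeviceSync);
    }
    return 0;
}
```

Compiled with nvcc, this prints both variants side by side for a few values of maxSteps, which makes it easier to compare the two timing behaviors on a given machine.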
Best regards
Jiri Kraus