I have this question after reading a code example from "A Multi-Stream Example". The code is below:
const int N = 1 << 20;

__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrt(pow(3.14159, i));
    }
}

int main()
{
    const int num_streams = 8;

    cudaStream_t streams[num_streams];
    float *data[num_streams];

    for (int i = 0; i < num_streams; i++) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&data[i], N * sizeof(float));

        // launch one worker kernel per stream
        kernel<<<1, 64, 0, streams[i]>>>(data[i], N);

        // launch a dummy kernel on the default stream
        kernel<<<1, 1>>>(0, 0);
    }

    cudaDeviceReset();
    return 0;
}
As we can see, it calls cudaDeviceReset() at the end of the program.
Then the author compiles the code and uses the tool nvvp to profile all kernel executions. So I suppose that after cudaDeviceReset() returns, all operations (at least the kernels) on the device should have completed. But the documentation doesn't say that cudaDeviceReset() waits for the completion of kernels or any other operations on the device.
So here is my question:
Will cudaDeviceReset() wait for the completion of operations on the device?
It seems evident that it does, at least in the case here. Does that constitute a guarantee, always and forever? I don't know.
It also seems evident that this is not described in the documentation.
For me, personally, when I run into a situation like this and I have a suitable alternative that is documented to have the desired semantics, namely cudaDeviceSynchronize(), I prefer to use the one that is documented.
cudaDeviceSynchronize() will wait for the completion of operations on the device. Specifically, the documentation states:
Blocks until the device has completed all preceding requested tasks.
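To make that concrete, here is a minimal sketch of the pattern I mean: synchronize explicitly before the reset so you rely only on documented behavior. (The kernel and launch configuration here are placeholders I made up, not the original example's code.)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial placeholder kernel, just to have pending device work
__global__ void worker(float *x, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) x[i] = 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    worker<<<256, 256>>>(d, n);

    // Documented to block until the device has completed all preceding
    // requested tasks; it also surfaces any asynchronous error.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "sync failed: %s\n", cudaGetErrorString(err));

    cudaFree(d);
    cudaDeviceReset();
    return 0;
}
```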
Another possible option is to file a bug asking for a doc clarification.
Following your recommendation, I tried to use cudaDeviceSynchronize() but ran into another question. Please tell me to open another topic if needed.
The description of cudaDeviceSynchronize() in the documentation states:
Blocks until the device has completed all preceding requested tasks. cudaDeviceSynchronize() returns an error if one of the preceding tasks has failed. If the cudaDeviceScheduleBlockingSync flag was set for this device, the host thread will block until the device has finished its work.
I am confused about the usage of the flag cudaDeviceScheduleBlockingSync. (I have read a StackOverflow answer and another forum topic, but I hope to confirm my understanding.)
My understanding is:
Without cudaDeviceScheduleBlockingSync, the host thread blocks until the device has completed all preceding requested tasks. The key is "preceding requested tasks" rather than "all tasks".
With cudaDeviceScheduleBlockingSync, the host thread blocks until the device has completed all tasks.
For instance, when the host thread (threadA) calls cudaDeviceSynchronize(), there are already taskA and taskB waiting for execution on device#0. Then, after threadA blocks, another host thread (threadB) issues taskC to device#0.
In this case,
Without cudaDeviceScheduleBlockingSync, threadA blocks until taskA, taskB are completed.
With cudaDeviceScheduleBlockingSync, threadA blocks until taskA, taskB, and taskC are completed.
No, I don't think your understanding is correct. The definition I know of for cudaDeviceScheduleBlockingSync is given here:
Instruct CUDA to block the CPU thread on a synchronization primitive when waiting for the device to finish work.
And we've already covered the definition of cudaDeviceSynchronize(). It doesn't refer to anything other than preceding requested tasks.
You have created some distinction between “all preceding requested tasks” (which is correct) and “all tasks”, whatever that means. I don’t find a mention of “all tasks” whatever that means, in any of the links you have provided.
cudaDeviceSynchronize() blocks until all preceding requested tasks have been completed, just like it says in the documentation.
The schedule flag affects the CPU thread's behavior while that block is occurring. That varying host-thread behavior is covered to some degree in the doc link I provided, as well as in a forum link you provided from tera.
None of that draws a distinction between “preceding requested tasks” and “all tasks”. I don’t find that distinction anywhere and I reject using it for understanding or definition. It looks made-up to me.
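To illustrate the point, here is a sketch (my own minimal example, assuming a single device): the flag changes how the host thread waits, not which tasks it waits for. Note that cudaSetDeviceFlags() must be called before the CUDA context is created on the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin() { }  // trivial placeholder kernel

int main()
{
    // Must be set before the context is created on this device.
    // With this flag, the waiting CPU thread blocks on an OS synchronization
    // primitive (yielding the core) instead of spin-waiting.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    spin<<<1, 1>>>();

    // With or without the flag, this returns once all preceding requested
    // tasks (here: the single kernel launch) have completed.
    cudaError_t err = cudaDeviceSynchronize();
    printf("sync returned: %s\n", cudaGetErrorString(err));
    return 0;
}
```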