Hello, I’m stumped by a strange issue: even issuing a cudaEventRecord() to a stream prevents tasks from executing in parallel. I was already surprised that cudaMemcpyAsync() and cudaMemsetAsync() prevent everything issued after them from running before they complete, but at least that behavior is documented in the CUDA programming guide. What makes no sense to me is why cudaEventRecord() does this too.
I’m using CUDA 5 with a GTX 560 Ti.
#include <cassert>
#include <cstdint>
#include <cuda_runtime.h>

// Dummy kernel: just streams data through shared memory to generate GPU work.
__global__ void Dummy(uint16_t *data, int n)
{
    __shared__ uint16_t scratch[512];
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        scratch[i % 512] = data[i];
}
void TestOverlap()
{
    uint16_t *data;
    const int N = 80000000;
    assert(cudaMalloc(&data, N * sizeof(uint16_t)) == cudaSuccess);

    cudaStream_t stream1, stream2;
    assert(cudaStreamCreate(&stream1) == cudaSuccess);
    assert(cudaStreamCreate(&stream2) == cudaSuccess);

    cudaEvent_t stream1Event;
    assert(cudaEventCreate(&stream1Event) == cudaSuccess);

    for (int repeat = 0; repeat < 2; ++repeat)
    {
        // Kernel goes to stream1; the memsets below go to stream2,
        // so I expect them to overlap.
        Dummy<<<9, 64, 0, stream1>>>(data, 100000);

        // Uncommenting this single line serializes everything:
        //assert(cudaEventRecord(stream1Event, stream1) == cudaSuccess);

        for (int i = 0; i < 9; ++i)
            cudaMemsetAsync(&data[i * 1000000], i, 10000000, stream2);
    }

    // cudaThreadSynchronize() is deprecated; cudaDeviceSynchronize() is equivalent.
    assert(cudaDeviceSynchronize() == cudaSuccess);
    cudaDeviceReset();
}
Here’s the timeline from the CUDA Visual Profiler:
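One thing I haven’t tried yet is creating the event with the cudaEventDisableTiming flag, on the guess that timestamp collection is what forces the serialization. A minimal sketch of just that change (untested; that the flag helps at all is purely my assumption):

    // Create the event without timing support; such events are
    // lighter-weight since they carry no timestamp.
    cudaEvent_t stream1Event;
    assert(cudaEventCreateWithFlags(&stream1Event, cudaEventDisableTiming) == cudaSuccess);

    // Then record it exactly as before, between the kernel launch and the memsets:
    assert(cudaEventRecord(stream1Event, stream1) == cudaSuccess);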
Also, can someone explain why a cudaMemsetAsync() or cudaMemcpyAsync() issued to a different stream should act as a synchronization point, i.e. prevent kernels issued after it from executing before it completes? Conceptually I’m issuing it to another, parallel stream, so why synchronize?
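On the memset side, the workaround I’m considering is replacing cudaMemsetAsync() with a trivial memset kernel launched into stream2, since plain kernel launches in different streams should be able to overlap on a compute 2.x device. A rough sketch (MemsetKernel is my own hypothetical replacement, not a CUDA API, and I haven’t verified that it actually restores concurrency):

    // Hypothetical byte-wise memset kernel, written by me as a stand-in
    // for cudaMemsetAsync().
    __global__ void MemsetKernel(uint8_t *dst, uint8_t value, size_t n)
    {
        // Grid-stride loop: each thread fills every (gridDim.x * blockDim.x)-th byte.
        size_t stride = (size_t)gridDim.x * blockDim.x;
        for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            dst[i] = value;
    }

    // Drop-in replacement for cudaMemsetAsync(&data[i * 1000000], i, 10000000, stream2):
    MemsetKernel<<<64, 256, 0, stream2>>>(
        reinterpret_cast<uint8_t *>(&data[i * 1000000]), (uint8_t)i, 10000000);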
Thank you for any help.