Sanity check: when do I need to synchronize kernel launches?

Consider a simple three-step procedure:

// create event A
cudaMemcpyAsync(d_depthData, h_depthData, numBytes, cudaMemcpyHostToDevice, stream);
// sync event A
// create event B
// using the depth data we just copied, compute the positions
computePositionsKernel<<< grid, block, 0, stream >>>(d_depthData, d_positions);
// sync event B
// using the positions we just computed, compute the normals
computeNormalsKernel<<< grid, block, 0, stream >>>(d_positions, d_normals);
  1. Does “event A” need to be synchronized? I use an asynchronous copy because I want everything to run on this stream. I can no longer find the original source, but I distinctly remember reading or watching something that explained that after an asynchronous copy, the next kernel that uses the destination buffer implicitly waits until the data is ready. Is that true? Nothing in the official documentation seems to describe this behavior.

  2. If (1) does not need to be synchronized, does event B need to be synchronized?

I’ve removed the synchronization code and everything works the same, but I suspect that is only because my (low-end) GPU cannot actually run kernels concurrently.

If there is a data-dependence between two kernels, is it correct to assume that I should ALWAYS be synchronizing?

Thank you for any sanity checks, I want to make sure my code works for people who have real GPUs as well ;)

An event needs to be recorded (with cudaEventRecord), not just created, before it can be used for anything.
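
To make the create-versus-record distinction concrete, here is a minimal sketch. The helper name copyAndWaitOnEvent and its parameters are made up for illustration and are not from the original post; only the CUDA runtime calls are real. The event only starts to mean something once cudaEventRecord places it in the stream behind the copy.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): queues an async copy,
// records an event behind it, and blocks the host until that event completes.
// Returns true if every runtime call succeeded.
bool copyAndWaitOnEvent(const float* h_src, float* d_dst, size_t numBytes,
                        cudaStream_t stream)
{
    cudaEvent_t copyDone;
    // Creating the event only allocates it; it does not mark any point in the stream yet.
    if (cudaEventCreate(&copyDone) != cudaSuccess) return false;

    cudaError_t status = cudaMemcpyAsync(d_dst, h_src, numBytes,
                                         cudaMemcpyHostToDevice, stream);

    // Recording places the event in the stream after the copy queued above.
    if (status == cudaSuccess) status = cudaEventRecord(copyDone, stream);

    // The host blocks here until all work captured by the recorded event is done.
    if (status == cudaSuccess) status = cudaEventSynchronize(copyDone);

    cudaEventDestroy(copyDone);
    return status == cudaSuccess;
}
```

As far as I know, calling cudaEventSynchronize on an event that was created but never recorded returns cudaSuccess immediately, so a "sync" on an unrecorded event waits for nothing.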

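All CUDA activity issued to a particular stream will serialize. Always. Since your copy and both kernels are issued to the same stream, each one waits for the previous one to finish, so neither “event A” nor “event B” needs an explicit synchronization.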

“A stream is a sequence of commands (possibly issued by different host threads) that execute in order.”
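
Here is a rough sketch of the three-step procedure relying only on stream order. The kernel bodies, the extra n parameter, and the function name runPipelineInOneStream are placeholders I made up so the example is self-contained; the real computePositionsKernel and computeNormalsKernel are not shown in the question. The only synchronization is the one at the end, before the host reads the result.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Placeholder kernel (assumption): stands in for the real position computation.
__global__ void computePositionsKernel(const float* depth, float* positions, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) positions[i] = 2.0f * depth[i];
}

// Placeholder kernel (assumption): stands in for the real normal computation.
__global__ void computeNormalsKernel(const float* positions, float* normals, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) normals[i] = positions[i] + 1.0f;
}

// Hypothetical driver: returns true if every output equals 2 * depth + 1.
bool runPipelineInOneStream(int n)
{
    std::vector<float> h_depth(n), h_normals(n, 0.0f);
    for (int i = 0; i < n; ++i) h_depth[i] = static_cast<float>(i);

    const size_t numBytes = static_cast<size_t>(n) * sizeof(float);
    float *d_depth = nullptr, *d_positions = nullptr, *d_normals = nullptr;
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return false;
    bool ok = cudaMalloc(&d_depth, numBytes) == cudaSuccess
           && cudaMalloc(&d_positions, numBytes) == cudaSuccess
           && cudaMalloc(&d_normals, numBytes) == cudaSuccess;

    if (ok) {
        const int block = 256;
        const int grid = (n + block - 1) / block;

        // All four operations go into the same stream, so they execute in issue order.
        // No event or sync is placed between them.
        cudaMemcpyAsync(d_depth, h_depth.data(), numBytes, cudaMemcpyHostToDevice, stream);
        computePositionsKernel<<<grid, block, 0, stream>>>(d_depth, d_positions, n);
        computeNormalsKernel<<<grid, block, 0, stream>>>(d_positions, d_normals, n);
        cudaMemcpyAsync(h_normals.data(), d_normals, numBytes, cudaMemcpyDeviceToHost, stream);

        // The host must wait before reading h_normals; this is the only sync needed.
        ok = cudaStreamSynchronize(stream) == cudaSuccess;
    }

    // Compare against the same arithmetic done on the host.
    for (int i = 0; ok && i < n; ++i)
        ok = (h_normals[i] == 2.0f * h_depth[i] + 1.0f);

    cudaFree(d_depth);
    cudaFree(d_positions);
    cudaFree(d_normals);
    cudaStreamDestroy(stream);
    return ok;
}
```

If the two kernels were launched into different streams instead, this ordering guarantee would no longer hold between them, and you would need something like a recorded event plus cudaStreamWaitEvent to express the dependence.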