cudaStream with Managed memory

I have 3 __global__ functions that write to different parts of an array, so I gave CUDA streams a try and created 3 of them in the form:

cudaStreamCreateWithFlags(&streamX, cudaStreamNonBlocking);

Then I profiled the program and noticed that the kernels were not executing concurrently; they were running serially instead.
I suspect this is because of the need for cudaDeviceSynchronize() after each kernel call that uses managed memory, which essentially nullifies the benefit of streams.
Is this guess correct? The examples I found only use the non-managed style. I also tried the regular:

cudaStreamCreate(&streamX);

but it makes no difference.

You only need a cudaDeviceSynchronize() after a kernel call (in the managed memory case) if you intend to use the data in host code immediately after the kernel call. If you are launching 3 kernels back-to-back, there is no need to put a cudaDeviceSynchronize() in between them.

Yes, if you use cudaDeviceSynchronize() in between kernel calls (or between any sections of code), there is no opportunity for device concurrency of any kind between those sections.
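As a rough sketch of the pattern (the kernel, array, and size names below are placeholders of my own, not taken from your code), launching into the streams back-to-back and synchronizing only once, before the host reads the managed data, would look like this:

#include <cstdio>

// Trivial kernel: each launch fills one disjoint third of the array.
__global__ void fill(float *a, int offset, int count, float v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) a[offset + i] = v;
}

int main()
{
    const int n = 3 * (1 << 20);
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));

    cudaStream_t s[3];
    for (int i = 0; i < 3; ++i)
        cudaStreamCreateWithFlags(&s[i], cudaStreamNonBlocking);

    const int third = n / 3;
    for (int i = 0; i < 3; ++i)   // back-to-back launches, no sync in between
        fill<<<(third + 255) / 256, 256, 0, s[i]>>>(data, i * third, third, (float)i);

    cudaDeviceSynchronize();      // one sync, only because the host reads next
    printf("%f %f %f\n", data[0], data[third], data[2 * third]);

    for (int i = 0; i < 3; ++i) cudaStreamDestroy(s[i]);
    cudaFree(data);
    return 0;
}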

Kernel concurrency (one kernel overlapping with another) is generally hard to witness anyway. It needs kernels that execute for a long enough period of time (at least longer than the launch latency) but also have relatively limited resource usage.
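For instance, a contrived sketch like this (entirely my own, just to illustrate those two conditions) makes overlap visible in a profiler timeline: each kernel busy-waits far longer than the launch latency, and each occupies only a single small block:

__global__ void spin(long long cycles)
{
    // Busy-wait on the device clock: long runtime, negligible resource usage.
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    spin<<<1, 32, 0, s1>>>(100000000LL);   // roughly tens of ms on most GPUs
    spin<<<1, 32, 0, s2>>>(100000000LL);   // should overlap with the first

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}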

Kernel concurrency is a last-ditch effort to get more performance out of the machine. It’s far better to organize your code into kernels that will saturate the machine. Such kernels won’t overlap, but since you are saturating the machine, there would be essentially nothing to be gained from overlap.

Thanks for the considerations on cudaStream, Robert.
I finished writing and testing the kernels that deal with the head and tail of an array processed by a stencil, and I considered running them in streams. But their running time is so short that I don't think the microseconds saved would be noticed.

Regarding cudaDeviceSynchronize(), what you mentioned is exactly what I learned from reading everything within my reach, but I'm experiencing something different in one particular case. I have a function that wraps 5 kernels comprising a bigger computation, and if I don't sync after each kernel (with the last sync coming before the cudaFree calls), the program crashes.

The VS debugger shows an unhandled exception, while cuda-memcheck says:

========= Error: process didn't terminate successfully
=========        The application may have hit an error when dereferencing Unified Memory from the host. Please rerun the application under a host debugger to catch such errors.

I don’t know if the fact that all of my CUDA code is packed in a DLL (compiled by VS) and called by the program that has the graphical interface (compiled by GCC) has anything to do with this behavior. But no, I’m not accessing the managed arrays from the host after a kernel; I’m just passing them on and on. I thought I was done debugging the stencil…

Maybe you are exceeding the WDDM TDR timeout limit.

I have the timeout set to around 20 seconds so that cuda-memcheck and nvvp can work comfortably.
The kernels are called in this order:

reduction_float <bl_Size> <<< gr_Size, bl_Size, smem_Size >>> (array_float_in, temp_double_out, length);
reduction_double <bl_Size> <<< 1, bl_Size, smem_Size >>> (temp_double_out, final_double_reduct, gr_Size);
cudaDeviceSynchronize();

stencil_head_array <<< gr_Size, bl_Size, smem_Size >>> (array_float_in, array_float_out, final_double_reduct);
cudaDeviceSynchronize();

stencil_middle_array <<< gr_Size, bl_Size, smem_Size >>> (array_float_in, array_float_out, final_double_reduct);
cudaDeviceSynchronize();

stencil_tail_array <<< gr_Size, bl_Size, smem_Size >>> (array_float_in, array_float_out, final_double_reduct);
cudaDeviceSynchronize();

Notice that I don’t sync between the two reductions, and that works fine (as it should anyway!). So maybe something is already being propagated at that point and the subsequent syncs just hide the problem.
What I will do today is use a macro shared by Njuffa to catch kernel launch errors. No keyboards were harmed (yet) in previous debugging days…
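The usual shape of such a macro is roughly this (my own paraphrase from memory, not necessarily Njuffa's exact version): check the launch error right away, then synchronize and check the execution error:

#include <cstdio>
#include <cstdlib>

#define CHECK_KERNEL() \
    do { \
        cudaError_t e = cudaGetLastError();        /* launch errors */ \
        if (e != cudaSuccess) { \
            fprintf(stderr, "launch error: %s at %s:%d\n", \
                    cudaGetErrorString(e), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
        e = cudaDeviceSynchronize();               /* execution errors */ \
        if (e != cudaSuccess) { \
            fprintf(stderr, "kernel error: %s at %s:%d\n", \
                    cudaGetErrorString(e), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

// Usage: some_kernel<<< grid, block >>>(args); CHECK_KERNEL();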

After some hours of investigating: cudaGetLastError() returns code 0 (cudaSuccess) every time I call one of these kernels. But if I remove any of these cudaDeviceSynchronize() calls, the VS debugger catches an unhandled exception.

In a few months I will hopefully have most of the code working and be ready to compile the whole thing on Linux, which will be a much easier task since only GCC will be needed for the CUDA code and GTKmm, as opposed to VS + GCC on Windows. I also won't need to create any external lib; then I'll try again without these syncs.

Time to leave this thing alone for a moment and move on to the next task.
Thanks for your inputs, TxBob.