cudaStream with Managed memory

I have 3 __global__ functions that write to different parts of an array, so I gave CUDA streams a try and created 3 of them in the form:

cudaStreamCreateWithFlags(&streamX, cudaStreamNonBlocking);

Then I profiled the program and noticed that the kernels were not executing concurrently; they were running serially instead.
I suspect this is because of the need for cudaDeviceSynchronize() after each kernel call that uses managed memory, which essentially nullifies the benefit of streams.
Is this guess correct? The examples I found only use the non-managed style. I also tried the regular:

cudaStreamCreate(&streamX);

but it makes no difference.

You only need a cudaDeviceSynchronize() after a kernel call (in the managed memory case) if you intend to use the data in host code immediately after the kernel call. If you are launching 3 kernels back-to-back, there is no need to put a cudaDeviceSynchronize() in between them.

Yes, if you use cudaDeviceSynchronize() in between kernel calls (or between any sections of code), there is no opportunity for device concurrency of any kind between those sections.
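As a rough sketch of the pattern (the kernel, array, and size names below are placeholders of my own, not taken from your code), launching into the streams back-to-back and synchronizing only once, before the host reads the managed data, would look like this:

#include <cstdio>

// Trivial kernel: each launch fills one disjoint third of the array.
__global__ void fill(float *a, int offset, int count, float v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) a[offset + i] = v;
}

int main()
{
    const int n = 3 * (1 << 20);
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));

    cudaStream_t s[3];
    for (int i = 0; i < 3; ++i)
        cudaStreamCreateWithFlags(&s[i], cudaStreamNonBlocking);

    const int third = n / 3;
    for (int i = 0; i < 3; ++i)   // back-to-back launches, no sync in between
        fill<<<(third + 255) / 256, 256, 0, s[i]>>>(data, i * third, third, (float)i);

    cudaDeviceSynchronize();      // one sync, only because the host reads next
    printf("%f %f %f\n", data[0], data[third], data[2 * third]);

    for (int i = 0; i < 3; ++i) cudaStreamDestroy(s[i]);
    cudaFree(data);
    return 0;
}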

Kernel concurrency (one kernel overlapping with another) is generally hard to witness anyway. It needs kernels that execute for a long enough period of time (at least longer than the launch latency) but also have relatively limited resource usage.
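For instance, a contrived sketch like this (entirely my own, just to illustrate those two conditions) makes overlap visible in a profiler timeline: each kernel busy-waits far longer than the launch latency, and each occupies only a single small block:

__global__ void spin(long long cycles)
{
    // Busy-wait on the device clock: long runtime, negligible resource usage.
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    spin<<<1, 32, 0, s1>>>(100000000LL);   // roughly tens of ms on most GPUs
    spin<<<1, 32, 0, s2>>>(100000000LL);   // should overlap with the first

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}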

Kernel concurrency is a last-ditch effort to get more performance out of the machine. It’s far better to organize your code into kernels that will saturate the machine. Such kernels won’t overlap, but since you are saturating the machine, there would be essentially nothing to be gained from overlap.

Thanks for the considerations on cudaStream, Robert.
I finished writing and testing the kernels that deal with the head and tail of an array processed by a stencil, and I considered running them in streams. But their running time is so short that I don't think the microseconds saved would be noticed.

Regarding cudaDeviceSynchronize(), what you mentioned is exactly what I learned from reading everything within my reach, but I'm experiencing something different in one particular case. I have a function that wraps 5 kernels comprising a bigger computation, and if I don't sync after each kernel (with the last sync coming before the cudaFree calls), the program crashes.

The VS debugger shows an unhandled exception, while cuda-memcheck says:

========= Error: process didn't terminate successfully
=========        The application may have hit an error when dereferencing Unified Memory from the host. Please rerun the application under a host debugger to catch such errors.

I don’t know if the fact that all of my CUDA code is packed in a DLL (compiled by VS) and called by the program that has the graphical interface (compiled by GCC) has anything to do with this behavior. But no, I’m not accessing the managed arrays from the host after a kernel; I’m just passing them on and on. I thought I was done debugging the stencil…

Maybe you are exceeding the WDDM TDR timeout limit.

I have the timeout set to around 20 seconds so that cuda-memcheck and nvvp can work comfortably.
The kernels are called in this order:

reduction_float <bl_Size> <<< gr_Size, bl_Size, smem_Size >>> (array_float_in, temp_double_out, length);
reduction_double <bl_Size> <<< 1, bl_Size, smem_Size >>> (temp_double_out, final_double_reduct, gr_Size);
cudaDeviceSynchronize();

stencil_head_array <<< gr_Size, bl_Size, smem_Size >>> (array_float_in, array_float_out, final_double_reduct);
cudaDeviceSynchronize();

stencil_middle_array <<< gr_Size, bl_Size, smem_Size >>> (array_float_in, array_float_out, final_double_reduct);
cudaDeviceSynchronize();

stencil_tail_array <<< gr_Size, bl_Size, smem_Size >>> (array_float_in, array_float_out, final_double_reduct);
cudaDeviceSynchronize();

Notice that I don’t sync between the two reductions, and that works fine (as it should anyway!). So maybe something is already being propagated at that point and the subsequent syncs just hide the problem.
What I will do today is use a macro shared by Njuffa to catch kernel launch errors. No keyboards were harmed (yet) in previous debugging days…
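The usual shape of such a macro is roughly this (my own paraphrase from memory, not necessarily Njuffa's exact version): check the launch error right away, then synchronize and check the execution error:

#include <cstdio>
#include <cstdlib>

#define CHECK_KERNEL() \
    do { \
        cudaError_t e = cudaGetLastError();        /* launch errors */ \
        if (e != cudaSuccess) { \
            fprintf(stderr, "launch error: %s at %s:%d\n", \
                    cudaGetErrorString(e), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
        e = cudaDeviceSynchronize();               /* execution errors */ \
        if (e != cudaSuccess) { \
            fprintf(stderr, "kernel error: %s at %s:%d\n", \
                    cudaGetErrorString(e), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

// Usage: some_kernel<<< grid, block >>>(args); CHECK_KERNEL();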

After some hours of investigating: cudaGetLastError() returns code 0 (cudaSuccess) every time I call one of these kernels. But if I remove any of these cudaDeviceSynchronize() calls, the VS debugger catches an unhandled exception.

In a few months I will hopefully have most of the code working and be ready to compile the whole thing on Linux, which will be a much easier task since only GCC will be needed for the CUDA code and GTKmm, as opposed to VS + GCC on Windows. I also won't need to create any external lib; then I'll try again without these syncs.

Time to leave this thing alone for a moment and move on to the next task.
Thanks for your inputs, TxBob.