CUDA kernels using unified memory fail with many blocks (but work with few)

I’m having trouble understanding why my kernels fail to run to completion when I submit roughly 20,000 blocks for execution while using the unified memory model. The number is approximate - sometimes it fails with a little more, sometimes with a little less.
I have verified that there are no bugs in the kernel code: I am able to submit all 20,000 blocks one by one and run all the way through, obtaining the complete and correct result (which obviously takes a long time).
I do use pointers to pointers in my data structures, so they are somewhat non-trivial. They are all set up correctly, as indicated above, and I am 100% sure of that. It would be an incredible pain to achieve this using cudaMemcpy, although I have tried it and it worked.
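
For illustration, this is roughly the kind of nested (pointer-to-pointer) structure I mean, set up with cudaMallocManaged. The names and sizes below are made up for this post, not my actual code:

```
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical nested structure: a table whose rows are themselves
// separate managed allocations, reachable from host and device code.
struct Table {
    int    numRows;
    int    rowLen;
    float **rows;    // pointer to an array of row pointers
};

__global__ void scaleAll(Table *t, float factor)
{
    int row = blockIdx.x;
    int col = threadIdx.x;
    if (row < t->numRows && col < t->rowLen)
        t->rows[row][col] *= factor;
}

int main()
{
    Table *t = nullptr;
    cudaMallocManaged(&t, sizeof(Table));
    t->numRows = 20000;              // roughly 20,000 blocks, as described above
    t->rowLen  = 256;

    cudaMallocManaged(&t->rows, t->numRows * sizeof(float *));
    for (int i = 0; i < t->numRows; ++i) {
        cudaMallocManaged(&t->rows[i], t->rowLen * sizeof(float));
        for (int j = 0; j < t->rowLen; ++j)
            t->rows[i][j] = 1.0f;
    }

    scaleAll<<<t->numRows, t->rowLen>>>(t, 2.0f);
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s, sample value: %f\n",
           cudaGetErrorString(err), t->rows[0][0]);
    return 0;
}
```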

Are there limitations on the number of blocks, or on the depth or complexity of pointer structures, etc., when using unified memory?

Thanks!

There shouldn’t be. If you are on a GPU that is also hosting a display (or a Windows WDDM GPU, even if it is not hosting a display), you may simply be hitting a kernel timeout.

Rigorous CUDA error checking and running your code under cuda-memcheck will usually shed some light on the situation.
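
As a rough sketch of the pattern (the kernel and names below are placeholders, not taken from your code): check the launch itself with cudaGetLastError() and the execution with the return value of cudaDeviceSynchronize():

```
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel just to illustrate the checking pattern;
// it is not the poster's actual kernel.
__global__ void dummyKernel(int *data) { data[threadIdx.x] = threadIdx.x; }

int main()
{
    int *data = nullptr;
    cudaMallocManaged(&data, 256 * sizeof(int));

    dummyKernel<<<1, 256>>>(data);

    // Catch launch/configuration errors, reported immediately.
    cudaError_t launchErr = cudaGetLastError();
    if (launchErr != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(launchErr));

    // Catch errors that occur while the kernel runs, e.g. an illegal
    // memory access or a WDDM TDR reset.
    cudaError_t syncErr = cudaDeviceSynchronize();
    if (syncErr != cudaSuccess)
        printf("execution failed: %s\n", cudaGetErrorString(syncErr));

    cudaFree(data);
    return 0;
}
```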

Thank you very much for the answer!

I was re-writing the calculations to use unified memory after initially proving the feasibility of the procedure with cudaMemcpy and the host/device memory model, which worked great. I was able to run calculations for 10 or 20 seconds at a time on that GPU (a 1070) with tens of thousands of blocks, without any issues. That gave me the confidence to restructure the ugly code to take advantage of unified memory. However, I have been unable to run the calculations successfully so far.

I am using VS 2017 and I did run cuda-memcheck on the debug version of the executable, but it provides little insight. Debugging is somewhat tedious - I basically have to modify the code and re-run it to see whether the change makes any difference.

How do I make sure I’m not hitting the kernel timeout? Can I take the GPU out of the equation for Windows and use it only for calculations?

What would be an example of rigorous error checking? Or an example of reading the output of cuda-memcheck to pin down the line of kernel code that triggers the issue?

Thanks!

WDDM TDR timeout:

[url]https://docs.nvidia.com/gameworks/content/developertools/desktop/nsight/timeout_detection_recovery.htm[/url]
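
As a quick check (a sketch, not something from the linked page), you can also query from code whether the driver enforces a run-time limit on kernels for a given device, via the kernelExecTimeoutEnabled property:

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 assumed here

    // On a WDDM GPU subject to the TDR watchdog this is typically 1;
    // on a TCC-mode or dedicated compute GPU it is typically 0.
    printf("%s: kernelExecTimeoutEnabled = %d\n",
           prop.name, prop.kernelExecTimeoutEnabled);
    return 0;
}
```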

rigorous error checking:

What is the canonical way to check for errors using the CUDA runtime API? - Stack Overflow
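
The pattern from that answer is roughly the following wrapper around every runtime API call (reproduced here from memory as a sketch, so double-check it against the linked answer):

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime API call so that a failure is reported
// with the file and line where it happened.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line,
                      bool abort = true)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

int main()
{
    float *p = nullptr;
    gpuErrchk(cudaMallocManaged(&p, 1024 * sizeof(float)));
    // ... launch kernels here, then:
    gpuErrchk(cudaPeekAtLastError());      // launch errors
    gpuErrchk(cudaDeviceSynchronize());    // execution errors
    gpuErrchk(cudaFree(p));
    return 0;
}
```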

If you haven’t done anything with the TDR system on Windows and you are running on a GeForce GPU, you will hit the timeout when your kernel starts to take about 2 seconds or longer.

My guess right now is that you are hitting a WDDM TDR timeout.

If cuda-memcheck reports an illegal access error, this is a useful debugging method for that:

cuda - Unspecified launch failure on Memcpy - Stack Overflow

Thank you so much, I’ll make a thorough study of the issues you pointed out. This information is very helpful to me at this point.