CUDA kernels using unified memory fail with many blocks (but work with few)

I’m having trouble understanding why my kernels fail to run to completion when I submit roughly 20,000 blocks for execution while using the unified memory model. The number is approximate - sometimes it fails with a little more, sometimes with a little less.
I have verified that there are no bugs in the kernel code: I am able to submit all 20,000 blocks one by one and run all the way through, obtaining the complete and correct result (which obviously takes a long time).
I do use pointers to pointers in my data structures, so they are somewhat non-trivial. They are all set up correctly, as indicated above, and I am 100% sure of that. It would be an incredible pain to achieve this using cudaMemcpy, although I have tried it and it worked.
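
For illustration, this is roughly the kind of nested (pointer-to-pointer) structure I mean, set up with cudaMallocManaged. The names and sizes below are made up for this post, not my actual code:

```
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical nested structure: a table whose rows are themselves
// separate managed allocations, reachable from host and device code.
struct Table {
    int    numRows;
    int    rowLen;
    float **rows;    // pointer to an array of row pointers
};

__global__ void scaleAll(Table *t, float factor)
{
    int row = blockIdx.x;
    int col = threadIdx.x;
    if (row < t->numRows && col < t->rowLen)
        t->rows[row][col] *= factor;
}

int main()
{
    Table *t = nullptr;
    cudaMallocManaged(&t, sizeof(Table));
    t->numRows = 20000;              // roughly 20,000 blocks, as described above
    t->rowLen  = 256;

    cudaMallocManaged(&t->rows, t->numRows * sizeof(float *));
    for (int i = 0; i < t->numRows; ++i) {
        cudaMallocManaged(&t->rows[i], t->rowLen * sizeof(float));
        for (int j = 0; j < t->rowLen; ++j)
            t->rows[i][j] = 1.0f;
    }

    scaleAll<<<t->numRows, t->rowLen>>>(t, 2.0f);
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s, sample value: %f\n",
           cudaGetErrorString(err), t->rows[0][0]);
    return 0;
}
```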

Are there limitations on the number of blocks, or on the depth or complexity of pointer structures, etc., when using unified memory?

Thanks!

There shouldn’t be. If you are on a GPU that is also hosting a display (or a Windows WDDM GPU, even if it is not hosting a display), you may simply be hitting a kernel timeout.

Rigorous CUDA error checking and running your code under cuda-memcheck will usually shed some light on the situation.
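
As a rough sketch of the pattern (the kernel and names below are placeholders, not taken from your code): check the launch itself with cudaGetLastError() and the execution with the return value of cudaDeviceSynchronize():

```
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel just to illustrate the checking pattern;
// it is not the poster's actual kernel.
__global__ void dummyKernel(int *data) { data[threadIdx.x] = threadIdx.x; }

int main()
{
    int *data = nullptr;
    cudaMallocManaged(&data, 256 * sizeof(int));

    dummyKernel<<<1, 256>>>(data);

    // Catch launch/configuration errors, reported immediately.
    cudaError_t launchErr = cudaGetLastError();
    if (launchErr != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(launchErr));

    // Catch errors that occur while the kernel runs, e.g. an illegal
    // memory access or a WDDM TDR reset.
    cudaError_t syncErr = cudaDeviceSynchronize();
    if (syncErr != cudaSuccess)
        printf("execution failed: %s\n", cudaGetErrorString(syncErr));

    cudaFree(data);
    return 0;
}
```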

Thank you very much for the answer!

I was re-writing the calculations to use unified memory after initially proving the feasibility of the procedure with cudaMemcpy and the host/device memory model, which worked great. I was able to run calculations for 10 or 20 seconds at a time on that GPU (a 1070) with tens of thousands of blocks, without any issues. That gave me the confidence to restructure the ugly code to take advantage of unified memory. However, I have been unable to run the calculations successfully so far.

I am using VS 2017 and I did run cuda-memcheck on the debug version of the executable, but it provides little insight. Debugging is somewhat tedious - I basically have to modify the code and re-run it to see whether the change makes any difference.

How do I make sure I’m not hitting the kernel timeout? Can I take the GPU out of the equation for Windows and use it only for calculations?

What would be an example of rigorous error checking? Or an example of reading the output of cuda-memcheck to pin down the line of kernel code that triggers the issue?

Thanks!

WDDM TDR timeout:

[url]https://docs.nvidia.com/gameworks/content/developertools/desktop/nsight/timeout_detection_recovery.htm[/url]
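
As a quick check (a sketch, not something from the linked page), you can also query from code whether the driver enforces a run-time limit on kernels for a given device, via the kernelExecTimeoutEnabled property:

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 assumed here

    // On a WDDM GPU subject to the TDR watchdog this is typically 1;
    // on a TCC-mode or dedicated compute GPU it is typically 0.
    printf("%s: kernelExecTimeoutEnabled = %d\n",
           prop.name, prop.kernelExecTimeoutEnabled);
    return 0;
}
```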

rigorous error checking:

What is the canonical way to check for errors using the CUDA runtime API? - Stack Overflow
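
The pattern from that answer is roughly the following wrapper around every runtime API call (reproduced here from memory as a sketch, so double-check it against the linked answer):

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime API call so that a failure is reported
// with the file and line where it happened.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line,
                      bool abort = true)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

int main()
{
    float *p = nullptr;
    gpuErrchk(cudaMallocManaged(&p, 1024 * sizeof(float)));
    // ... launch kernels here, then:
    gpuErrchk(cudaPeekAtLastError());      // launch errors
    gpuErrchk(cudaDeviceSynchronize());    // execution errors
    gpuErrchk(cudaFree(p));
    return 0;
}
```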

If you haven’t done anything with the TDR system on Windows and you are running on a GeForce GPU, you will hit the timeout when your kernel starts to take about 2 seconds or longer.

My guess right now is that you are hitting a WDDM TDR timeout.

If cuda-memcheck reports an illegal access error, this is a useful debugging method for that:

cuda - Unspecified launch failure on Memcpy - Stack Overflow

Thank you so much, I’ll make a thorough study of the issues you pointed out. This information is very helpful to me at this point.