Error in cudaMemcpyAsync


I am testing an application that calculates the Strongly Connected Components of several graphs in parallel. Each graph has its own stream to offload the calculation and retrieve the result. For a few small graphs everything works, but when I increase the number and size of the graphs, after some minutes of computation I get the following error:

#0  0x00007ffff631b3ea in ?? () from /lib/x86_64-linux-gnu/
#1  0x00007ffff628512b in ?? () from /lib/x86_64-linux-gnu/
#2  0x00007ffff6285498 in ?? () from /lib/x86_64-linux-gnu/
#3  0x00007ffff65103d7 in ?? () from /lib/x86_64-linux-gnu/
#4  0x00007ffff65108a2 in ?? () from /lib/x86_64-linux-gnu/
#5  0x00007ffff62b767e in ?? () from /lib/x86_64-linux-gnu/
#6  0x00007ffff651548d in ?? () from /lib/x86_64-linux-gnu/
#7  0x00007ffff626bbf0 in ?? () from /lib/x86_64-linux-gnu/
#8  0x00007ffff626c3c4 in ?? () from /lib/x86_64-linux-gnu/
#9  0x00007ffff626e019 in ?? () from /lib/x86_64-linux-gnu/
#10 0x00007ffff62dd959 in ?? () from /lib/x86_64-linux-gnu/
#11 0x000055555569d819 in __cudart601 ()
#12 0x00005555556709cd in __cudart738 ()
#13 0x00005555556c1d92 in cudaMemcpyAsync ()

The memory pointers involved in the cudaMemcpyAsync seem OK. If I process the graphs sequentially, everything works fine. Memcheck is not helping, since it forces every call to be synchronous, making the whole process sequential.

Do you have any suggestion about how to debug this error?

  • be sure you are doing proper, rigorous CUDA error checking throughout your code
  • provide more information/description about the error/problem
  • seek to create a minimal reproducer. By removing things that don’t appear to be related to the error, you will reduce the scope of the problem. This is generally useful for you (it often results in important discoveries), and would be useful later if you ever decided to ask others for debug help
  • try different CUDA versions and GPU driver versions. Bugs appear and are fixed with regularity. It’s the nature of software, based on my observation.
  • log the specifics of every call to cudaMemcpyAsync. When the error occurs, check the logs for the specifics, to see if the pointers make sense, the direction makes sense, etc.
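For the first and last points, a minimal error-checking and logging wrapper might look like the sketch below. The `CHECK` macro and `loggedMemcpyAsync` helper are illustrative names of my own, not from the original post:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Check every CUDA runtime call and report file/line on failure.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Log the specifics of each async copy before issuing it, so when the
// crash happens the last logged line identifies the failing call.
static cudaError_t loggedMemcpyAsync(void* dst, const void* src, size_t n,
                                     cudaMemcpyKind kind, cudaStream_t s) {
    fprintf(stderr,
            "cudaMemcpyAsync dst=%p src=%p bytes=%zu kind=%d stream=%p\n",
            dst, src, n, (int)kind, (void*)s);
    return cudaMemcpyAsync(dst, src, n, kind, s);
}

// Usage:
// CHECK(loggedMemcpyAsync(d_buf, h_buf, bytes,
//                         cudaMemcpyHostToDevice, stream));
```

With this in place you can check the pointers, size, direction, and stream of the last copy attempted before the crash.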

It turned out to be my host code, which uses a lot of stack space. Unfortunately, the stack ran out of space on a request from the CUDA runtime, which led me to think it was a CUDA-related bug. I solved it by increasing the stack space with the command: ulimit -s unlimited
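For reference, the fix is applied in the shell that launches the application; a minimal sketch (the limit is per-session, so the CUDA program must be started from the same shell afterwards):

```shell
# Show the current soft stack limit (printed in KiB, or "unlimited")
ulimit -s

# Raise the limit for this shell; child processes launched from it
# (including the CUDA application) inherit the larger stack.
ulimit -s unlimited
ulimit -s
```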

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.