Hello,
I have a data structure implemented in unified memory, with the goal of spreading the data across multiple GPUs and operating on it from any device (in effect treating unified memory as a global address space, as in PGAS runtimes). I have a kernel that launches itself recursively (using CUDA dynamic parallelism, CDP), with each launch operating on a different piece of the data structure allocated in unified memory. My question is: does the CUDA runtime support spawning the CDP child kernel on a different device, namely the one where the data resides (in unified memory), or will the unified memory page-faulting mechanism migrate the data to whichever device is spawning the kernel, so that the kernel operates on it there? In simplest terms, does the CUDA runtime currently support dynamic parallelism across multiple GPUs? Thanks in advance for any help!
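
For reference, here is a minimal sketch of the kind of setup described above. It is a simplified illustration under stated assumptions, not the actual code: the names `process` and `run_sketch`, the array size, the launch configuration, and the depth cutoff are all hypothetical. It assumes a build with relocatable device code, e.g. `nvcc -rdc=true sketch.cu -lcudadevrt`. The host launches the top-level kernel once per device on that device's slice of a managed allocation, and each kernel recursively launches child grids on sub-ranges.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical recursive kernel: handles data[lo, hi). Large ranges are
// split in two and each half is handed to a child grid launched from
// device code (CUDA dynamic parallelism). Small ranges are processed here.
__global__ void process(int *data, int lo, int hi, int depth)
{
    if (hi - lo <= 256 || depth >= 4) {
        // Base case: the threads of this block stride over the range.
        for (int i = lo + threadIdx.x; i < hi; i += blockDim.x)
            data[i] += 1;
        return;
    }
    if (threadIdx.x == 0) {
        int mid = lo + (hi - lo) / 2;
        // Child launches issued from the device. Which device these can
        // run on is exactly what the question above is asking.
        process<<<1, 64>>>(data, lo, mid, depth + 1);
        process<<<1, 64>>>(data, mid, hi, depth + 1);
    }
}

// Returns the number of elements with an unexpected value, or -1 on error.
int run_sketch()
{
    const int n = 1 << 16;
    int ndev = 0;
    if (cudaGetDeviceCount(&ndev) != cudaSuccess || ndev == 0) return -1;

    // One managed allocation, addressable from the host and every device.
    int *data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(int)) != cudaSuccess) return -1;
    for (int i = 0; i < n; ++i) data[i] = 0;

    // Give each device a contiguous slice; the last one takes the remainder.
    int chunk = n / ndev;
    for (int d = 0; d < ndev; ++d) {
        int lo = d * chunk;
        int hi = (d == ndev - 1) ? n : lo + chunk;
        cudaSetDevice(d);
        process<<<1, 64>>>(data, lo, hi, 0);
    }

    // Wait for every device (and all child grids) before touching the data.
    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);
        if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(data); return -1; }
    }

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (data[i] != 1) ++bad;
    cudaFree(data);
    return bad;
}

int main()
{
    int bad = run_sketch();
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

In this sketch the work is partitioned across devices only on the host side, with `cudaSetDevice` before each top-level launch; the child launches inside `process` carry no device selection at all, which is the part I am unsure about.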