Launch multiple kernels while using Multi-Process Service (MPS)

Hi, I have a JCuda program where I am trying to use MPS to pass a pointer to device data between two kernels. The first kernel just allocates some data on the device, and I pass the pointer to that data to the second kernel, so all it has to do is read it. The use case is to avoid copying the data onto the device again.

My GPU supports MPS, and I can see it registering the processes fine. The second kernel launches and cuLaunchKernel returns status 0, but the program gets stuck and never returns from cuCtxSynchronize.

Does anyone have any suggestions on what might be happening here?

Thx

Perhaps the 2nd kernel is crashing or hanging, which would explain why cuCtxSynchronize waits forever.

Add a breakpoint at the first line of the 2nd kernel and step through it in the debugger; or add a breakpoint at the last line of the 2nd kernel and see if it exits; or add breakpoints as beacons across the 2nd kernel and note the last beacon reached.
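If a debugger can't attach to device code (which is often the case when launching from a JCuda host program), printf "beacons" inside the kernel are a rough substitute. A sketch, with placeholder kernel and parameter names:

```cuda
#include <cstdio>

// Hypothetical 2nd kernel, instrumented with printf beacons to locate
// where it stalls or faults. Names here are illustrative, not from the
// original program.
__global__ void secondKernel(const float *data, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        printf("beacon 1: entered, data=%p\n", data);

    float v = (threadIdx.x < n) ? data[threadIdx.x] : 0.0f;  // suspect dereference

    if (blockIdx.x == 0 && threadIdx.x == 0)
        printf("beacon 2: first read OK, v=%f\n", v);
}
```

One caveat: device-side printf output is buffered and only flushed when the host synchronizes after the kernel completes, so a kernel that faults may show no beacon output at all, even for beacons it actually reached before the fault.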

A pointer created in the address space of one process is not directly usable in another process's address space. I'm not sure what you mean by "utilize MPS to pass a pointer to data between two kernels". The purpose of MPS is to allow kernels from different processes to execute concurrently. It doesn't have anything to do with user data sharing.

The MPS documentation may be of interest:

https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf

If you want to allow an allocation from one process to be accessed by another process, the recommended approach is CUDA IPC. There was a recent discussion here:

https://devtalk.nvidia.com/default/topic/794987/cuda-programming-and-performance/gpu-inter-process-communications-ipc-question/
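For reference, the CUDA IPC flow from the runtime API looks roughly like this. This is a sketch only: error checking is omitted, the transport used to send the handle between processes (file, pipe, socket, ...) is up to you, and the kernel and variable names are illustrative:

```cuda
// --- Process A: owns the allocation ---
float *devPtr;
cudaMalloc(&devPtr, N * sizeof(float));

cudaIpcMemHandle_t handle;
cudaIpcGetMemHandle(&handle, devPtr);
// send `handle` (an opaque struct) to process B over any IPC channel

// --- Process B: maps the same allocation into ITS address space ---
cudaIpcMemHandle_t handle;          // received from process A
float *theirPtr;
cudaIpcOpenMemHandle((void **)&theirPtr, handle,
                     cudaIpcMemLazyEnablePeerAccess);
// theirPtr is valid in process B, but its numerical value will in general
// differ from devPtr in process A -- the mapping is per-process
myKernel<<<grid, block>>>(theirPtr);   // read the shared data

cudaIpcCloseMemHandle(theirPtr);       // when done
```

The key point is that what crosses the process boundary is the opaque handle, not the raw pointer value; each process gets its own pointer to the same underlying allocation.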

Thank you both for your suggestions. txbob, we are trying to exploit the following stated in the MPS documentation:

"MPS client processes allocate memory from different partitions of the same GPU virtual address space."

"An out-of-range read in a CUDA Kernel can access CUDA-accessible memory modified by another process."

We're not trying to do an out-of-range read; rather, we pass the pointer to data that's already allocated on the GPU to the second kernel. The first kernel runs and returns OK, but the second kernel gets stuck. I'm using IntelliJ and JCuda, and can't set a breakpoint in the 2nd kernel. As I mentioned, for the 2nd kernel, cuLaunchKernel returns status 0, but the program never returns from cuCtxSynchronize. I don't see any of the printfs in the 2nd kernel printed.

Thanks

The MPS documentation has no stated support for pointer sharing, that I can see.

Architecturally, yes, every process takes a separate chunk of the GPU's virtual address space for its own needs. This does not mean that each process has the same logical->virtual address mapping. The virtual space is unified/harmonized, but each process maintains its own logical->virtual mapping. This means a pointer in one process has no meaning when dereferenced in another process.

cuLaunchKernel will not return an error, as the launch process has no way of knowing the pointer is invalid. It will attempt to launch the kernel, which will begin executing until it dereferences that bogus pointer. At that point, bad things will happen. I would expect the failed launch to show up at the next synchronization point, but I'm just speculating, working off your description.
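A sketch of how that typically looks with the driver API (assuming cuGetErrorString is available in your CUDA version; variable names are illustrative):

```cuda
// A bad in-kernel dereference usually surfaces as an error from the next
// synchronizing call, not from the launch itself.
CUresult rc = cuLaunchKernel(func, gx, gy, gz, bx, by, bz,
                             0 /* sharedMem */, 0 /* stream */,
                             kernelParams, 0 /* extra */);
// rc is likely CUDA_SUCCESS (0) even if the kernel will fault later

rc = cuCtxSynchronize();
if (rc != CUDA_SUCCESS) {
    const char *msg = 0;
    cuGetErrorString(rc, &msg);   // e.g. an illegal-address error
    fprintf(stderr, "kernel failed: %s\n", msg ? msg : "unknown");
}
```

In your case cuCtxSynchronize hangs rather than returning an error, which is also consistent with the kernel misbehaving after dereferencing a pointer that is meaningless in its process.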

The document’s statement about out-of-range reads is exactly that: a warning that there is no enforced interprocess memory security provided by the GPU/driver.

As I’ve already mentioned, CUDA IPC is provided to help you work around this.

As a simple test, I took the two-test-app sample code that I provided in the IPC thread linked above, and put a printf statement in each app to print the numerical value of the data variable (the pointer that was "shared" via the IPC mechanism). This is a 64-bit Linux system with UVA in effect. The numerical values of the pointers are not the same between the two processes. (You could repeat this experiment in your MPS setup if you like; it should not be difficult.) Passing a numerical pointer value directly from one process to another is going to be problematic.

You may want to read sections 3.2.7 and 3.2.8 of the CUDA programming guide:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#unified-virtual-address-space

“Any device memory pointer or event handle created by a host thread can be directly referenced by any other thread within the same process. It is not valid outside this process however, and therefore cannot be directly referenced by threads belonging to a different process.”

I know of nothing in MPS that abrogates that.

If you think carefully about the implications of unified virtual addressing in a multi-process environment, I think it will become clear to you that the CUDA driver must maintain its own logical->virtual address mappings, which may vary from process to process.

Thanks so much for your detailed explanation. I will definitely do the simple test and look at CUDA IPC.

Within a single process, however, pointer sharing should work OK, right?

Thanks.

My previous response above excerpted this statement from the programming guide:

“Any device memory pointer or event handle created by a host thread can be directly referenced by any other thread within the same process.”
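So yes: within one process, passing a single device allocation to multiple kernels is the normal pattern. An illustrative sketch (kernel and variable names are placeholders, and error checking is omitted):

```cuda
// Two kernels in the SAME process sharing one cudaMalloc'd pointer.
__global__ void produce(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = i * 2.0f;
}

__global__ void consume(const float *buf, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = buf[i] + 1.0f;   // reads data written by produce()
}

int main() {
    const int n = 1024;
    float *buf, *out;
    cudaMalloc(&buf, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    produce<<<(n + 255) / 256, 256>>>(buf, n);
    consume<<<(n + 255) / 256, 256>>>(buf, out, n);  // same pointer, same process: fine
    cudaDeviceSynchronize();

    cudaFree(buf);
    cudaFree(out);
    return 0;
}
```

It is only when the pointer value crosses a process boundary that you need the IPC handle mechanism discussed above.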