Launch multiple kernels while using Multi-Process Service (MPS)

Hi, I have a JCuda program where I am trying to use MPS to pass a pointer to device data between two kernels. The first kernel just allocates some data on the device, and I pass the pointer to that data to the second kernel, so all it has to do is read it. The use case is to avoid copying the data onto the device again.

My GPU supports MPS, and I can see it registering the processes fine. The second kernel launches and cuLaunchKernel returns status 0, but the program gets stuck and never returns from cuCtxSynchronize.

Does anyone have any suggestions on what might be happening here?

Thx

Perhaps the 2nd kernel is crashing or hanging, which would explain why cuCtxSynchronize waits forever.

Add a breakpoint at the first line of the 2nd kernel and step through it in the debugger; or add a breakpoint at the last line of the 2nd kernel and see if it exits; or add breakpoints as beacons across the 2nd kernel and note the last beacon reached.
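If a debugger can't attach to device code (which is often the case when launching from a JCuda host program), printf "beacons" inside the kernel are a rough substitute. A sketch, with placeholder kernel and parameter names:

```cuda
#include <cstdio>

// Hypothetical 2nd kernel, instrumented with printf beacons to locate
// where it stalls or faults. Names here are illustrative, not from the
// original program.
__global__ void secondKernel(const float *data, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        printf("beacon 1: entered, data=%p\n", data);

    float v = (threadIdx.x < n) ? data[threadIdx.x] : 0.0f;  // suspect dereference

    if (blockIdx.x == 0 && threadIdx.x == 0)
        printf("beacon 2: first read OK, v=%f\n", v);
}
```

One caveat: device-side printf output is buffered and only flushed when the host synchronizes after the kernel completes, so a kernel that faults may show no beacon output at all, even for beacons it actually reached before the fault.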

A pointer created in the address space of one process is not directly usable in another process's address space. I'm not sure what you mean by "utilize MPS to pass a pointer to data between two kernels". The purpose of MPS is to allow kernels from different processes to execute concurrently. It doesn't have anything to do with user data sharing.

The MPS documentation may be of interest:

https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf

If you want to allow an allocation from one process to be accessed by another process, the recommended approach is CUDA IPC. There was a recent discussion here:

https://devtalk.nvidia.com/default/topic/794987/cuda-programming-and-performance/gpu-inter-process-communications-ipc-question/
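For reference, the CUDA IPC flow from the runtime API looks roughly like this. This is a sketch only: error checking is omitted, the transport used to send the handle between processes (file, pipe, socket, ...) is up to you, and the kernel and variable names are illustrative:

```cuda
// --- Process A: owns the allocation ---
float *devPtr;
cudaMalloc(&devPtr, N * sizeof(float));

cudaIpcMemHandle_t handle;
cudaIpcGetMemHandle(&handle, devPtr);
// send `handle` (an opaque struct) to process B over any IPC channel

// --- Process B: maps the same allocation into ITS address space ---
cudaIpcMemHandle_t handle;          // received from process A
float *theirPtr;
cudaIpcOpenMemHandle((void **)&theirPtr, handle,
                     cudaIpcMemLazyEnablePeerAccess);
// theirPtr is valid in process B, but its numerical value will in general
// differ from devPtr in process A -- the mapping is per-process
myKernel<<<grid, block>>>(theirPtr);   // read the shared data

cudaIpcCloseMemHandle(theirPtr);       // when done
```

The key point is that what crosses the process boundary is the opaque handle, not the raw pointer value; each process gets its own pointer to the same underlying allocation.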

Thank you both for your suggestions. txbob, we are trying to exploit the following stated in the MPS documentation:

"MPS client processes allocate memory from different partitions of the same GPU virtual address space."

"An out-of-range read in a CUDA Kernel can access CUDA-accessible memory modified by another process."

We're not trying to do an out-of-range read; rather, we pass the pointer to data that's already allocated on the GPU to the second kernel. The first kernel runs and returns OK, but the second kernel gets stuck. I'm using IntelliJ and JCuda, and can't set a breakpoint in the 2nd kernel. As I mentioned, for the 2nd kernel, cuLaunchKernel returns status 0, but the program never returns from cuCtxSynchronize. I don't see any of the printfs in the 2nd kernel printed.

Thanks

The MPS documentation has no stated support for pointer sharing, that I can see.

Architecturally, yes, every process takes a separate chunk of the GPU's virtual address space for its own needs. This does not mean that each process has the same logical->virtual address mapping. The virtual space is unified/harmonized, but each process maintains its own logical->virtual mapping. This means a pointer in one process has no meaning when dereferenced in another process.

cuLaunchKernel will not return an error, as the launch process has no way of knowing the pointer is invalid. It will attempt to launch the kernel, which will begin executing until it dereferences that bogus pointer. At that point, bad things will happen. I would expect the failed launch to show up at the next synchronization point, but I'm just speculating, working off your description.
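A sketch of how that typically looks with the driver API (assuming cuGetErrorString is available in your CUDA version; variable names are illustrative):

```cuda
// A bad in-kernel dereference usually surfaces as an error from the next
// synchronizing call, not from the launch itself.
CUresult rc = cuLaunchKernel(func, gx, gy, gz, bx, by, bz,
                             0 /* sharedMem */, 0 /* stream */,
                             kernelParams, 0 /* extra */);
// rc is likely CUDA_SUCCESS (0) even if the kernel will fault later

rc = cuCtxSynchronize();
if (rc != CUDA_SUCCESS) {
    const char *msg = 0;
    cuGetErrorString(rc, &msg);   // e.g. an illegal-address error
    fprintf(stderr, "kernel failed: %s\n", msg ? msg : "unknown");
}
```

In your case cuCtxSynchronize hangs rather than returning an error, which is also consistent with the kernel misbehaving after dereferencing a pointer that is meaningless in its process.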

The document’s statement about out-of-range reads is exactly that: a warning that there is no enforced interprocess memory security provided by the GPU/driver.

As I’ve already mentioned, CUDA IPC is provided to help you work around this.

As a simple test, I took the two-test-app sample code that I provided in the IPC thread linked above, and put a printf statement in each app to print the numerical value of the data variable (the pointer that was "shared" via the IPC mechanism). This is a 64-bit Linux system with UVA in effect. The numerical values of the pointers are not the same between the two processes. (You could repeat this experiment in your MPS setup if you like; it should not be difficult.) Passing a numerical pointer value directly from one process to another is going to be problematic.

You may want to read sections 3.2.7 and 3.2.8 of the CUDA programming guide:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#unified-virtual-address-space

“Any device memory pointer or event handle created by a host thread can be directly referenced by any other thread within the same process. It is not valid outside this process however, and therefore cannot be directly referenced by threads belonging to a different process.”

I know of nothing in MPS that abrogates that.

If you think carefully about the implications of unified virtual addressing in a multi-process environment, I think it will become clear to you that the CUDA driver must maintain its own logical->virtual address mappings, which may vary from process to process.

Thanks so much for your detailed explanation. I will definitely do the simple test and look at CUDA IPC.

Within a single process, however, pointer sharing should work OK, right?

Thanks.

My previous response above excerpted this statement from the programming guide:

“Any device memory pointer or event handle created by a host thread can be directly referenced by any other thread within the same process.”
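So yes: within one process, passing a single device allocation to multiple kernels is the normal pattern. An illustrative sketch (kernel and variable names are placeholders, and error checking is omitted):

```cuda
// Two kernels in the SAME process sharing one cudaMalloc'd pointer.
__global__ void produce(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = i * 2.0f;
}

__global__ void consume(const float *buf, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = buf[i] + 1.0f;   // reads data written by produce()
}

int main() {
    const int n = 1024;
    float *buf, *out;
    cudaMalloc(&buf, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    produce<<<(n + 255) / 256, 256>>>(buf, n);
    consume<<<(n + 255) / 256, 256>>>(buf, out, n);  // same pointer, same process: fine
    cudaDeviceSynchronize();

    cudaFree(buf);
    cudaFree(out);
    return 0;
}
```

It is only when the pointer value crosses a process boundary that you need the IPC handle mechanism discussed above.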