How to launch CUDA kernels in different processes

On a Windows 7 system, I want to create multiple processes and launch a CUDA kernel in each process. Will these kernels from different processes conflict with each other when running at the same time?

And in this case, is there enough device memory to go around? Is device memory from different processes kept separate?

Kernels from different processes won’t run concurrently.
The GPU will service the kernels/processes in a round-robin fashion.
You’ll need to make sure that the device memory is large enough to support all the processes you wish to run concurrently. For example, suppose you have 3 processes. If each process starts at the same time, and wants to allocate 1GB of GPU memory, you’ll need to make sure your GPU has at least 3GB of memory available. After that, if the kernels (let’s assume 1 from each process) are launched at the same time, they will run sequentially.
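As an illustration (my own sketch, not part of the discussion above), a process can check the free device memory before making a large allocation, so that it fails gracefully when other processes have already claimed most of the GPU:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
	size_t freeBytes = 0, totalBytes = 0;
	cudaMemGetInfo(&freeBytes, &totalBytes);   // free/total memory on the current device

	const size_t want = 1ULL << 30;            // 1GB, as in the example above
	if (freeBytes < want) {
		printf("only %zu MB free of %zu MB total, not allocating\n",
		       freeBytes >> 20, totalBytes >> 20);
		return 1;
	}
	float *pDev = NULL;
	cudaError_t err = cudaMalloc(&pDev, want);
	printf("cudaMalloc: %s\n", cudaGetErrorString(err));
	cudaFree(pDev);
	return 0;
}

Note that cudaMemGetInfo reports a device-wide snapshot, so another process may still allocate memory between the check and the cudaMalloc; you still need to handle a cudaMalloc failure.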

Thanks, txbob.

I have another question: in one process, I create two streams on one GPU. In each stream, I launch the same two kernels. Can the kernels from one stream run concurrently with the kernels from the other stream?
The sample code is:

for (int i = 0; i < 2; i++)
	{
		cudaStatus = cudaEventRecord(start[i], stream[i]);
		// Copy this stream's input image to the device
		cudaStatus = cudaMemcpyAsync(pDevA[i], pHostA[i], sizeof(float)*nWidth*nHeight, cudaMemcpyHostToDevice, stream[i]);

		TestKernel<<<gridSize, blockSize, 0, stream[i]>>>(pDevC[i], pDevA[i], nWidth, nHeight, 100);
		cudaStatus = cudaMemcpyAsync(pHostC[i], pDevC[i], sizeof(float)*nWidth*nHeight, cudaMemcpyDeviceToHost, stream[i]);

		TestKernel2<<<gridSize, blockSize, 0, stream[i]>>>(pDevC[i], pDevA[i], nWidth, nHeight, 100);

		// This second device-to-host copy overwrites the result of the first one above
		cudaStatus = cudaMemcpyAsync(pHostC[i], pDevC[i], sizeof(float)*nWidth*nHeight, cudaMemcpyDeviceToHost, stream[i]);
		cudaStatus = cudaEventRecord(stop[i], stream[i]);
	}

And how about the stream execution time?

Yes, theoretically kernels from one stream can run concurrently with kernels from another stream. There are a number of requirements that have to be met to actually witness this, however, mostly around the resource utilization of the kernels. In practice, kernel concurrency is fairly hard to witness, in my experience; it requires careful design. There is a concurrentKernels CUDA sample code you can test and study if you wish. Also, in your case, you are doing a depth-first launch. Since you have multiple operations (kernels, etc.) per stream, you may also want to experiment with a breadth-first launch.
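For reference, a breadth-first version of the loop above issues each class of operation across all streams before moving to the next (a sketch reusing the variable names from your code; the intermediate device-to-host copy is omitted here, since its result is overwritten by the final copy anyway):

for (int i = 0; i < 2; i++)
	cudaEventRecord(start[i], stream[i]);
for (int i = 0; i < 2; i++)
	cudaMemcpyAsync(pDevA[i], pHostA[i], sizeof(float)*nWidth*nHeight, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < 2; i++)
	TestKernel<<<gridSize, blockSize, 0, stream[i]>>>(pDevC[i], pDevA[i], nWidth, nHeight, 100);
for (int i = 0; i < 2; i++)
	TestKernel2<<<gridSize, blockSize, 0, stream[i]>>>(pDevC[i], pDevA[i], nWidth, nHeight, 100);
for (int i = 0; i < 2; i++)
	cudaMemcpyAsync(pHostC[i], pDevC[i], sizeof(float)*nWidth*nHeight, cudaMemcpyDeviceToHost, stream[i]);
for (int i = 0; i < 2; i++)
	cudaEventRecord(stop[i], stream[i]);

With this issue order, no stream's work is queued behind several operations from the other stream, which gives the hardware more opportunity to overlap them.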

I have no idea what this question means:

“And how about the stream execution time?”

Hi, txbob.

In my case, I use the GPU to speed up image processing: each second I need to process 30 images of 1024*1024 pixels, but processing one image currently takes 45 milliseconds. I have to find some way to speed this up. I have tried a few approaches, but without effect.

1st, I use two streams (one GPU device): one stream processes the odd images and the other processes the even images. I expected these two streams to run concurrently, but the test results show they do not. Do you have any ideas about stream concurrency?

2nd, is there any way to observe how many GPU resources (such as GPU cores) are used while processing an image?

3rd, is there any way to assign a fixed set of cores to one kernel (or one stream), for example one kernel using 600 GPU cores and another kernel using a different 600 GPU cores?

4th, do you have any other ideas for speeding up GPU processing?

I think it’s unlikely that kernels written to do image processing will run concurrently. Many of these kernels will assign one thread per pixel, and with an image of any significant size, this will create enough threads and therefore blocks to “fill” a GPU, preventing any meaningful concurrency. It should be possible to get concurrency on data copying to/from GPU with GPU kernel execution.
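As a rough sketch of that copy/compute overlap (my own illustration; the processImage kernel and sizes here are placeholders, not your actual code), the key ingredients are pinned host memory and per-stream buffers, so the async copies in one stream can overlap kernel execution in the other:

#include <cuda_runtime.h>

__global__ void processImage(float *dst, const float *src, int n)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	if (idx < n) dst[idx] = src[idx] * 2.0f;   // stand-in for real image work
}

int main()
{
	const int N = 1024 * 1024;                 // one 1024x1024 image
	const int nStreams = 2;
	float *hSrc[nStreams], *hDst[nStreams], *dSrc[nStreams], *dDst[nStreams];
	cudaStream_t stream[nStreams];

	for (int i = 0; i < nStreams; i++) {
		cudaMallocHost(&hSrc[i], N * sizeof(float));   // pinned memory is required
		cudaMallocHost(&hDst[i], N * sizeof(float));   // for truly async copies
		cudaMalloc(&dSrc[i], N * sizeof(float));
		cudaMalloc(&dDst[i], N * sizeof(float));
		cudaStreamCreate(&stream[i]);
	}

	// While one stream's kernel runs, the other stream's copy can proceed,
	// hiding much of the transfer time behind compute.
	for (int i = 0; i < nStreams; i++) {
		cudaMemcpyAsync(dSrc[i], hSrc[i], N * sizeof(float),
		                cudaMemcpyHostToDevice, stream[i]);
		processImage<<<(N + 255) / 256, 256, 0, stream[i]>>>(dDst[i], dSrc[i], N);
		cudaMemcpyAsync(hDst[i], dDst[i], N * sizeof(float),
		                cudaMemcpyDeviceToHost, stream[i]);
	}
	cudaDeviceSynchronize();
	return 0;
}

For your 30-images-per-second pipeline, overlapping the transfer of image k+1 with the processing of image k is usually the most reliable source of speedup, even when the kernels themselves never run concurrently.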

Use a profiler.
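For example, the NVIDIA Visual Profiler (nvvp) shows a timeline plus per-kernel occupancy, and from the command line something like

	nvprof --metrics achieved_occupancy ./your_app

(your_app being a placeholder for your executable) reports how fully each kernel occupies the GPU's multiprocessors.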

CUDA doesn’t provide any built-in method to partition GPU HW resources.

Optimize the code?

I understand that kernels launched from two separate processes cannot run concurrently.

I am profiling two processes running on a TX2 and it appears that the kernels launched by the two processes are running concurrently. Can you help me understand why it is showing up that way in the profiler? I will try to attach an nvvp screenshot for your reference.

The two processes are running an identical sequence of kernels, if that makes a difference.

Thanks in advance.

Newer CUDA versions allow for the time-slicing of work between two processes. It used to be that CUDA kernels from one process would run to completion before a context switch could happen.

In more recent versions of CUDA on Pascal hardware and higher, there is the possibility for interprocess time-sliced access to the GPU. The net effect of this is that it “appears” as if both kernels are running simultaneously, however what is happening under the hood is that first one kernel/process runs “for a while” then another kernel/process runs “for a while” etc.

gpu - How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications? - Stack Overflow

(Read the UPDATE)

Your TX2 is a Pascal-generation GPU.

Ahh… very interesting. The average duration of the same kernels "appears" to be about twice as long when running two processes simultaneously, which makes sense since the kernels are not able to execute in the same clock cycles. Thanks again!