How to launch CUDA kernels in different processes

On Windows 7, I want to create multiple processes and launch a CUDA kernel in each one. Will kernels from different processes conflict when they run at the same time?

Also, will there be enough device memory in this case? Is the device memory used by different processes kept separate?

Kernels from different processes won’t run concurrently.
The GPU will service the kernels/processes in a round-robin fashion.
You’ll need to make sure that the device memory is large enough to support all the processes you wish to run concurrently. For example, suppose you have 3 processes. If each process starts at the same time, and wants to allocate 1GB of GPU memory, you’ll need to make sure your GPU has at least 3GB of memory available. After that, if the kernels (let’s assume 1 from each process) are launched at the same time, they will run sequentially.
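
To make the memory point concrete, here is a minimal sketch of how each process could check the free device memory before allocating. The helper name hasEnoughDeviceMemory is made up for illustration; cudaMemGetInfo is the runtime call doing the work. The value it reports is only a snapshot, so another process may still allocate in between the check and your own cudaMalloc.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns true if at least `bytesNeeded` bytes of device
// memory are currently free on the active GPU, false otherwise or on error.
bool hasEnoughDeviceMemory(size_t bytesNeeded)
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    // Free memory already excludes what other processes have allocated.
    printf("free: %zu MiB, total: %zu MiB\n", freeBytes >> 20, totalBytes >> 20);
    return freeBytes >= bytesNeeded;
}
```

For the 3-process example above, each process could call hasEnoughDeviceMemory((size_t)1 << 30) before its 1GB allocation and still has to handle a failing cudaMalloc afterwards.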


I have another question: in one process, I create two streams on one GPU, and in each stream I launch the same two kernels. Can kernels from one stream run concurrently with kernels from another stream?
The sample code is:

// Depth-first issue order: everything for stream 0 is queued before anything for stream 1.
// cudaStatus is overwritten by each call; real code should check it after every call.
for (int i = 0; i < 2; i++)
{
	cudaStatus = cudaEventRecord(start[i], stream[i]);
	cudaStatus = cudaMemcpyAsync(pDevA[i], pHostA[i], sizeof(float)*nWidth*nHeight, cudaMemcpyHostToDevice, stream[i]);
	TestKernel<<<gridSize, blockSize, 0, stream[i]>>>(pDevC[i], pDevA[i], nWidth, nHeight, 100);
	cudaStatus = cudaMemcpyAsync(pHostC[i], pDevC[i], sizeof(float)*nWidth*nHeight, cudaMemcpyDeviceToHost, stream[i]);
	TestKernel2<<<gridSize, blockSize, 0, stream[i]>>>(pDevC[i], pDevA[i], nWidth, nHeight, 100);
	cudaStatus = cudaMemcpyAsync(pHostC[i], pDevC[i], sizeof(float)*nWidth*nHeight, cudaMemcpyDeviceToHost, stream[i]);
	cudaStatus = cudaEventRecord(stop[i], stream[i]);
}

And how about the stream execution time?

Yes, theoretically kernels from one stream can run concurrently with kernels from another stream. However, a number of requirements have to be met to actually witness this, mostly around the resource utilization of the kernels. In my experience, kernel concurrency is fairly hard to witness in practice, and it requires a careful design. The CUDA samples include a concurrentKernels example that you can test and study if you wish. Also, in your case, you are doing a depth-first launch: all the operations for one stream are issued before any operation for the next stream. Since you have multiple operations (kernels, etc.) per stream, you may also want to experiment with a breadth-first launch, where each kind of operation is issued to every stream in turn.
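
As a rough sketch of what a breadth-first launch could look like, the function below issues the copies for both streams, then the kernels for both streams, then the copies back. Everything named here (scaleKernel, runBreadthFirst, the CHECK macro) is a hypothetical stand-in, not the code from the question, and the host buffers are assumed to be pinned (cudaMallocHost) so that the async copies have a chance to overlap. Whether anything actually overlaps depends on the GPU and on the kernels, so treat this as something to profile rather than a guarantee.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical error check: prints and returns -1 (resources leak on error).
#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s failed: %s\n", #call, cudaGetErrorString(e_)); \
    return -1; } } while (0)

// Hypothetical stand-in for TestKernel: out[i] = in[i] * factor.
__global__ void scaleKernel(float *out, const float *in, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

// Issues the same work on two streams in breadth-first order.
// Returns 0 on success, -1 on a CUDA error, 1 if a result is wrong.
int runBreadthFirst(int n)
{
    const int kStreams = 2;
    const size_t bytes = sizeof(float) * n;
    cudaStream_t stream[kStreams];
    float *hostIn[kStreams], *hostOut[kStreams];
    float *devIn[kStreams], *devOut[kStreams];

    for (int i = 0; i < kStreams; i++) {
        CHECK(cudaStreamCreate(&stream[i]));
        CHECK(cudaMallocHost((void **)&hostIn[i], bytes));   // pinned host memory
        CHECK(cudaMallocHost((void **)&hostOut[i], bytes));
        CHECK(cudaMalloc((void **)&devIn[i], bytes));
        CHECK(cudaMalloc((void **)&devOut[i], bytes));
        for (int k = 0; k < n; k++) hostIn[i][k] = (float)(k + i);
    }

    const int block = 256;
    const int grid = (n + block - 1) / block;

    // Breadth-first: one kind of operation at a time, across all streams.
    for (int i = 0; i < kStreams; i++)
        CHECK(cudaMemcpyAsync(devIn[i], hostIn[i], bytes, cudaMemcpyHostToDevice, stream[i]));
    for (int i = 0; i < kStreams; i++)
        scaleKernel<<<grid, block, 0, stream[i]>>>(devOut[i], devIn[i], n, 2.0f);
    CHECK(cudaGetLastError());
    for (int i = 0; i < kStreams; i++)
        CHECK(cudaMemcpyAsync(hostOut[i], devOut[i], bytes, cudaMemcpyDeviceToHost, stream[i]));
    CHECK(cudaDeviceSynchronize());

    // Verify the results, then release everything.
    int status = 0;
    for (int i = 0; i < kStreams; i++) {
        for (int k = 0; k < n; k++)
            if (hostOut[i][k] != 2.0f * hostIn[i][k]) status = 1;
        CHECK(cudaFreeHost(hostIn[i]));
        CHECK(cudaFreeHost(hostOut[i]));
        CHECK(cudaFree(devIn[i]));
        CHECK(cudaFree(devOut[i]));
        CHECK(cudaStreamDestroy(stream[i]));
    }
    return status;
}
```

Calling runBreadthFirst(1 << 20) under a profiler and comparing the timeline with the depth-first loop from the question is one way to see whether the issue order makes a difference on a given GPU.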

I have no idea what this question means:

“And how about the stream execution time?”

Hi, txbob.

In my case, I use the GPU to speed up image processing. I need to process 30 images of 1024*1024 pixels per second, which leaves about 33 milliseconds per image, but processing one image currently takes 45 milliseconds. I need to find a way to speed this up. I have tried a few approaches, with no effect.

1st, I use two streams (on one GPU device): one stream processes the odd-numbered images and the other processes the even-numbered images. I expected the two streams to run concurrently, but my tests show they do not. Do you have any ideas about stream concurrency?

2nd, is there any way to observe how many GPU resources (such as GPU cores) are used while processing an image?

3rd, is there any way to assign a fixed set of cores to one kernel (or one stream), for example 600 GPU cores for one kernel and a different 600 GPU cores for another?

4th, do you have any ideas for speeding up the GPU processing?

1st, I think it’s unlikely that kernels written to do image processing will run concurrently. Many of these kernels will assign one thread per pixel, and with an image of any significant size, this will create enough threads and therefore blocks to “fill” a GPU, preventing any meaningful concurrency. It should be possible to get concurrency on data copying to/from GPU with GPU kernel execution.

2nd, use a profiler.

3rd, CUDA doesn’t provide any built-in method to partition GPU HW resources.

4th, optimize the code?
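
To put a rough number on the "fill a GPU" point: a 1024*1024 image with one thread per pixel and 256 threads per block needs 4096 blocks. The sketch below compares that with how many blocks the device can hold at once, using cudaOccupancyMaxActiveBlocksPerMultiprocessor and the multiProcessorCount field of cudaDeviceProp. The kernel invertPixels and the helper estimateWaves are hypothetical placeholders; a real image kernel will have different register and shared memory usage, so its numbers will differ.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical one-thread-per-pixel kernel standing in for a real image filter.
__global__ void invertPixels(unsigned char *img, int nPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nPixels) img[i] = 255 - img[i];
}

// Returns how many "waves" of resident blocks the kernel needs on device 0
// (blocks needed / blocks the GPU can hold at once, rounded up), or -1 on error.
int estimateWaves(int width, int height, int blockSize)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;

    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, invertPixels, blockSize, 0) != cudaSuccess) return -1;

    int residentBlocks = blocksPerSM * prop.multiProcessorCount;
    if (residentBlocks <= 0) return -1;
    int neededBlocks = (width * height + blockSize - 1) / blockSize;

    printf("SMs: %d, resident blocks: %d, needed blocks: %d\n",
           prop.multiProcessorCount, residentBlocks, neededBlocks);
    // These two fields indicate whether the device supports concurrent
    // kernels and how many copy engines can overlap copies with kernels.
    printf("concurrentKernels: %d, asyncEngineCount: %d\n",
           prop.concurrentKernels, prop.asyncEngineCount);

    return (neededBlocks + residentBlocks - 1) / residentBlocks;
}
```

If estimateWaves(1024, 1024, 256) returns more than 1, a single kernel already needs more blocks than the GPU can hold at once, which leaves little room for a second kernel to run next to it.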

I understand that kernels launched from two separate processes cannot run concurrently.

I am profiling two processes running on a TX2 and it appears that the kernels launched by the two processes are running concurrently. Can you help me understand why it is showing up that way in the profiler? I will try to attach an nvvp screenshot for your reference.

The two processes are running an identical sequence of kernels if that makes a difference.

Thanks in advance.

Newer CUDA versions allow for the time-slicing of work between two processes. It used to be that CUDA kernels from one process would run to completion before a context switch could happen.

In more recent versions of CUDA on Pascal hardware and higher, there is the possibility for interprocess time-sliced access to the GPU. The net effect is that it “appears” as if both kernels are running simultaneously. Under the hood, however, first one kernel/process runs “for a while”, then another kernel/process runs “for a while”, and so on.

(Read the UPDATE)

Your TX2 is a Pascal-generation GPU.
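
If you want to check the hardware generation from code, here is a small sketch. Pascal corresponds to compute capability 6.x, so the test is simply whether the major version is at least 6. The helper name isPascalOrNewer is made up for this example, and it only tells you about the hardware generation, not whether the driver is actually time-slicing between your processes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns 1 if the device is Pascal (compute capability
// 6.x) or newer, 0 if it is older, and -1 if the device cannot be queried.
int isPascalOrNewer(int device)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return -1;
    }
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return prop.major >= 6 ? 1 : 0;
}
```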

Ahh…very interesting. The average duration of the same kernels “appears” to be about twice as long when running two processes simultaneously, which makes sense since the kernels are not able to execute in the same clock cycles. Thanks again!