Parallel Processing for CPU + GPU with & without CUDA usage

We have processes using both CPU & NVIDIA GPU (CUDA). For simplicity assume the process is one large function that takes in X inputs. Right now it goes like this:

inputs = [X1, X2, X3, …]

for X in inputs:
    output = function(X)

We need to parallelize this so that function(X1), function(X2), function(X3), and so forth begin running simultaneously, instead of function(X2) waiting for function(X1) to finish as it currently does in the for loop.

function(X) processes text and images, and it will evolve over time. But no AI is done on images right now; AI is applied only to text.

The key is using the full computing power (RAM/GPU) each time. Inputs will scale, and adjusting parallelization parameters each time is not sustainable.

System Specs:

  1. Ubuntu/Linux
  2. Python 3.8
  3. CUDA + TensorFlow are being used.
  4. It is not certain that CUDA will always be used.

We don't want to rely on a cloud event-based service to do this; we want to do it natively on the OS. The approach below is being considered, but advice would be appreciated.

* (Multiprocessing option): Multiprocessing vs. Threading in Python: What you need to know.
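A minimal sketch of that multiprocessing option, with placeholder inputs and a stub function standing in for the real processing (both are assumptions, not our real code):

import multiprocessing as mp

def function(X):
    # stub standing in for the real text/image processing
    print("processing", X)

if __name__ == "__main__":
    inputs = ["X1", "X2", "X3"]  # stand-ins for the real inputs
    # Pool size defaults to os.cpu_count(), so it scales with the
    # machine rather than with len(inputs)
    with mp.Pool() as pool:
        pool.map(function, inputs)

One caveat: each worker process that touches the GPU would create its own CUDA context, which is where the MPS discussion below becomes relevant.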

Could you provide some more background information:

How are these three applications deployed? Bare-metal with one GPU on one server, or three VMs with three GPUs?
Are these three applications running on one GPU? Can these three run independently?

The “Multiprocessing vs. Threading in Python: What you need to know” article mentioned above describes a single-process, multi-threaded (streams) approach, which can bring the best performance.

If multiple-process concurrency is needed in a non-virtualized environment, MPS is an option. Please refer to this presentation for more information: [CUDA Streams: Best Practices and Common Pitfalls](https://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf)
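For reference, a minimal sketch of what the multi-process client side could look like under MPS, assuming the MPS control daemon has already been started on the server (e.g. with nvidia-cuda-mps-control -d, a one-time admin step outside Python). The pipe directory shown is the default and must match the one the daemon uses:

import os
import multiprocessing as mp

# Clients locate the MPS daemon through this directory; it must match the
# CUDA_MPS_PIPE_DIRECTORY the daemon was started with (default shown)
os.environ.setdefault("CUDA_MPS_PIPE_DIRECTORY", "/tmp/nvidia-mps")

def worker(X):
    # each process creates its own CUDA context; with MPS running, those
    # contexts share the GPU concurrently instead of time-slicing
    print("processing", X)  # stub for the real GPU work

if __name__ == "__main__":
    with mp.Pool() as pool:
        pool.map(worker, ["X1", "X2", "X3"])  # placeholder inputs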

Hi, sorry for the delayed response.

  1. Bare-metal, one server. We are the only consumer of that server.
  2. What do you mean by “three applications”? There is only one application. function(X1), function(X2), function(X3) are examples of three different inputs to the function that will be run in a for-loop. There will be far more than X3…likely hundreds, going up to function(X100). However, all inputs will be in the same data structure, of course.

We’re investigating MPS, as multiple-process concurrency is needed in a non-virtualized environment. Can you confirm that parallelizing this to an indefinite number of processes on the GPU will be OK? We don’t want to reset the parameters if 50 processes (of the same function(X)) need to be parallelized instead of 3. The number of required parallel runs will change hourly.

What is the function? A CUDA GEMM kernel? A thread-level program? A process-level program?

I believe it’s a process-level program. Pseudo-code below:

inputs = [{"field 1": "string", "field 2": [array], "field 3": {"field 3.a": "string", "field 3.b": "string"}},
          {"field 1": "string", "field 2": [array], "field 3": {"field 3.a": "string", "field 3.b": "string"}},
          {"field 1": "string", "field 2": [array], "field 3": {"field 3.a": "string", "field 3.b": "string"}},
          {"field 1": "string", "field 2": [array], "field 3": {"field 3.a": "string", "field 3.b": "string"}}]

^ The inputs array will have an ever-expanding number of elements; each element will have the same structure.

for input in inputs:
	function(input)

Previously I wrote output = function(input). That’s misleading; the function doesn’t return anything.
The function is actually several functions/methods that must execute sequentially. So the real code is:

for input in inputs:
    output1 = function1(input)             # pre-processing (CPU heavy)
    output2 = function2(output1)           # read data from MongoDB
    output3 = function3(output1, output2)  # processing (GPU heavy)
    function4(output3)                     # insert new data into MongoDB

Functions 1-4 must run sequentially for each input, but we want to parallelize the for-loop so that the function1-4 pipeline for each input kicks off either together or in rapid succession, without waiting for function4 of each preceding input element/index to finish. A sketch of what we have in mind follows the note below.

** When comparing to my original post, each input (input1, input2, etc.) is X1, X2, etc.
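To make that concrete, here is a sketch of the structure we are considering (not working code): wrap functions 1-4 in one pipeline function so they stay sequential per input, then submit each input to a process pool. It assumes function1-function4 and inputs are defined at module level so they can be pickled:

from concurrent.futures import ProcessPoolExecutor

def pipeline(input):
    # functions 1-4 stay strictly sequential *within* one input
    output1 = function1(input)             # pre-processing (CPU heavy)
    output2 = function2(output1)           # read data from MongoDB
    output3 = function3(output1, output2)  # processing (GPU heavy)
    function4(output3)                     # insert new data into MongoDB

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:    # default size: os.cpu_count()
        # different inputs now run concurrently, and nothing needs
        # retuning as len(inputs) grows
        list(pool.map(pipeline, inputs))

Each worker would presumably also need its own MongoDB connection, since client objects generally don’t survive being sent across processes.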

Considering you own the source code and these all run inside one process (program), I would recommend using multiple threads to launch the sequential functions, each with a different CUDA stream. That way the multiple threads (CUDA streams) can execute on the GPU concurrently.
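A minimal sketch of that thread-per-input pattern, reusing the pipeline function from the post above. It assumes the GPU-heavy work in function3 goes through a library such as TensorFlow that releases the GIL while native kernels run; explicit per-thread CUDA stream assignment depends on the library in use and is not shown here:

from concurrent.futures import ThreadPoolExecutor

# one process, one CUDA context: no MPS is needed for this variant, and
# the MongoDB reads/writes overlap naturally because they are I/O-bound
with ThreadPoolExecutor(max_workers=8) as pool:  # 8 is an arbitrary example
    list(pool.map(pipeline, inputs))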

Back to your last question about MPS: yes, MPS supports multiple clients (launches). The pre-Volta MPS server supports up to 16 client CUDA contexts per device concurrently; the Volta MPS server supports 48 client CUDA contexts per device.
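If the multi-process route is taken, those limits suggest deriving the pool size at runtime instead of hand-tuning it per run; a sketch, with the 16/48 figures taken from the reply above and pipeline as sketched earlier:

import os
from concurrent.futures import ProcessPoolExecutor

MPS_CLIENT_LIMIT = 48  # use 16 on pre-Volta GPUs, per the figures above

# worker count adapts to the hourly input volume, the CPU count, and the
# MPS client limit, with no manual retuning
max_workers = min(len(inputs), os.cpu_count() or 1, MPS_CLIENT_LIMIT)
with ProcessPoolExecutor(max_workers=max_workers) as pool:
    list(pool.map(pipeline, inputs))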