I have successfully created a program using OptiX Prime with the following structure:
1) call OptiX Prime
2) do work that is computationally expensive on the CPU
3) go back to 1) until convergence.
I now want to MPI-ize my code, for three reasons:
-A) I want to speed up part 2) by using more cores, up to all the cores available on a machine
-B) I want to speed up my computation across multiple machines
-C) I want to be able to use more memory
I am also aware that OptiX Prime detects the number of cores on the local machine and launches the corresponding number of threads.
I am considering two approaches:
Approach 1:
- do load balancing
- 1a) call OptiX Prime on each machine, but only on thread 0
- 1b) take the results from 1a) and distribute them to all the other threads locally on the machine
- 2a) on every thread of every machine, run the computationally expensive CPU work
- 2b) on each machine, have every thread send its results to thread 0 (so that OptiX Prime can be run there)
- communicate convergence information between all machines and check whether convergence is achieved; if not, go back to 1a)
Approach 2:
- do load balancing
- call OptiX Prime on each machine, on all threads
- on every thread of every machine, run the computationally expensive CPU work
- communicate convergence information between all threads and all machines and check whether convergence is achieved; if not, go back to the first step
I would rather go with approach 2) to minimize communication, but approach 1) might be more interesting for memory reasons (question C).
So, finally, here are my questions:
A) If I launch several instances of OptiX Prime on the same machine, is there a way to limit the number of threads launched by each instance? (approach 2)
Edit: What I want is to limit each instance to 1 thread, to avoid launching N*N OptiX Prime threads, N being the number of cores (in the case where I launch my application with mpirun -np N myApplication).
B) If I want to use MPI across multiple machines, do I need to use the same GPU everywhere, or is using the same OptiX Prime version and the same CUDA drivers enough?
C) If I launch several instances of OptiX Prime on the same machine, is it possible to share some of the geometry structures between them to minimize my memory overhead? (It looks very unlikelyly to work, but maybe it is possible?)
One additional piece of information: I do not want to move to full OptiX (and pure CUDA) for now because of the memory limitations inherent to the GPU. I foresee that if I do a GPU implementation, the only way I can fit all my data is on a GPU with 4 GB or more, and since a significant part of that data needs to be written by all the threads (and there is no way I can decompose the domain, since everything is interconnected), I would need to use atomic operations all the time, which - from what I remember - would annihilate all the advantage of using the GPU.
With a CPU implementation, each thread will have enough memory to run, and I can perform the necessary post-processing to sum the sampled data coming from all my threads.
I could consider that approach on the GPU as well, but I foresee a requirement of about 100 MB of data to be written per thread, which would severely limit the number of "threads" I could launch on a GPU if I want to avoid atomic operations. (I foresee at least 2 GB or more of overhead for the geometry and the geometry structure.)
If my understanding is not correct, please correct me.