TLP in the case of multiple CPU/GPU processes

Running an MPI job shows that 4 processes are running, each utilizing 100% of a core. At the same time, nvidia-smi also shows 4 processes using the GPU.

I want to know: does that mean each CPU process binds to only one GPU process? Does that mean each GPU process has one thread? In case of a thread stall, does it switch to another GPU process?

In general, how is TLP (thread-level parallelism) explained here? Any idea?


11569 mahmood 20 0 29.009g 156.0m 91.3m R 100.3 1.0 2:23.75 lmp_mpi
11570 mahmood 20 0 29.009g 155.8m 91.1m R 100.3 1.0 2:23.78 lmp_mpi
11568 mahmood 20 0 29.009g 156.1m 91.4m R 100.0 1.0 2:23.74 lmp_mpi
11571 mahmood 20 0 29.009g 156.1m 91.4m R 100.0 1.0 2:23.74 lmp_mpi

| 0 11568 C /opt/lammps-11Aug17/src/lmp_mpi 334MiB |
| 0 11569 C /opt/lammps-11Aug17/src/lmp_mpi 334MiB |
| 0 11570 C /opt/lammps-11Aug17/src/lmp_mpi 334MiB |
| 0 11571 C /opt/lammps-11Aug17/src/lmp_mpi 334MiB |

The two listings you are showing, from top and nvidia-smi, are showing you the same thing. There is no such thing as a GPU process.

You have 4 CPU processes, 11568, 11569, 11570, and 11571. Those same 4 processes are being listed in your top output as well as your nvidia-smi output.

Each of those 4 CPU processes is using GPU 0.
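As a quick check of the point above, here is a minimal Python sketch that extracts the PIDs from the two listings posted in this thread and compares them. The function names are made up for illustration, and the parsing assumes the column layout shown in those listings (PID first in top, GPU index then PID in the nvidia-smi process rows).

```python
import re

# Listings copied from the posts above.
TOP_OUTPUT = """\
11569 mahmood 20 0 29.009g 156.0m 91.3m R 100.3 1.0 2:23.75 lmp_mpi
11570 mahmood 20 0 29.009g 155.8m 91.1m R 100.3 1.0 2:23.78 lmp_mpi
11568 mahmood 20 0 29.009g 156.1m 91.4m R 100.0 1.0 2:23.74 lmp_mpi
11571 mahmood 20 0 29.009g 156.1m 91.4m R 100.0 1.0 2:23.74 lmp_mpi
"""

NVIDIA_SMI_OUTPUT = """\
| 0 11568 C /opt/lammps-11Aug17/src/lmp_mpi 334MiB |
| 0 11569 C /opt/lammps-11Aug17/src/lmp_mpi 334MiB |
| 0 11570 C /opt/lammps-11Aug17/src/lmp_mpi 334MiB |
| 0 11571 C /opt/lammps-11Aug17/src/lmp_mpi 334MiB |
"""


def pids_from_top(text):
    """Return the set of PIDs, assuming the PID is the first column."""
    return {int(line.split()[0]) for line in text.splitlines() if line.strip()}


def pids_from_nvidia_smi(text):
    """Return the set of PIDs from rows shaped like '| <gpu> <pid> C <name> <mem> |'."""
    pids = set()
    for line in text.splitlines():
        match = re.match(r"\|\s*(\d+)\s+(\d+)\s+C\s", line)
        if match:
            pids.add(int(match.group(2)))
    return pids


cpu_pids = pids_from_top(TOP_OUTPUT)
gpu_pids = pids_from_nvidia_smi(NVIDIA_SMI_OUTPUT)

print(sorted(cpu_pids))
print(cpu_pids == gpu_pids)
```

The two sets are identical, which is what you would expect if nvidia-smi is simply listing the host processes that hold a context on GPU 0.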

There is no indication that each CPU process consists of only one CPU thread. It might, but a process could in theory contain multiple CPU threads, and the listings you have shown would not look any different. As for the kernels those CPU processes are launching on GPU 0, it is certainly not the case that they consist of only 1 GPU thread.

With a multicore processor, we run an MPI job so that multiple processes are launched and each process is bound to a core. For example, with “-np 4”, four processes execute foo() in the code and each process gets a specific amount of data. That means each core is running foo() on different data.

Now, the same question exists for the GPU. Assume there are 2000 CUDA cores on the device. If I use the command line for an MPI job that launches 4 processes on the CPU, each of the 4 processes wants to launch a kernel foo(). The first process offloads foo() to the GPU; assume it uses all 2000 cores. Then process 2 wants to execute foo() and the GPU is switched to the second process. It again uses 2000 cores, and so on.

It seems that such an approach is not as good as the CPU version, because kernels launched by the CPU processes are serialized, sort of…

I read but couldn’t find the answer to my question. Maybe I missed something. Any thought?

A good strategy would be to construct your MPI application so that it uses one GPU per rank. If you have 2 GPUs, you launch 2 ranks, and so on.

It is possible to have multiple ranks share a single GPU. In that case CUDA MPS may be of interest.
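To make the rank-to-GPU idea concrete, here is a minimal Python sketch of one common mapping: each rank picks device `local_rank % num_gpus`. The helper names `gpu_for_rank` and `select_gpu` are hypothetical, and reading `OMPI_COMM_WORLD_LOCAL_RANK` assumes the job is launched with Open MPI's mpirun; other launchers expose the local rank under different names.

```python
import os


def gpu_for_rank(local_rank, num_gpus):
    """Map a node-local MPI rank to a GPU index in round-robin fashion."""
    if num_gpus <= 0:
        raise ValueError("need at least one GPU")
    return local_rank % num_gpus


def select_gpu(num_gpus, env=os.environ):
    """Pick a GPU index for this process.

    Assumption: the launcher is Open MPI, which sets
    OMPI_COMM_WORLD_LOCAL_RANK for each rank. Falls back to rank 0
    when the variable is absent (e.g. a plain, non-MPI run).
    """
    local_rank = int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    return gpu_for_rank(local_rank, num_gpus)


if __name__ == "__main__":
    # With 2 GPUs and 4 ranks, ranks 0 and 2 share GPU 0, ranks 1 and 3 share GPU 1.
    for rank in range(4):
        print(rank, gpu_for_rank(rank, 2))
```

With as many ranks as GPUs this gives one GPU per rank. With more ranks than GPUs the modulo wraps around and several ranks end up sharing a device, which is the situation where MPS may be of interest. The chosen index would typically be applied by setting `CUDA_VISIBLE_DEVICES` before the process initializes CUDA, or by selecting the device inside the application.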

Hi again
Running LAMMPS, I tried “mpirun -np 16” while MPS is off and the program runs without any error. When I turn MPS on, I am not able to run with “-np 4” or more; “-np 2” works. The error is about insufficient memory.

Is that normal?

Sounds odd.
I tried another program, GROMACS. Without MPS, total memory usage is about 200MB, while the GPU has 4GB of memory. When I turn MPS on, it quickly fails at the beginning of the run with an out-of-memory error.

How can I determine whether the error is related to the GPU or to the application itself?
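One way to start narrowing this down, sketched here under assumptions rather than offered as a definitive diagnosis, is to sample GPU memory with nvidia-smi while the job starts up, once with MPS off and once with MPS on. The Python below uses nvidia-smi's `--query-gpu` CSV output; the helper names are made up for illustration, and the sample string is invented to show the expected shape of the output, not a measurement.

```python
import subprocess

# nvidia-smi query that prints one "used, total" line (in MiB) per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse lines like '1336, 4096' into (used_mib, total_mib) tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory_mib():
    """Run nvidia-smi and return [(used_mib, total_mib), ...], one entry per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)


def free_fraction(used, total):
    """Fraction of GPU memory still free."""
    return (total - used) / total


if __name__ == "__main__":
    # Illustrative sample only; call gpu_memory_mib() on a machine with a GPU.
    sample = "1336, 4096\n"
    for used, total in parse_memory_csv(sample):
        print(used, total, round(free_fraction(used, total), 2))
```

If the GPU is close to full at the moment the run fails, that points toward device memory. If most of the 4GB is still free, the out-of-memory message may be coming from somewhere else, and the full error text from the application would be the next thing to look at.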