I work on a Linux server with two AMD EPYC 7343 16-core processors and three A40 GPUs.
When I launch multiple simulations, i.e. one simulation per GPU, performance collapses. For example, when running the same three simulations at once, the computation times are both poor and unstable.
To rule out interference from I/O, I tested the same code without any disk writes; the performance is just as problematic.
I would add that in my case there is no bottleneck in CPU or GPU memory usage.

Has anyone faced the same problem?

The two-CPU system has two NUMA nodes with distinct memory banks, and I assume each PCIe slot is also wired to a specific CPU. Anything else has to travel through the interconnect between the CPUs, which may under some circumstances be a bottleneck.

One thing to look out for is to place host memory allocations on the same NUMA node that the respective PCIe slot is directly connected to.
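To see which NUMA node a given GPU hangs off, `nvidia-smi topo -m` prints the CPU and NUMA affinity per GPU. The same information can be read from sysfs; below is a minimal Python sketch, assuming a Linux kernel that exposes `numa_node` and `local_cpulist` for PCI devices. The helper names and the bus ID in the usage comment are hypothetical.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-3,8,10-11' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def sysfs_bus_id(bus_id):
    """Convert a bus ID with an 8-digit domain (as printed by nvidia-smi)
    to the 4-digit-domain form used under /sys/bus/pci/devices."""
    domain, rest = bus_id.lower().split(":", 1)
    return f"{domain[-4:]}:{rest}"


def gpu_numa_info(bus_id, sysfs_root="/sys/bus/pci/devices"):
    """Return (numa_node, local_cpus) for one PCI device.

    numa_node is -1 when the kernel/firmware does not report locality.
    """
    dev = Path(sysfs_root) / sysfs_bus_id(bus_id)
    node = int((dev / "numa_node").read_text().strip())
    cpus = parse_cpulist((dev / "local_cpulist").read_text())
    return node, cpus


# Usage (the bus ID is a placeholder; take the real ones from
# `nvidia-smi --query-gpu=index,pci.bus_id --format=csv`):
#   node, cpus = gpu_numa_info("00000000:41:00.0")
```

The returned CPU list is what you would hand to an affinity mask so that the host side of each simulation runs next to its GPU.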

Does your computation involve a lot of memory transfers to/from the individual GPUs?

In my case, there is almost no GPU/CPU memory transfer; the code only launches kernels in loops. Only once in a while is the GPU data transferred for backup. My code uses unified memory. Maybe that’s the problem with the dual-CPU/NUMA setup?

Launching kernels in a situation that uses unified memory will usually cause some H->D data movement during kernel activity.

Entire papers have been written about this topic.

I wonder if disabling one NUMA node (CPU) could already give insights into any performance aspects related to NUMA in your case.

Some more ideas: lock the CPU processes/threads of each simulation to specific CPUs using affinity masks.
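As an illustration of what an affinity mask does, here is a minimal Python sketch built on `os.sched_setaffinity` (Linux only). The simulation in this thread is presumably not written in Python, so treat this as a stand-in: a native program would make the equivalent `sched_setaffinity(2)` call, or be started through `taskset`. The function name `pin_to_cpus` is hypothetical.

```python
import os


def pin_to_cpus(cpus):
    """Restrict the calling process to the given CPUs and return the resulting set.

    Only CPUs the process is currently allowed to use are kept, so the call
    also behaves inside a cgroup/cpuset that hides part of the machine.
    """
    allowed = os.sched_getaffinity(0)      # pid 0 means "this process"
    target = set(cpus) & allowed
    if not target:
        raise ValueError(f"none of {sorted(cpus)} is in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)


# Usage: call this early, before worker threads or child processes are
# created, since they inherit the mask at creation time. The range below is
# a placeholder for the CPUs local to the GPU this process drives:
#   pin_to_cpus(range(0, 16))
```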

I’ve taken a look at the server topology, and here is the problem: the Linux scheduler switches between CPUs on the fly.

Using the taskset command and CUDA_VISIBLE_DEVICES to force good/bad CPU/GPU affinity:

  • good: 2.5 ms/iteration
  • bad: 7 to 10 ms/iteration

Roughly 3 to 4 times faster with just the right launch parameters.
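For anyone wanting to reproduce this kind of launch, here is a minimal Python sketch of a launcher that starts one process per GPU under `taskset -c` with `CUDA_VISIBLE_DEVICES` set. The exact commands used above were not posted, so the function names, the `./sim` binary, and the CPU ranges in the usage comment are all placeholders; the real GPU-to-CPU mapping has to come from the machine's topology (e.g. `nvidia-smi topo -m`). Setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` is an addition here, so that GPU indices match the order `nvidia-smi` reports.

```python
import os
import subprocess


def build_launch(gpu_index, cpu_list, sim_cmd):
    """Return (argv, env) for one simulation pinned to cpu_list and gpu_index."""
    env = dict(os.environ)
    # Enumerate GPUs in PCI bus order so the index matches nvidia-smi.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # The child process only sees this one GPU (as device 0).
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    # taskset -c takes a cpulist such as "0-15" or "0-7,32-39".
    argv = ["taskset", "-c", cpu_list, *sim_cmd]
    return argv, env


def launch_all(plan, sim_cmd):
    """Start every (gpu_index, cpu_list) pair in plan concurrently; return exit codes."""
    procs = []
    for gpu_index, cpu_list in plan:
        argv, env = build_launch(gpu_index, cpu_list, sim_cmd)
        procs.append(subprocess.Popen(argv, env=env))
    return [p.wait() for p in procs]


# Usage (placeholder mapping, NOT measured on the server discussed here):
#   plan = [(0, "0-15"), (1, "16-23"), (2, "24-31")]
#   launch_all(plan, ["./sim"])
```

Giving each simulation its own disjoint CPU range on the socket local to its GPU keeps the scheduler from migrating the host thread across the CPU interconnect.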

