PCIe RX throughput rises quickly when using MPS

When I use MPS to run several processes concurrently, PCIe RX throughput rises quickly as the number of concurrent processes increases.
I used the vectorAdd sample from the CUDA samples and changed the code slightly so that the workload runs for long enough:

for (int i = 0; i <= 5000000; i++)
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
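For reference, a minimal self-contained version of this modified sample might look like the sketch below (it assumes the standard vectorAdd kernel from the CUDA samples; initialization of the input arrays and error checking are omitted for brevity):

    #include <cuda_runtime.h>

    __global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < numElements)
            C[i] = A[i] + B[i];
    }

    int main() {
        const int numElements = 50000;
        size_t size = numElements * sizeof(float);
        float *d_A, *d_B, *d_C;
        cudaMalloc(&d_A, size);
        cudaMalloc(&d_B, size);
        cudaMalloc(&d_C, size);

        int threadsPerBlock = 256;
        int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;

        // Launch the kernel many times so the workload runs long enough to
        // observe steady-state PCIe RX throughput with an external tool.
        for (int i = 0; i <= 5000000; i++)
            vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

        cudaDeviceSynchronize();
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
        return 0;
    }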

Going from 2 to 3 processes, the jump in PCIe RX throughput is surprisingly large. The same thing happens with other workloads.
I am running these on a T4.
What is happening here? Is this normal?

The kernel launch process necessarily involves some communication over PCIe, so increasing the kernel-launch rate increases the PCIe traffic. Kernel launches are put into a queue; when the queue is full, the CPU thread waits (in the middle of your for-loop), and as the queue drains, the CPU thread continues through the loop. Spread this across multiple processes, with MPS on top, and you will likely see all sorts of hard-to-predict, non-linear behavior.
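The queue-full behavior described above can be observed from the host side: kernel launches are asynchronous and return almost immediately until the launch queue fills, after which each launch call blocks until the GPU drains a slot. A rough sketch, with the long-running `spin` kernel and the 1 ms threshold being illustrative assumptions rather than anything guaranteed by the runtime:

    #include <cuda_runtime.h>
    #include <chrono>
    #include <cstdio>

    // A kernel that busy-waits for roughly the given number of GPU clock cycles,
    // standing in for a long-running workload.
    __global__ void spin(long long cycles) {
        long long start = clock64();
        while (clock64() - start < cycles) { }
    }

    int main() {
        using clk = std::chrono::steady_clock;
        // Launch many copies of a long-running kernel and time each launch call.
        // Early launches return quickly; once the launch queue is full, the CPU
        // thread stalls inside the launch until a queue slot frees up.
        for (int i = 0; i < 2048; i++) {
            auto t0 = clk::now();
            spin<<<1, 1>>>(1000000LL);
            auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                          clk::now() - t0).count();
            if (us > 1000)
                printf("launch %d blocked for %lld us (queue likely full)\n",
                       i, (long long)us);
        }
        cudaDeviceSynchronize();
        return 0;
    }

With several such processes sharing the GPU through MPS, each process alternates between bursts of fast launches and stalls, which is one reason the aggregate PCIe traffic does not scale linearly with process count.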
