I have a GPGPU program using the driver API that does some of its pre-calculation on the CPU. Because of this, GPU load is only about 80%, and that cannot be changed. I therefore arranged for two copies of the process/thread to be launched on one GPU. GPU load then rose to 100% and, accordingly, calculation speed increased. However, I recently tested this application on an RTX 4070 Ti and noticed that although GPU load still increases when two processes/threads are started, the computational speed is lower than with a single process/thread.
This is strange. Perhaps kernels from different processes are now being executed in a time-sliced rather than round-robin fashion?
How can this be explained, and how can this performance degradation be avoided?
Update: using Nsight Compute on the RTX 4070 I can see that when one process is running on the GPU, the kernel duration is 0.98 ms. When another process is started on the same GPU in parallel, the kernel duration in the first process rises to 2.25 ms.
On the RTX 2070 the kernel duration does not change whether one process or two are running.
It seems that on the RTX 4070 kernels are executed in a time-sliced manner, and maybe that is why, even though GPU load grows, the calculation speed decreased.
Is it possible to change this behavior? Any thoughts?
You don’t have any direct control over whether time-slicing or round-robin is used as a context-switch/sharing mechanism when multiple processes are used.
It’s certainly plausible that if you compared a round-robin kernel execution scheme against a time-sliced context-switching scheme, the wall-clock kernel duration itself could be longer in the time-sliced case.
I guess if I were dealing with this, I would seek to either issue the work from the same process, or else to use MPS.
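The "same process" route could look something like the following sketch: two host threads share one driver-API context and launch work on their own streams, so both workloads live in a single context and no inter-process context switching is needed. The module file name, kernel name, launch dimensions, and the assumption that the kernel takes no parameters are all placeholders, not anything from the original program.

```cpp
// Hedged sketch: issue both workloads from one process on separate streams.
// Compile with a C++ compiler and link against libcuda (-lcuda).
#include <cuda.h>
#include <thread>

void worker(CUcontext ctx, CUfunction kernel) {
    cuCtxSetCurrent(ctx);                       // attach this host thread to the shared context
    CUstream stream;
    cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING);
    // Placeholder launch config; assumes a kernel with no parameters.
    cuLaunchKernel(kernel, 256, 1, 1,           // grid dims
                           128, 1, 1,           // block dims
                   0, stream, nullptr, nullptr);
    cuStreamSynchronize(stream);
    cuStreamDestroy(stream);
}

int main() {
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
    CUmodule mod;   cuModuleLoad(&mod, "kernels.cubin");        // placeholder module
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "myKernel");  // placeholder name
    std::thread t1(worker, ctx, fn), t2(worker, ctx, fn);       // two concurrent issuers
    t1.join(); t2.join();
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```

Because both streams belong to one context, the kernels are candidates for genuine concurrent execution rather than per-process context switching.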
To be clear, I have no idea what is happening in your case. It’s not possible to discern anything from the information provided.
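For the MPS route, the usual pattern is to start the MPS control daemon before launching the worker processes, so their work is funneled through a single scheduling context. A minimal sketch, assuming device 0 and a default MPS installation:

```shell
# Hedged sketch: run the two processes under MPS instead of as
# independent contexts. Requires an MPS-capable driver install.
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d            # start the MPS control daemon

# ... launch your two worker processes as usual ...

echo quit | nvidia-cuda-mps-control   # shut the daemon down afterwards
```

Whether this recovers the single-process kernel duration on the RTX 4070 is something you would need to measure with Nsight Compute, as you did above.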