Unstable runtime length with clara-parabricks:4.0.0-1

Hello, I am running clara-parabricks:4.0.0-1 on two RTX 4090s. At first, when processing a single sample, GPU-BWA mem and GPU-GATK4 HaplotypeCaller ran very fast, finishing in about ten minutes. That is a satisfactory speed.
However, as the number of samples increases, the speed gradually decreases, and the GPU utilization of the corresponding program is also very low. A run now takes hours, which is not ideal.

It’s not clear to me what is causing the difference in runtime. I need faster runs because I have a lot of samples to process.

Looking forward to getting your advice!