cuMemAlloc duration is too large (CUDA 12.1.1, L4)

Hi all, I ran an AI inference program on my L4 machine (Driver Version: 525.116.03, CUDA Version: 12.0) and found that the inference time was unexpectedly large from time to time. I expected inference to be faster than on a T4 machine with CUDA 11.4.
I profiled the program using Nsight Systems. It shows that the CUDA HW is idle due to the cuMemAlloc operation.
Can anyone explain this observation? Thanks.


Hi,

Could you please share the TensorRT version with which you're facing the issue, a model that reproduces it, and the relevant steps/scripts?

Thank you.

OK. I am using TensorRT 8.6.0.12, but the model and sample code cannot be disclosed to external parties.
I wrote a simple demo that just reads a series of PNG files, uploads them to the device, and then releases the memory.

Here is the Nsight Systems report. The cuMemAlloc calls often take more than 300 ms. Even though I did not use pinned memory, memory allocation on the device is far too slow. While I ran the demo, there was no other process on the same card (another process was running on the other GPU).
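For reference, here is a minimal sketch of the kind of demo described above, assuming the CUDA runtime API (cudaMalloc forwards to cuMemAlloc in the driver). The buffer size and iteration count are placeholders, and the PNG decoding is replaced by a zero-filled pageable host buffer:

```cuda
#include <cstdio>
#include <chrono>
#include <vector>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 4ull << 20;        // ~4 MiB, stand-in for one decoded PNG
    std::vector<char> host(bytes, 0);       // pageable (not pinned) host buffer

    for (int i = 0; i < 10; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        void* dev = nullptr;
        cudaMalloc(&dev, bytes);            // shows up as cuMemAlloc in Nsight Systems
        auto t1 = std::chrono::steady_clock::now();

        // upload to device, then release, as in the described demo
        cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
        cudaFree(dev);

        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        std::printf("iter %d: cudaMalloc took %.3f ms\n", i, ms);
    }
    return 0;
}
```

Timing each cudaMalloc separately from the copy isolates the allocation cost that the profiler attributes to cuMemAlloc.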

deviceQuery shows the CUDA driver version is 12.0. While I have no clue why the L4 is slower than the T4, I will upgrade to 12.1 later and check the latency problem again.

Hi,

Are you still facing the same issue on 12.1?
If you still face the same issue, please share the repro model and the relevant scripts and steps.

Thank you.