How to use CUDA Graphs more efficiently

I’d like to use CUDA Graphs in my ML engine. Some input and output parameters (the input and output pointers) may change frequently, so it looks like I would have to create many CUDA graphs. I don’t want to do that, because there are so many inputs and outputs.

I read the NVIDIA Technical Blog article “Employing CUDA Graphs in a Dynamic Environment”, which describes two ways to use CUDA Graphs.
Updating the CUDA graph looks like it could work for my situation, but its performance is a problem.
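
For context, here is a minimal sketch of what updating a graph could look like when only the I/O pointers change. It assumes a single kernel node added through the explicit graph API and CUDA 11.4 or later (for `cudaGraphInstantiateWithFlags`). The kernel `scale2`, the function `run_update_demo`, and the buffer names are hypothetical placeholders, not part of my engine, and the sketch does not measure the update cost I am worried about.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for one layer of the engine: out[i] = 2 * in[i].
__global__ void scale2(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Sketch-level error handling: report and bail out (no cleanup on failure).
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::printf("CUDA error: %s\n", cudaGetErrorString(err_));  \
            return -1;                                                  \
        }                                                               \
    } while (0)

// Builds a one-node graph once, then retargets its I/O pointers in the
// instantiated graph. Returns the number of wrong output elements,
// or -1 on a CUDA error.
int run_update_demo() {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    float *inA, *outA, *inB, *outB;
    CHECK(cudaMalloc(&inA, bytes));
    CHECK(cudaMalloc(&outA, bytes));
    CHECK(cudaMalloc(&inB, bytes));
    CHECK(cudaMalloc(&outB, bytes));

    std::vector<float> hostA(n, 1.0f), hostB(n, 3.0f);
    CHECK(cudaMemcpy(inA, hostA.data(), bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(inB, hostB.data(), bytes, cudaMemcpyHostToDevice));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Kernel arguments; the values are copied when the node is added or updated.
    const float* in = inA;
    float* out = outA;
    int count = n;
    void* args[] = { &in, &out, &count };

    cudaKernelNodeParams params = {};
    params.func = (void*)scale2;
    params.gridDim = dim3((n + 255) / 256);
    params.blockDim = dim3(256);
    params.sharedMemBytes = 0;
    params.kernelParams = args;
    params.extra = nullptr;

    // Build and instantiate the graph once, using the first pair of buffers.
    cudaGraph_t graph;
    cudaGraphNode_t node;
    cudaGraphExec_t exec;
    CHECK(cudaGraphCreate(&graph, 0));
    CHECK(cudaGraphAddKernelNode(&node, graph, nullptr, 0, &params));
    CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));
    CHECK(cudaGraphLaunch(exec, stream));

    // Point the same executable graph at the second pair of buffers,
    // without building or instantiating a new graph.
    in = inB;
    out = outB;
    CHECK(cudaGraphExecKernelNodeSetParams(exec, node, &params));
    CHECK(cudaGraphLaunch(exec, stream));
    CHECK(cudaStreamSynchronize(stream));

    std::vector<float> resA(n), resB(n);
    CHECK(cudaMemcpy(resA.data(), outA, bytes, cudaMemcpyDeviceToHost));
    CHECK(cudaMemcpy(resB.data(), outB, bytes, cudaMemcpyDeviceToHost));

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (resA[i] != 2.0f) ++mismatches;  // first launch used inA/outA
        if (resB[i] != 6.0f) ++mismatches;  // second launch used inB/outB
    }

    CHECK(cudaGraphExecDestroy(exec));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaStreamDestroy(stream));
    cudaFree(inA); cudaFree(outA); cudaFree(inB); cudaFree(outB);
    return mismatches;
}
```

Calling `run_update_demo()` from `main` and compiling with `nvcc` should be enough to try it. In a real engine there would be one such update per node whose pointers change, which is where the cost adds up.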

My question is: is there another way to use (or change) a CUDA graph efficiently, or must I create enough CUDA graphs to cover every case?

Dear @JeremyYuan,
Just want to know: are you trying to generate a TRT engine from ONNX, and planning to use CUDA Graphs with it? If so, did you check the --useCudaGraph flag with trtexec? Could you give some details about your use case?