How to profile multiple TensorRT model inferences simultaneously using CUPTI


I am currently using the CUPTI activity API in my program to profile TensorRT model inference. However, I have encountered a problem where several models may be launched simultaneously, and I want to be able to profile all of them.

As far as I understand, CUPTI is a low-level API, and my assumption is that it can only be initialized once per process. Therefore, I implemented a singleton class to effectively utilize the CUPTI activity API. Kindly correct me if my understanding is incorrect.

After calling the initialize function in the CUPTI singleton class, I tried concurrently launching multiple TensorRT inference threads. Although I was able to capture all kernel events from these models, I am facing difficulty in identifying which model each kernel event belongs to.

Would anyone be able to offer some guidance on how to resolve this matter? Thank you very much.

Hi zhi_xz,
All the kernel records have a ‘correlationId’ field, which you can use to correlate them with the API records.
The API records have a ‘threadId’ field which can be used to distinguish between APIs launched by each thread.
Additionally, you can also define NVTX ranges around each model and collect the ‘CUPTI_ACTIVITY_KIND_MARKER’ activity which can help to identify the kernel events.
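
To make the correlationId/threadId suggestion concrete, here is a minimal sketch of the join, assuming each model is enqueued from its own thread. ApiRecord and KernelRecord are hypothetical, simplified stand-ins holding only the fields copied out of the CUPTI API and kernel activity records in your buffer-completed callback; they are not CUPTI types, and kernelsByThread is an illustrative helper name.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-ins for fields copied out of CUPTI records.
struct ApiRecord {       // from a runtime/driver API activity record
    uint32_t correlationId;
    uint32_t threadId;
};
struct KernelRecord {    // from a kernel activity record
    uint32_t correlationId;
    std::string name;
};

// Group kernel names by the CPU thread whose API call launched them.
std::map<uint32_t, std::vector<std::string>> kernelsByThread(
    const std::vector<ApiRecord>& apis,
    const std::vector<KernelRecord>& kernels) {
    // correlationId -> threadId of the launching API call
    std::unordered_map<uint32_t, uint32_t> threadOf;
    for (const auto& a : apis) threadOf[a.correlationId] = a.threadId;

    std::map<uint32_t, std::vector<std::string>> out;
    for (const auto& k : kernels) {
        auto it = threadOf.find(k.correlationId);
        // Kernels without a matching API record are skipped in this sketch.
        if (it != threadOf.end()) out[it->second].push_back(k.name);
    }
    return out;
}
```

This only works if the runtime or driver API activity kind is enabled alongside the kernel kind, so that an API record exists for each launch. If two models are enqueued from the same thread, the thread id alone will not separate them.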

Hi RahulDhoot,
Thank you for your prompt response.

Would you happen to have any sample code showcasing the methods you mentioned?

You can refer to the sample ‘cupti_correlation’ to see how API and kernel records can be correlated.

I apologize, I am not fully understanding your point. From my understanding, using push/pop NVTX ranges will create a ‘CUPTI_ACTIVITY_KIND_MARKER’ activity that is capable of being captured by CUPTI.

// model 1
nvtxRangePushA("Model 1");
context_1->enqueue(model_info_.batchSize, mBuffers_->getDeviceBindings().data(),
                   cuda_stream_, nullptr);
nvtxRangePop();  // close the "Model 1" range
// model 2
nvtxRangePushA("Model 2");
context_2->enqueue(model_info_.batchSize, mBuffers_->getDeviceBindings().data(),
                   cuda_stream_, nullptr);
nvtxRangePop();  // close the "Model 2" range
// model 3
nvtxRangePushA("Model 3");
context_3->enqueue(model_info_.batchSize, mBuffers_->getDeviceBindings().data(),
                   cuda_stream_, nullptr);
nvtxRangePop();  // close the "Model 3" range

The code above launches three TensorRT models, and I have defined a push/pop NVTX range around each of them. However, I am still unsure how to accurately match the kernels captured by CUPTI to their respective models. I would appreciate any guidance or suggestions on this matter. Thank you.

Hi zhi_xz,
When you capture NVTX ranges, CUPTI provides marker activity records that carry a timestamp and the marker name.
The kernels launched within those NVTX ranges will also have the timestamps within the NVTX ranges.
This will be easy to visualize if you can plot the timeline based on the CUPTI records.
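
As a rough sketch of how the marker records could be turned into time ranges for such a timeline: a push/pop range shows up as a start record and an end record that share an id, with the name carried on the start record. MarkerRecord and NamedRange below are hypothetical, simplified structs (not CUPTI types), and the sketch assumes the records are processed in timestamp order so each start is seen before its end.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified view of one marker activity record.
struct MarkerRecord {
    uint32_t id;          // same id on the start and end record of one range
    bool isStart;         // true for a range start, false for a range end
    uint64_t timestamp;
    std::string name;     // only expected on the start record
};

struct NamedRange {
    std::string name;
    uint64_t start;
    uint64_t end;
};

// Pair start/end marker records by id into closed time ranges.
std::vector<NamedRange> buildRanges(const std::vector<MarkerRecord>& markers) {
    std::unordered_map<uint32_t, MarkerRecord> open;  // id -> pending start
    std::vector<NamedRange> ranges;
    for (const auto& m : markers) {
        if (m.isStart) {
            open[m.id] = m;
        } else {
            auto it = open.find(m.id);
            if (it == open.end()) continue;  // end without a start: ignore
            ranges.push_back({it->second.name, it->second.timestamp, m.timestamp});
            open.erase(it);
        }
    }
    return ranges;
}
```

Note that enqueue is asynchronous, so a range pushed and popped on the CPU thread brackets the launch calls; the kernels themselves may execute on the GPU after the range has closed.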

However, it is possible for multiple models to be launched within the same NVTX range. As a result, kernels from different models may fall within the same timestamp range which makes it difficult to match them with their respective models.
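
One possible way to reduce this ambiguity is to combine the two hints from earlier in the thread: instead of comparing the kernel's own timestamps, look up the API call that launched the kernel (via correlationId) and check which NVTX range on the same thread contains that call. The sketch below assumes each range wraps exactly one model's enqueue on the thread that pushed it, that the range's thread id has been taken from the marker record, and that API and marker timestamps are comparable. All struct and function names are hypothetical, not CUPTI types.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified records; fields are copied from CUPTI activity records.
struct ApiRecord    { uint32_t correlationId; uint32_t threadId; uint64_t start; };
struct KernelRecord { uint32_t correlationId; std::string name; };
struct ThreadRange  { uint32_t threadId; uint64_t start; uint64_t end; std::string model; };

// Return the model label for a kernel, or "" if no range matches.
std::string modelOfKernel(const KernelRecord& k,
                          const std::vector<ApiRecord>& apis,
                          const std::vector<ThreadRange>& ranges) {
    for (const auto& a : apis) {
        if (a.correlationId != k.correlationId) continue;
        // Match on the launching thread AND the CPU-side launch time,
        // so ranges that overlap in time on other threads do not collide.
        for (const auto& r : ranges) {
            if (r.threadId == a.threadId && r.start <= a.start && a.start <= r.end)
                return r.model;
        }
    }
    return "";
}
```

With ranges {thread 100, 10..20, "Model 1"} and {thread 200, 11..21, "Model 2"}, a kernel launched by an API call on thread 200 at time 13 is attributed to "Model 2" even though both ranges cover that instant. This does not help when several models are enqueued inside a single range on one thread; giving each model its own range, or its own CUDA stream and comparing the kernel record's stream id, would be alternatives to try in that case.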