I’m trying to instrument a PyTorch-based deep learning workflow (including the transformers library), i.e., insert performance-measurement code into it, to measure GPU performance, but I’ve run into several issues. Here’s my current approach and the problems I’m facing:
- I wrote a custom shared library and used LD_PRELOAD to intercept cudaSetDevice (and a few other CUDA Runtime APIs). The idea is to invoke CUPTI (PM sampling or the Activity API) from within these intercepted functions in order to start collecting performance metrics (a simplified sketch of the interposer follows this list).
- In a pure CUDA program (no PyTorch), this hook approach works fine: I can intercept cudaSetDevice and call the CUPTI APIs without any error messages.
- In a PyTorch environment, however (especially when loading large models from the transformers library), this instrumentation frequently leads to conflicts or errors: CUPTI may report CUPTI_ERROR_UNKNOWN (999), or the Python process ends up hanging and never exits cleanly.
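For reference, here is roughly what the interposer from the first bullet looks like. It is a minimal sketch assuming the Activity API path; error checking is omitted, record parsing is stubbed out, and file names like libhook.so are just placeholders:

```cpp
// hook.cpp -- interposer loaded via LD_PRELOAD, e.g.:
//   g++ -shared -fPIC hook.cpp -o libhook.so -ldl -lcupti -lcudart
//   LD_PRELOAD=./libhook.so python train.py
#include <dlfcn.h>
#include <cstdint>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cupti.h>

// CUPTI Activity API buffer-management callbacks.
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                     size_t *maxNumRecords) {
  *size = 8 * 1024 * 1024;
  *buffer = (uint8_t *)malloc(*size);
  *maxNumRecords = 0;  // no limit on records per buffer
}

static void CUPTIAPI bufferCompleted(CUcontext, uint32_t, uint8_t *buffer,
                                     size_t, size_t /*validSize*/) {
  // A real tool would walk the records with cuptiActivityGetNextRecord() here.
  free(buffer);
}

// Start activity collection exactly once (not thread-safe; a real hook should
// guard this with std::call_once or similar).
static void startCuptiOnce() {
  static bool started = false;
  if (started) return;
  started = true;
  cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
  cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
}

// Interposed symbol: start CUPTI, then forward to the real cudaSetDevice.
extern "C" cudaError_t cudaSetDevice(int device) {
  using real_fn = cudaError_t (*)(int);
  static real_fn real = (real_fn)dlsym(RTLD_NEXT, "cudaSetDevice");
  startCuptiOnce();
  return real(device);
}
```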
-
I suspect the root causes might include:
• PyTorch’s GPU initialization process is complex, potentially involving multi-threading, multi-processing, or low-level Driver API calls (e.g. cuDevicePrimaryCtxRetain), so intercepting cudaSetDevice alone does not cover all possible code paths (see the callback sketch after this list).
• My hook likely calls CUPTI too early, before PyTorch has fully set up the CUDA environment, causing context or timing conflicts.
• There could be library version conflicts (e.g., issues with OpenMP / libgomp), resulting in symbol lookup failures if different parts of the environment expect different OpenMP implementations.
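On the first point: rather than interposing individual Runtime entry points, CUPTI’s own callback API can report context creation directly, which covers both the Runtime and the Driver API path (including cuDevicePrimaryCtxRetain). A rough sketch of what subscribing to the resource domain could look like (the installResourceCallback name is made up; it would be called once, e.g. from the preloaded library’s constructor):

```cpp
#include <cstdio>
#include <cupti.h>

static CUpti_SubscriberHandle g_subscriber;

// Invoked by CUPTI on resource events. CUPTI_CBID_RESOURCE_CONTEXT_CREATED
// fires once the context actually exists, so collection is not started too early.
static void CUPTIAPI onResourceEvent(void *userdata, CUpti_CallbackDomain domain,
                                     CUpti_CallbackId cbid, const void *cbdata) {
  if (domain == CUPTI_CB_DOMAIN_RESOURCE &&
      cbid == CUPTI_CBID_RESOURCE_CONTEXT_CREATED) {
    const CUpti_ResourceData *rd = (const CUpti_ResourceData *)cbdata;
    std::printf("CUDA context created: %p\n", (void *)rd->context);
    // Start activity / PM collection for rd->context here.
  }
}

// Call once, e.g. from an __attribute__((constructor)) in the preloaded library.
void installResourceCallback() {
  cuptiSubscribe(&g_subscriber, (CUpti_CallbackFunc)onResourceEvent, nullptr);
  cuptiEnableCallback(1, g_subscriber, CUPTI_CB_DOMAIN_RESOURCE,
                      CUPTI_CBID_RESOURCE_CONTEXT_CREATED);
}
```

One thing I’m also wondering about: as far as I understand, CUPTI allows only one subscriber per process, and PyTorch’s own profiler (Kineto) also uses CUPTI, so a collision there seems like another plausible source of the CUPTI_ERROR_UNKNOWN-style failures.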
I’d like to ask the community:
- Has anyone successfully attached CUPTI via LD_PRELOAD in a PyTorch program? Which API calls are critical to intercept, and how do you make sure the instrumentation does not conflict with PyTorch’s internal setup?
- Would it be better to use a “proxy library” approach (renaming libcudart.so and inserting a shim in between) or to write a PyTorch C++ extension that initializes CUPTI explicitly after the script has started (a sketch of the extension route follows these questions)?
- If anyone here has practical experience combining PyTorch + CUPTI, do you have recommended best practices or pitfalls to be aware of?
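To make the second question concrete, here is a rough sketch of what the extension route could look like: a tiny pybind11 module exposing start/stop functions (the module name cupti_probe and the build command are illustrative only; again Activity API only, no error checking):

```cpp
// cupti_probe.cpp -- hypothetical extension, built e.g. with
//   torch.utils.cpp_extension.load(name="cupti_probe", sources=["cupti_probe.cpp"],
//                                  extra_ldflags=["-lcupti"])
#include <torch/extension.h>
#include <cupti.h>
#include <cstdlib>

static void CUPTIAPI bufRequested(uint8_t **buf, size_t *size, size_t *maxRecords) {
  *size = 8 * 1024 * 1024;
  *buf = (uint8_t *)malloc(*size);
  *maxRecords = 0;
}

static void CUPTIAPI bufCompleted(CUcontext, uint32_t, uint8_t *buf,
                                  size_t, size_t /*validSize*/) {
  // Parse records with cuptiActivityGetNextRecord() in a real tool.
  free(buf);
}

// Begin collecting kernel activity; meant to be called from Python after
// torch.cuda has been initialized, so there is no early-init race.
static void start() {
  cuptiActivityRegisterCallbacks(bufRequested, bufCompleted);
  cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
}

// Stop collection and flush whatever CUPTI has buffered so far.
static void stop() {
  cuptiActivityDisable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
  cuptiActivityFlushAll(0);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("start", &start, "Begin CUPTI activity collection");
  m.def("stop", &stop, "Stop collection and flush records");
}
```

The training script would then call cupti_probe.start() right before the main loop and cupti_probe.stop() at the end (or from an atexit handler), which is essentially the workflow I describe below, without touching PyTorch internals.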
My ultimate goal is to start CUPTI performance measurement automatically in a PyTorch training/inference workflow right before the main computation begins, and to stop collecting either at program termination or at the end of a specific phase, so that I can gather meaningful GPU performance metrics without manually modifying the PyTorch codebase. Any insights or suggestions would be greatly appreciated!