I need to get device events at run time in a Python application that uses CUDA to run on GPUs.
Right now, I have a C library that works well for C applications running CUDA code. The C library spawns a thread that continuously samples the device where the CUDA code is executed.
I tried to use this C library to sample a Python application (using pycuda), but it doesn’t work. The application doesn’t crash and finishes as expected, but the events reported by the sampling C thread are always 0.
I haven’t seen any CUPTI bindings for Python, so I don’t know how I’m supposed to sample Python code. Any clue about making the C library work with Python code?
My end goal is to sample TensorFlow applications, but I cannot even sample a toy example.
This can happen when the CUPTI-based sampling library and the CUDA code launched from Python run in different processes.
Another reason could be that event sampling is happening on a different CUDA context than the one the application is using. Most CUPTI events can only be collected at the CUDA context level. The profiling scope of an event can be queried with the API cuptiEventGetAttribute() using the attribute type CUPTI_EVENT_ATTR_PROFILING_SCOPE; refer to the enum CUpti_EventProfilingScope for all scopes.
I’m using a C library to spawn a thread in the application. This thread is responsible for sampling the device selected via CUDA_VISIBLE_DEVICES.
This method works for all the CUDA applications written in C/C++ that I tested.
When moving to Python code, I use ctypes to load the same C library and spawn the sampling thread.
But it doesn’t work.
Here is the snippet to initialize the sampling thread from my Python code:
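A minimal sketch of that initialization, assuming a hypothetical libsampler.so that exports a start_sampling_thread() entry point (the real library and symbol names will differ):

```python
import ctypes

def start_sampler(lib_path="./libsampler.so"):
    """Load the C sampling library into this process and start its thread.

    `start_sampling_thread` is a hypothetical export; substitute the real
    entry point of the sampling library.
    """
    sampler = ctypes.CDLL(lib_path)           # loads the .so into *this* process
    sampler.start_sampling_thread.restype = ctypes.c_int
    rc = sampler.start_sampling_thread()      # spawns the sampling pthread in C
    return sampler, rc                        # keep a reference so it isn't unloaded
```

Because ctypes.CDLL loads the library into the Python interpreter's own process, the sampling thread and the pycuda code share an address space.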
Since they are in the same process, the CUDA context should be the same for both threads (the main thread doing useful work and the sampling thread).
nvprof is able to profile events with my Python code, but the readings are only reported when the application finishes, and I need to read events at runtime.
Can you please cross-check that the CUDA context used in both threads - the main thread and the sampling thread - is the same? The CUDA driver API cuCtxGetCurrent() or the corresponding pycuda interface can be used.
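One way to do that comparison from Python is a thin ctypes call straight into the driver API, a sketch assuming libcuda.so is on the loader path; call it from both threads and compare the returned handles:

```python
import ctypes

def current_context_handle(libcuda="libcuda.so"):
    """Return the raw CUcontext handle current on the calling thread,
    or None if the thread has no current context."""
    cuda = ctypes.CDLL(libcuda)
    ctx = ctypes.c_void_p()
    rc = cuda.cuCtxGetCurrent(ctypes.byref(ctx))   # CUresult, 0 == CUDA_SUCCESS
    if rc != 0:
        raise RuntimeError("cuCtxGetCurrent failed with CUresult %d" % rc)
    return ctx.value
```

If the handle seen by the sampling thread differs from the one in the main thread (or is None), the sampler is counting events on the wrong context.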
You were correct. The contexts used by C and Python were different.
Now, I’m intercepting the cuCtxSetCurrent function (via the LD_PRELOAD mechanism) to capture the context that is being used.
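An interposer of that kind can look roughly like this sketch. The opaque typedefs stand in for the real declarations in cuda.h, and sampler_get_captured_context is a hypothetical accessor the sampling thread would call:

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

typedef void *CUcontext;   /* opaque handle; the real typedef lives in cuda.h */
typedef int CUresult;      /* 0 == CUDA_SUCCESS */

static CUcontext captured_ctx = NULL;   /* last context made current by the app */

/* Interposed via LD_PRELOAD: record the context, then forward to the
 * real driver entry point found with dlsym(RTLD_NEXT, ...). */
CUresult cuCtxSetCurrent(CUcontext ctx) {
    static CUresult (*real_set)(CUcontext) = NULL;
    if (!real_set)
        real_set = (CUresult (*)(CUcontext))dlsym(RTLD_NEXT, "cuCtxSetCurrent");
    captured_ctx = ctx;                      /* visible to the sampling thread */
    return real_set ? real_set(ctx) : 0;     /* no real driver found: report success */
}

/* Hypothetical accessor: the sampling thread binds its CUPTI event
 * collection to the context captured here. */
CUcontext sampler_get_captured_context(void) {
    return captured_ctx;
}
```

Built as a shared object and injected with LD_PRELOAD, this sees the cuCtxSetCurrent calls made by the driver-API layer underneath pycuda, so the sampling thread can attach to the same context the application is actually using.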