I need to get device events at run time in a Python application that uses CUDA to run on GPUs.
Right now, I have a C library that works well for C applications running CUDA code. The C library spawns a thread that continuously samples the device where the CUDA code is executed.
I tried to use this C library to sample Python code (using pycuda), but it doesn’t work. The application doesn’t crash and finishes as expected, but the events reported by the C sampling thread are always 0.
I haven’t found any version of CUPTI for Python, so I don’t know how I’m supposed to sample Python code. Any clue about how to make the C library work with Python code?
My end goal is to sample TensorFlow applications, but I cannot even sample a toy example.
Thank you.