CUPTI with Python

I need to get device events at run time in a Python application that uses CUDA to run on GPUs.

Right now, I have a C library that works well for C applications running CUDA code. The C library spawns a thread that continuously samples the device where the CUDA code is executed.

I tried to use this C library to sample Python code (using pycuda), but it doesn’t work. The application doesn’t crash and finishes as expected, but the events reported by the sampling C thread are always 0.

I haven’t seen any CUPTI bindings for Python, so I don’t know how I’m supposed to sample Python code. Any clue on how to make the C library work with Python code?

My end goal is to sample TensorFlow applications, but I cannot even sample a toy example.

Thank you.

This can happen when the CUPTI-based sampling library and the underlying CUDA code launched from Python run in different processes.

Another possible reason is that event sampling is happening on a different CUDA context than the one the application is using. Most CUPTI events can be collected at the CUDA context level only. The scope of an event can be queried with the API cuptiEventGetAttribute() using the attribute type CUPTI_EVENT_ATTR_PROFILING_SCOPE. Refer to the enum CUpti_EventProfilingScope for all scopes.

For quick reference, please check the metrics table at https://docs.nvidia.com/cupti/Cupti/r_main.html#metrics-reference. The scope value “Multi-context” in the table indicates that the underlying events are collected at the context level.

Is nvprof able to profile events with your Python code?

I’m using a C library to spawn a thread in the application. This thread is responsible for sampling the device selected through CUDA_VISIBLE_DEVICES.
This method works for all the CUDA applications written in C/C++ that I tested.

For Python code, I use ctypes to load the same C library and spawn the sampling thread.
But it doesn’t work.

Here is the snippet to initialize the sampling thread from my Python code:

import ctypes
func = ctypes.CDLL("libutils.so")
func.spawnCUPTI()

Since they are in the same process, the CUDA context should be the same for both threads (the main thread doing useful work and the sampling thread).

nvprof is able to profile events with my Python code, but the readings are only reported when the application finishes, and I need to read events at runtime.

Hi cortega

Can you please cross-check that the CUDA context used in both threads (the main thread and the sampling thread) is the same? The CUDA driver API cuCtxGetCurrent() or the corresponding pycuda interface can be used.
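
A minimal sketch of that check from the Python side, calling cuCtxGetCurrent() through ctypes. It assumes a Linux system where the driver library is available as libcuda.so.1; the helper names current_cuda_context and contexts_per_thread are hypothetical and only serve the illustration. If the driver cannot be loaded, or the call does not succeed, the helper reports None instead of raising.

```python
import ctypes
import threading


def current_cuda_context(driver=None):
    """Return the CUcontext current on the calling thread as an int.

    Returns None when the driver cannot be loaded, the call does not
    succeed, or no context is current on this thread.
    """
    if driver is None:
        try:
            # Assumption: Linux driver library name.
            driver = ctypes.CDLL("libcuda.so.1")
        except OSError:
            return None
    ctx = ctypes.c_void_p()
    # CUresult cuCtxGetCurrent(CUcontext *pctx); 0 means CUDA_SUCCESS.
    status = driver.cuCtxGetCurrent(ctypes.byref(ctx))
    if status != 0:
        return None
    return ctx.value  # None when the handle is NULL


def contexts_per_thread(driver=None):
    """Record the current context on the main thread and on a new thread."""
    results = {}

    def record(name):
        results[name] = current_cuda_context(driver)

    record("main")
    worker = threading.Thread(target=record, args=("sampler",))
    worker.start()
    worker.join()
    return results


if __name__ == "__main__":
    # Run this after the application (e.g. pycuda) has created its context.
    print(contexts_per_thread())
```

The current context is per-thread state in the CUDA driver API, so two threads in the same process can report different handles, or none at all on a thread that never made a context current. Logging the same value from inside the C sampling thread and comparing it with the "main" entry shows whether both sides are on the same context.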

Hi mjain,

You were correct. The contexts used with C and with Python were different.
Now I’m intercepting the cuCtxSetCurrent function (via the LD_PRELOAD mechanism) to capture the context that is being used.

Thank you.

Good to know that you were able to find the root cause of the issue. Thanks!