Using NVTX to trace Python GIL with Nsight Systems

How can I see when the CPython GIL is held using Nsight Systems?

I’m trying to track down the causes of intermittent latency spikes (200–400 ms) in a multithreaded Python program that mixes CPU-bound work in C extensions with interspersed async network/disk/serial I/O. As part of understanding these spikes, I’d like to see when Python threads are waiting for or holding the CPython global interpreter lock (GIL). I’ve tried modifying CPython’s ceval_gil.h to track GIL usage using NVTX’s nvtxDomainRangePushEx and nvtxDomainRangePop, but my traces look very suspicious – they show many overlapping GIL regions, while I expect only one thread to hold the GIL at a time. As an alternative, I tried naming the GIL mutex using nvtxDomainResourceCreate, but I couldn’t see it anywhere in Nsight Systems.

(Screenshot: unexpected overlapping GIL regions)

What I’ve tried with ranges:

I modified v3.10.4/Python/ceval_gil.h by replacing every call to MUTEX_LOCK(gil->mutex) and MUTEX_UNLOCK(gil->mutex) with GIL_MUTEX_LOCK(gil->mutex) and GIL_MUTEX_UNLOCK(gil->mutex) macros:

static nvtxDomainHandle_t GIL_NVTX_DOMAIN;
static nvtxStringHandle_t GIL_NVTX_WAITING_MSG;
static nvtxStringHandle_t GIL_NVTX_HOLDING_MSG;
#define GIL_MUTEX_LOCK(m) \
    nvtxDomainRangePushEx(GIL_NVTX_DOMAIN, &(nvtxEventAttributes_t){.message={.registered=GIL_NVTX_WAITING_MSG}, .color=0xFF0000FF, .version=NVTX_VERSION, .size=NVTX_EVENT_ATTRIB_STRUCT_SIZE, .messageType = NVTX_MESSAGE_TYPE_REGISTERED, .colorType=NVTX_COLOR_ARGB}); \
    MUTEX_LOCK(m); \
    nvtxDomainRangePop(GIL_NVTX_DOMAIN); \
    nvtxDomainRangePushEx(GIL_NVTX_DOMAIN, &(nvtxEventAttributes_t){.message={.registered=GIL_NVTX_HOLDING_MSG}, .color=0xFFFF0000, .version=NVTX_VERSION, .size=NVTX_EVENT_ATTRIB_STRUCT_SIZE, .messageType = NVTX_MESSAGE_TYPE_REGISTERED, .colorType=NVTX_COLOR_ARGB})
#define GIL_MUTEX_UNLOCK(m) \
    MUTEX_UNLOCK(m); \
    nvtxDomainRangePop(GIL_NVTX_DOMAIN)

static void _gil_initialize(struct _gil_runtime_state *gil)
{
    _Py_atomic_int uninitialized = {-1};
    gil->locked = uninitialized;
    gil->interval = DEFAULT_INTERVAL;
    GIL_NVTX_DOMAIN = nvtxDomainCreate("GIL NVTX Domain");
    GIL_NVTX_WAITING_MSG = nvtxDomainRegisterStringA(GIL_NVTX_DOMAIN, "Waiting for GIL");
    GIL_NVTX_HOLDING_MSG = nvtxDomainRegisterStringA(GIL_NVTX_DOMAIN, "Holding GIL");
}

What I’ve tried with resource naming:

I modified v3.10.4/Python/ceval_gil.h by changing create_gil and destroy_gil:

static nvtxResourceHandle_t GIL_NVTX_HANDLE;
static void create_gil(struct _gil_runtime_state *gil)
{
    MUTEX_INIT(gil->mutex);
    nvtxResourceAttributes_t attribs = {0};
    attribs.version = NVTX_VERSION;
    attribs.size = NVTX_RESOURCE_ATTRIB_STRUCT_SIZE;
    attribs.identifierType = NVTX_RESOURCE_TYPE_GENERIC_POINTER;
    attribs.identifier.pValue = (const void*)(&(gil->mutex));
    attribs.messageType = NVTX_MESSAGE_TYPE_ASCII;
    attribs.message.ascii = "GIL Mutex";
    GIL_NVTX_HANDLE = nvtxDomainResourceCreate(GIL_NVTX_DOMAIN, &attribs);
#ifdef FORCE_SWITCHING
    MUTEX_INIT(gil->switch_mutex);
#endif
    COND_INIT(gil->cond);
#ifdef FORCE_SWITCHING
    COND_INIT(gil->switch_cond);
#endif
    _Py_atomic_store_relaxed(&gil->last_holder, 0);
    _Py_ANNOTATE_RWLOCK_CREATE(&gil->locked);
    _Py_atomic_store_explicit(&gil->locked, 0, _Py_memory_order_release);
}

static void destroy_gil(struct _gil_runtime_state *gil)
{
    nvtxDomainResourceDestroy(GIL_NVTX_HANDLE);
    /* some pthread-like implementations tie the mutex to the cond
     * and must have the cond destroyed first.
     */
    COND_FINI(gil->cond);
    MUTEX_FINI(gil->mutex);
#ifdef FORCE_SWITCHING
    COND_FINI(gil->switch_cond);
    MUTEX_FINI(gil->switch_mutex);
#endif
    _Py_atomic_store_explicit(&gil->locked, -1,
                              _Py_memory_order_release);
    _Py_ANNOTATE_RWLOCK_DESTROY(&gil->locked);
}

What version of Nsight Systems are you using?

Doron’s team has added some additional Python profiling support, some of it shipping in the current 2023.2 release and more coming in the 2023.3 release, which is due to go out this week. I’m going to loop him in for suggestions.

@dofek


I was using nsight-systems-2023.1.1_2023.1.1.127-1_amd64.deb on the system being profiled and NsightSystems-macos-public-2023.1.1.127-3236574.dmg for the UI.

I’ve just switched to NVIDIA Nsight Systems version 2023.3.1.92-233133147223v0 on the system being profiled and 2023.3.1.92-233133147223v0 OSX for the UI. I’m still seeing the same overlapping GIL problem with my attempt at using nvtxDomainRangePushEx and nvtxDomainRangePop.

MUTEX_LOCK and MUTEX_UNLOCK are not the correct places to generate NVTX ranges for your purpose: locking gil->mutex is not the same as holding the GIL, since that mutex only guards the GIL’s internal state (gil->locked and friends).
Notice that inside drop_gil(), MUTEX_LOCK(gil->mutex) is called by the thread that currently holds the GIL, so GIL_MUTEX_LOCK pushes a misleading “Waiting for GIL” range there.
Also, at the end of take_gil(), MUTEX_UNLOCK(gil->mutex) is called, and GIL_MUTEX_UNLOCK pops the NVTX range, but it shouldn’t, because the GIL is still held.

Here is a patch to ceval_gil.h that should provide your desired behavior.
gil_nvtx_trace.diff (2.4 KB)
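
To illustrate the idea (this is a simplified, self-contained sketch of the ordering, not the contents of the attached patch): the “Waiting for GIL” range should cover the whole wait until the thread actually owns the GIL, and the “Holding GIL” range should start at that point and end only where drop_gil() clears gil->locked – not where gil->mutex is unlocked. Here the GIL is modeled by a plain pthread mutex/condvar plus a locked flag; the exact include path and link flags depend on how NVTX is installed on your system:

/* Simplified model of take_gil()/drop_gil(): a mutex + condition variable
 * guarding a "locked" flag that plays the role of the GIL. The NVTX calls
 * show where the ranges belong. Build flags are an assumption and depend on
 * your NVTX install, e.g.: gcc gil_demo.c -I<nvtx include dir> -lpthread -ldl */
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>
#include <nvToolsExt.h>

static pthread_mutex_t gil_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gil_cond  = PTHREAD_COND_INITIALIZER;
static int gil_locked = 0;                      /* stands in for gil->locked */
static nvtxDomainHandle_t gil_domain;

static void push_range(const char *msg, uint32_t color)
{
    nvtxEventAttributes_t a = {0};
    a.version = NVTX_VERSION;
    a.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
    a.messageType = NVTX_MESSAGE_TYPE_ASCII;
    a.message.ascii = msg;
    a.colorType = NVTX_COLOR_ARGB;
    a.color = color;
    nvtxDomainRangePushEx(gil_domain, &a);
}

static void take_gil(void)
{
    push_range("Waiting for GIL", 0xFF0000FF);  /* waiting starts before the mutex lock */
    pthread_mutex_lock(&gil_mutex);
    while (gil_locked)                          /* keep waiting until the GIL itself is free */
        pthread_cond_wait(&gil_cond, &gil_mutex);
    gil_locked = 1;                             /* this thread now owns the GIL */
    nvtxDomainRangePop(gil_domain);             /* end "Waiting for GIL" */
    push_range("Holding GIL", 0xFFFF0000);      /* start "Holding GIL" here... */
    pthread_mutex_unlock(&gil_mutex);           /* ...unlocking the mutex does NOT drop the GIL */
}

static void drop_gil(void)
{
    pthread_mutex_lock(&gil_mutex);             /* no range here: this thread still holds the GIL */
    gil_locked = 0;
    nvtxDomainRangePop(gil_domain);             /* end "Holding GIL" where the GIL is released */
    pthread_cond_signal(&gil_cond);
    pthread_mutex_unlock(&gil_mutex);
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100; i++) {
        take_gil();
        usleep(1000);                           /* pretend to run bytecode while holding the GIL */
        drop_gil();
    }
    return NULL;
}

int main(void)
{
    gil_domain = nvtxDomainCreateA("GIL NVTX Domain");
    pthread_t threads[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
    nvtxDomainDestroy(gil_domain);
    return 0;
}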

I see from the screenshot that Nsight Systems’ Python sampling feature is enabled. This feature can give you additional insight for your profiling efforts. For example, if a C extension is holding the GIL for a long time, you’ll see a long period where Python is in an “uninterruptible” state.
Please note that this feature creates an additional Python thread that uses the GIL, so it might be better to disable it while doing your GIL tracing.
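For reference, Python sampling can also be switched off from the nsys command line when you record the GIL trace; something like the following should work (flag spelling from memory, please check nsys profile --help on your version; the report name and script are placeholders):

nsys profile --python-sampling=false -o gil_trace python3 my_app.py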

Thank you for the patch; that’s really helpful! Using it, I’m no longer seeing any overlapping “Holding GIL” messages, which is really promising. If everything is working correctly, my understanding is that the trace below shows the GIL being held on an unscheduled thread for several hundred milliseconds, while most other threads are in a “waiting” state.
I am seeing some overlapping “Waiting for GIL” ranges.
Am I correct in thinking that this is expected behavior?

Yes, it seems that the GIL tracing is working properly. It is reasonable to see multiple threads waiting on the GIL at the same time.

In Python, the default GIL switch interval is 5 ms, so a thread holding the GIL for ~200 ms definitely points to a bottleneck in the application. Also notice that there are no calls to OS runtime functions during the time the GIL is held. It seems like the thread is performing a relatively long CPU-bound operation (at the C level) while holding the GIL and blocking the other threads.
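
If it does turn out to be a C extension doing long pure-C work with the GIL held, the usual remedy is to release the GIL around the section that doesn’t touch Python objects. A hypothetical extension module (names made up for illustration):

#include <Python.h>

/* Hypothetical C-extension function, for illustration only.
 * Py_BEGIN_ALLOW_THREADS releases the GIL and Py_END_ALLOW_THREADS
 * re-acquires it, so other Python threads can run during the pure-C loop.
 * No Python C-API calls are allowed between the two macros. */
static PyObject *heavy_op(PyObject *self, PyObject *args)
{
    long n;
    if (!PyArg_ParseTuple(args, "l", &n))
        return NULL;

    double acc = 0.0;
    Py_BEGIN_ALLOW_THREADS
    for (long i = 1; i <= n; i++)
        acc += 1.0 / (double)i;      /* long CPU-bound work without the GIL */
    Py_END_ALLOW_THREADS

    return PyFloat_FromDouble(acc);
}

static PyMethodDef methods[] = {
    {"heavy_op", heavy_op, METH_VARARGS, "Sum 1/i for i in 1..n without holding the GIL."},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef module = {PyModuleDef_HEAD_INIT, "heavy", NULL, -1, methods};

PyMODINIT_FUNC PyInit_heavy(void) { return PyModule_Create(&module); }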

Thanks!

It seems like the thread is performing a relatively long CPU-bound operation (at the C level) while holding the GIL and blocking the other threads.

I wish this were the case, but it doesn’t seem to be; the thread is unscheduled for a long time and is stuck before it can do anything in C. At this point I think it has something to do with fdatasync calls in other threads: when I stop writing to disk, these regions where an unscheduled thread holds the GIL for a long time disappear. I don’t know why yet, but thank you for helping me trace the GIL!

You are welcome. Good luck!
