Using NVTX to trace Python GIL with Nsight Systems

How can I see when the CPython GIL is held using Nsight Systems?

I’m trying to track down the causes of intermittent latency spikes (200–400 ms) in a multithreaded Python program that mixes CPU-bound work in C extensions with interspersed async network/disk/serial I/O. As part of understanding these spikes, I’d like to see when Python threads are waiting for or holding the CPython global interpreter lock (GIL). I’ve tried modifying CPython’s ceval_gil.h to track GIL usage using NVTX’s nvtxDomainRangePushEx and nvtxDomainRangePop, but my traces look very suspicious – they show many overlapping GIL regions, while I expect only one thread to hold the GIL at a time. As an alternative, I tried naming the GIL mutex using nvtxDomainResourceCreate, but I couldn’t see it anywhere in Nsight Systems.

(Screenshot: unexpected overlapping GIL regions)

What I’ve tried with ranges:

I modified v3.10.4/Python/ceval_gil.h by replacing every call to MUTEX_LOCK(gil->mutex) and MUTEX_UNLOCK(gil->mutex) with GIL_MUTEX_LOCK(gil->mutex) and GIL_MUTEX_UNLOCK(gil->mutex) macros:

static nvtxDomainHandle_t GIL_NVTX_DOMAIN;
static nvtxStringHandle_t GIL_NVTX_WAITING_MSG;
static nvtxStringHandle_t GIL_NVTX_HOLDING_MSG;
#define GIL_MUTEX_LOCK(m) \
    nvtxDomainRangePushEx(GIL_NVTX_DOMAIN, &(nvtxEventAttributes_t){.message={.registered=GIL_NVTX_WAITING_MSG}, .color=0xFF0000FF, .version=NVTX_VERSION, .size=NVTX_EVENT_ATTRIB_STRUCT_SIZE, .messageType = NVTX_MESSAGE_TYPE_REGISTERED, .colorType=NVTX_COLOR_ARGB}); \
    MUTEX_LOCK(m); \
    nvtxDomainRangePop(GIL_NVTX_DOMAIN); \
    nvtxDomainRangePushEx(GIL_NVTX_DOMAIN, &(nvtxEventAttributes_t){.message={.registered=GIL_NVTX_HOLDING_MSG}, .color=0xFFFF0000, .version=NVTX_VERSION, .size=NVTX_EVENT_ATTRIB_STRUCT_SIZE, .messageType = NVTX_MESSAGE_TYPE_REGISTERED, .colorType=NVTX_COLOR_ARGB})
#define GIL_MUTEX_UNLOCK(m) \
    MUTEX_UNLOCK(m); \
    nvtxDomainRangePop(GIL_NVTX_DOMAIN)

static void _gil_initialize(struct _gil_runtime_state *gil)
{
    _Py_atomic_int uninitialized = {-1};
    gil->locked = uninitialized;
    gil->interval = DEFAULT_INTERVAL;
    GIL_NVTX_DOMAIN = nvtxDomainCreate("GIL NVTX Domain");
    GIL_NVTX_WAITING_MSG = nvtxDomainRegisterStringA(GIL_NVTX_DOMAIN, "Waiting for GIL");
    GIL_NVTX_HOLDING_MSG = nvtxDomainRegisterStringA(GIL_NVTX_DOMAIN, "Holding GIL");
}

What I’ve tried with resource naming:

I modified v3.10.4/Python/ceval_gil.h by changing create_gil and destroy_gil:

static nvtxResourceHandle_t GIL_NVTX_HANDLE;
static void create_gil(struct _gil_runtime_state *gil)
{
    MUTEX_INIT(gil->mutex);
    nvtxResourceAttributes_t attribs = {0};
    attribs.version = NVTX_VERSION;
    attribs.size = NVTX_RESOURCE_ATTRIB_STRUCT_SIZE;
    attribs.identifierType = NVTX_RESOURCE_TYPE_GENERIC_POINTER;
    attribs.identifier.pValue = (const void*)(&(gil->mutex));
    attribs.messageType = NVTX_MESSAGE_TYPE_ASCII;
    attribs.message.ascii = "GIL Mutex";
    GIL_NVTX_HANDLE = nvtxDomainResourceCreate(GIL_NVTX_DOMAIN, &attribs);
#ifdef FORCE_SWITCHING
    MUTEX_INIT(gil->switch_mutex);
#endif
    COND_INIT(gil->cond);
#ifdef FORCE_SWITCHING
    COND_INIT(gil->switch_cond);
#endif
    _Py_atomic_store_relaxed(&gil->last_holder, 0);
    _Py_ANNOTATE_RWLOCK_CREATE(&gil->locked);
    _Py_atomic_store_explicit(&gil->locked, 0, _Py_memory_order_release);
}

static void destroy_gil(struct _gil_runtime_state *gil)
{
    nvtxDomainResourceDestroy(GIL_NVTX_HANDLE);
    /* some pthread-like implementations tie the mutex to the cond
     * and must have the cond destroyed first.
     */
    COND_FINI(gil->cond);
    MUTEX_FINI(gil->mutex);
#ifdef FORCE_SWITCHING
    COND_FINI(gil->switch_cond);
    MUTEX_FINI(gil->switch_mutex);
#endif
    _Py_atomic_store_explicit(&gil->locked, -1,
                              _Py_memory_order_release);
    _Py_ANNOTATE_RWLOCK_DESTROY(&gil->locked);
}

What version of Nsight Systems are you using?

Doron’s team has added some additional Python profiling support, some of it shipping in the current 2023.2 release and more coming in the 2023.3 release, which is due to go out this week. I’m going to loop him in for suggestions.

@dofek


I was using nsight-systems-2023.1.1_2023.1.1.127-1_amd64.deb on the system being profiled and NsightSystems-macos-public-2023.1.1.127-3236574.dmg for the UI.

I’ve just switched to NVIDIA Nsight Systems version 2023.3.1.92-233133147223v0 on the system being profiled and 2023.3.1.92-233133147223v0 OSX for the UI. I’m still seeing the same overlapping GIL problem with my attempt at using nvtxDomainRangePushEx and nvtxDomainRangePop.

MUTEX_LOCK and MUTEX_UNLOCK are not the correct places to generate NVTX ranges for your purpose: locking gil->mutex is not the same as holding the GIL, since that mutex only guards the GIL’s internal state (gil->locked and friends).
Notice that inside drop_gil(), MUTEX_LOCK(gil->mutex) is called by the thread that currently holds the GIL, so GIL_MUTEX_LOCK pushes a misleading “Waiting for GIL” range there.
Also, at the end of take_gil(), MUTEX_UNLOCK(gil->mutex) is called, and GIL_MUTEX_UNLOCK pops the NVTX range, but it shouldn’t, because the GIL is still held.

Here is a patch to ceval_gil.h that should provide your desired behavior.
gil_nvtx_trace.diff (2.4 KB)
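
To illustrate the idea (this is a simplified, self-contained sketch of the ordering, not the contents of the attached patch): the “Waiting for GIL” range should cover the whole wait until the thread actually owns the GIL, and the “Holding GIL” range should start at that point and end only where drop_gil() clears gil->locked – not where gil->mutex is unlocked. Here the GIL is modeled by a plain pthread mutex/condvar plus a locked flag; the exact include path and link flags depend on how NVTX is installed on your system:

/* Simplified model of take_gil()/drop_gil(): a mutex + condition variable
 * guarding a "locked" flag that plays the role of the GIL. The NVTX calls
 * show where the ranges belong. Build flags are an assumption and depend on
 * your NVTX install, e.g.: gcc gil_demo.c -I<nvtx include dir> -lpthread -ldl */
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>
#include <nvToolsExt.h>

static pthread_mutex_t gil_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gil_cond  = PTHREAD_COND_INITIALIZER;
static int gil_locked = 0;                      /* stands in for gil->locked */
static nvtxDomainHandle_t gil_domain;

static void push_range(const char *msg, uint32_t color)
{
    nvtxEventAttributes_t a = {0};
    a.version = NVTX_VERSION;
    a.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
    a.messageType = NVTX_MESSAGE_TYPE_ASCII;
    a.message.ascii = msg;
    a.colorType = NVTX_COLOR_ARGB;
    a.color = color;
    nvtxDomainRangePushEx(gil_domain, &a);
}

static void take_gil(void)
{
    push_range("Waiting for GIL", 0xFF0000FF);  /* waiting starts before the mutex lock */
    pthread_mutex_lock(&gil_mutex);
    while (gil_locked)                          /* keep waiting until the GIL itself is free */
        pthread_cond_wait(&gil_cond, &gil_mutex);
    gil_locked = 1;                             /* this thread now owns the GIL */
    nvtxDomainRangePop(gil_domain);             /* end "Waiting for GIL" */
    push_range("Holding GIL", 0xFFFF0000);      /* start "Holding GIL" here... */
    pthread_mutex_unlock(&gil_mutex);           /* ...unlocking the mutex does NOT drop the GIL */
}

static void drop_gil(void)
{
    pthread_mutex_lock(&gil_mutex);             /* no range here: this thread still holds the GIL */
    gil_locked = 0;
    nvtxDomainRangePop(gil_domain);             /* end "Holding GIL" where the GIL is released */
    pthread_cond_signal(&gil_cond);
    pthread_mutex_unlock(&gil_mutex);
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100; i++) {
        take_gil();
        usleep(1000);                           /* pretend to run bytecode while holding the GIL */
        drop_gil();
    }
    return NULL;
}

int main(void)
{
    gil_domain = nvtxDomainCreateA("GIL NVTX Domain");
    pthread_t threads[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
    nvtxDomainDestroy(gil_domain);
    return 0;
}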

I see from the screenshot that Nsight Systems’ Python sampling feature is enabled. This feature can give you additional insight for your profiling efforts. For example, if a C extension is holding the GIL for a long time, you’ll see a long period where Python is in an “uninterruptible” state.
Please note that this feature creates an additional Python thread that uses the GIL, so it might be better to disable it while doing your GIL tracing.
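For reference, Python sampling can also be switched off from the nsys command line when you record the GIL trace; something like the following should work (flag spelling from memory, please check nsys profile --help on your version; the report name and script are placeholders):

nsys profile --python-sampling=false -o gil_trace python3 my_app.py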

Thank you for the patch; that’s really helpful! Using it, I’m no longer seeing any overlapping “Holding GIL” messages, which is really promising. If everything is working correctly, my understanding is that the trace below shows the GIL being held on an unscheduled thread for several hundred milliseconds, while most other threads are in a “waiting” state.
I am seeing some overlapping “Waiting for GIL” ranges.
Am I correct in thinking that this is expected behavior?

Yes, it seems that the GIL tracing is working properly. It is reasonable to see multiple threads waiting on the GIL at the same time.

In Python, the default GIL switch interval is 5 ms, so a thread holding the GIL for ~200 ms definitely points to a bottleneck in the application. Also notice that there are no calls to OS runtime functions during the time the GIL is held. It seems like the thread is performing a relatively long CPU-bound operation (at the C level) while holding the GIL and blocking the other threads.
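
If it does turn out to be a C extension doing long pure-C work with the GIL held, the usual remedy is to release the GIL around the section that doesn’t touch Python objects. A hypothetical extension module (names made up for illustration):

#include <Python.h>

/* Hypothetical C-extension function, for illustration only.
 * Py_BEGIN_ALLOW_THREADS releases the GIL and Py_END_ALLOW_THREADS
 * re-acquires it, so other Python threads can run during the pure-C loop.
 * No Python C-API calls are allowed between the two macros. */
static PyObject *heavy_op(PyObject *self, PyObject *args)
{
    long n;
    if (!PyArg_ParseTuple(args, "l", &n))
        return NULL;

    double acc = 0.0;
    Py_BEGIN_ALLOW_THREADS
    for (long i = 1; i <= n; i++)
        acc += 1.0 / (double)i;      /* long CPU-bound work without the GIL */
    Py_END_ALLOW_THREADS

    return PyFloat_FromDouble(acc);
}

static PyMethodDef methods[] = {
    {"heavy_op", heavy_op, METH_VARARGS, "Sum 1/i for i in 1..n without holding the GIL."},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef module = {PyModuleDef_HEAD_INIT, "heavy", NULL, -1, methods};

PyMODINIT_FUNC PyInit_heavy(void) { return PyModule_Create(&module); }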

Thanks!

It seems like the thread is performing a relatively long CPU-bound operation (at the C level) while holding the GIL and blocking the other threads.

I wish this were the case, but it doesn’t seem to be; the thread is unscheduled for a long time and is stuck before it can do anything in C. At this point I think it has something to do with fdatasync calls in other threads: when I stop writing to disk, these regions where an unscheduled thread holds the GIL for a long time disappear. I don’t know why yet, but thank you for helping me trace the GIL!

You are welcome. Good luck!
