CUDA stalls inexplicably

I have many different CUDA kernels being executed in a code path I am optimizing. I just narrowed down an apparent and significant stall in CUDA, and I was wondering if anyone had any ideas. Initially it looked like a set of cudaFree calls on relatively small arrays (10k or so) was taking almost 6 seconds. But when I commented these calls out, the stall just moved to the next CUDA call. I'm checking errors on every CUDA call; they all come back clean.
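Since kernel launches are asynchronous, my guess is that the time may really belong to an earlier kernel and only shows up at the next call that blocks. Here is a rough sketch of how I could check that: wait for the device before and after each stage and time it on the host. `timeStage`, `StageTime` and `waitForDevice` are names I made up for this post, and the non-CUDA branch is just a stand-in so the snippet also builds with a plain C++ compiler.

```cpp
#include <chrono>
#include <functional>
#include <string>

#ifdef __CUDACC__
#include <cuda_runtime.h>
// Block until all previously issued device work has finished.
// (Return value ignored here for brevity; real code should check it.)
static void waitForDevice() { cudaDeviceSynchronize(); }
#else
// Stand-in (assumption) so the sketch compiles without the CUDA toolkit.
static void waitForDevice() {}
#endif

// Hypothetical record of how long one stage took, in milliseconds.
struct StageTime {
    std::string name;
    double ms;
};

// Hypothetical helper: run one stage and charge it only for its own work.
static StageTime timeStage(const std::string& name,
                           const std::function<void()>& stage) {
    waitForDevice();  // drain earlier work so it is not billed to this stage
    const auto t0 = std::chrono::steady_clock::now();
    stage();          // kernel launches, cudaFree calls, interop calls, ...
    waitForDevice();  // wait for everything this stage queued
    const auto t1 = std::chrono::steady_clock::now();
    return {name, std::chrono::duration<double, std::milli>(t1 - t0).count()};
}

// Usage (placeholder stage body, not my real code):
//   StageTime t = timeStage("stage A", [] { /* launch kernels here */ });
//   std::printf("%s: %.3f ms\n", t.name.c_str(), t.ms);
```

If one stage accounts for nearly all of the 6 seconds when wrapped like this, then the later calls were probably just waiting on it rather than being slow themselves.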

One thing that might be relevant: I am using OpenGL interop calls, and I'm switching to a different GL context (actually an FBO), doing some more CUDA calls in there, and then switching back to the main GL context. This is all in the same thread, so I figured it should not matter.

Does anyone have any ideas on what to try?