Very weird threading behaviour

I’m struggling to see how this could be a CUDA issue, to be honest, but I’ve searched my code and I just can’t see what might be causing it. My application has a main thread and one worker thread per GPU (and at the moment only one GPU). The worker thread waits on an event (which indicates that a work item is ready) and then processes the work item by copying data to GPU memory and launching a kernel. It then sets another event to indicate that the kernel has been launched. The main thread sets up a work item, sets the event to wake up the worker thread, and then waits for the event indicating a kernel launch.

When I measure the time (in the worker thread) taken to launch the kernel, it appears to be almost instantaneous, as expected, since kernels are launched asynchronously. The weird thing is that when I measure the time (in the main thread) between waking up the worker thread and the kernel-launch event being set, it seems to take the entire execution time of the kernel! At first I suspected that another of my threads was getting time-sliced in between, but I think I have eliminated this possibility by commenting out most of the CPU-based computation. Also, the fact that the delay is very highly correlated with the execution time of the kernel seems very suspicious.

I’m vaguely aware that waiting for events on Microsoft Windows requires a kernel-mode transition. Are CUDA and/or the video driver preventing this from happening while the kernel is running? What else might I have missed? I’m using CUDA 2.3 on Windows 7 (64-bit). I’m pretty sure this problem wasn’t present when I originally wrote the code, but that would probably have been under CUDA 1.1 and Windows XP Pro (64-bit). Any help would be very much appreciated.
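For reference, here is a minimal sketch of the setup I’m describing (not my actual code; names like busyKernel, hWorkReady and hKernelLaunched are just placeholders). The worker times only the launch; the main thread times the wake-to-signal gap, which is where I see the full kernel duration:

```cuda
// Minimal sketch of the two-thread pattern (hypothetical names, not the real app).
#include <windows.h>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busyKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 10000; ++k)          // artificial work so the kernel runs a while
            d[i] = d[i] * 1.000001f + 0.5f;
}

static const int N = 1 << 20;
static float *d_buf, h_buf[N];
static HANDLE hWorkReady, hKernelLaunched;

DWORD WINAPI workerThread(LPVOID)
{
    cudaMalloc(&d_buf, N * sizeof(float));
    WaitForSingleObject(hWorkReady, INFINITE);   // wait for a work item

    LARGE_INTEGER f, t0, t1;
    QueryPerformanceFrequency(&f);
    QueryPerformanceCounter(&t0);
    cudaMemcpy(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice);
    busyKernel<<<(N + 255) / 256, 256>>>(d_buf, N);   // asynchronous launch
    QueryPerformanceCounter(&t1);
    printf("worker: launch took %.3f ms\n",
           1000.0 * (t1.QuadPart - t0.QuadPart) / f.QuadPart);

    SetEvent(hKernelLaunched);                   // tell the main thread the kernel is in flight
    cudaThreadSynchronize();                     // CUDA 2.3-era synchronisation call
    return 0;
}

int main()
{
    hWorkReady      = CreateEvent(NULL, FALSE, FALSE, NULL);
    hKernelLaunched = CreateEvent(NULL, FALSE, FALSE, NULL);
    HANDLE hWorker  = CreateThread(NULL, 0, workerThread, NULL, 0, NULL);

    LARGE_INTEGER f, t0, t1;
    QueryPerformanceFrequency(&f);
    QueryPerformanceCounter(&t0);
    SetEvent(hWorkReady);                        // wake the worker
    WaitForSingleObject(hKernelLaunched, INFINITE);   // wait for the "launched" signal
    QueryPerformanceCounter(&t1);
    printf("main: wake-to-launch-signal took %.3f ms\n",
           1000.0 * (t1.QuadPart - t0.QuadPart) / f.QuadPart);

    WaitForSingleObject(hWorker, INFINITE);
    return 0;
}
```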

I think I’ve found the problem, and (of course) the situation wasn’t quite as I explained it. I think that when I originally wrote the code, cudaEventRecord() caused the command queue to be flushed to the device, whereas now the command queue seems to be flushed on the first call to cudaEventQuery(). I was using cudaEventQuery() to effectively perform a cudaThreadSynchronize(), but I’d convinced myself this couldn’t be where the delay was happening because I knew I had long gaps between kernel executions. It turns out the kernel hadn’t even started yet.
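In other words, the polling pattern I was using looks roughly like this (hypothetical sketch, not the real code). If the queued work isn’t submitted to the device until the first cudaEventQuery() call, as I observed, then the whole kernel run gets folded into whatever interval you’re timing around this loop:

```cuda
// Sketch of using cudaEventQuery() as a stand-in for cudaThreadSynchronize().
#include <windows.h>
#include <cuda_runtime.h>

__global__ void dummyKernel() { }

int main()
{
    cudaEvent_t done;
    cudaEventCreate(&done);

    dummyKernel<<<1, 1>>>();
    cudaEventRecord(done, 0);        // queued, but apparently not flushed to the device yet

    // The first cudaEventQuery() is where the queued work actually gets submitted,
    // so the kernel only starts running once this loop begins polling.
    while (cudaEventQuery(done) == cudaErrorNotReady)
        Sleep(1);                    // yield instead of busy-waiting

    cudaEventDestroy(done);
    return 0;
}
```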