QueryPerformanceCounter results of CUBLAS vs. MKL implementations of the same matrix operations

I used this Windows function to time these computations in an application containing two threads (although one of the threads is set to refresh only every 10000 seconds). The CUBLAS times look relatively unstable, with lots of spikes, as shown in the attached illustration. My guess is that switching between threads is interfering with CUDA context access. Does anyone here have a better explanation of what's happening? Thanks in advance.
[Attachment: scaled.bmp (2.07 MB)]
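
For reference, a minimal sketch of the kind of timing I mean (the SGEMM call and the size n are just stand-ins for my actual operations, and error checking is omitted). The cudaDeviceSynchronize() before the second counter read matters, since CUBLAS calls return before the GPU has finished:

#include <windows.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cstdio>

int main()
{
    const int n = 1024;                  // hypothetical matrix dimension
    const float alpha = 1.0f, beta = 0.0f;

    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaDeviceSynchronize();             // wait for the GPU before stopping the clock
    QueryPerformanceCounter(&stop);

    double ms = 1000.0 * (stop.QuadPart - start.QuadPart) / freq.QuadPart;
    printf("SGEMM took %.3f ms\n", ms);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}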

Use SetThreadAffinityMask(GetCurrentThread(), CPU_MASK) to pin the thread to a single CPU. Do this before starting your performance counters. On some multi-core systems the counter is read from a per-core timestamp source, so a thread that migrates between cores can see jumps in QueryPerformanceCounter readings.

So, if you have 2 CPUs, the CPU_MASK for the first thread would be 1 and for the second thread, 2.

If you have 4 CPUs, the CPU_MASK values for the 4 CPU threads could be set to 1, 2, 4, and 8. HTH
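
A rough sketch of what I mean (the mask value 1 here pins the timing thread to CPU 0, which is just an example choice):

#include <windows.h>
#include <cstdio>

int main()
{
    // Bit i of the mask corresponds to logical CPU i, so masks of
    // 1, 2, 4, 8 pin a thread to CPUs 0, 1, 2, 3 respectively.
    DWORD_PTR oldMask = SetThreadAffinityMask(GetCurrentThread(), 1);
    if (oldMask == 0)
        printf("SetThreadAffinityMask failed: %lu\n", GetLastError());

    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    // ... work to be timed goes here ...
    QueryPerformanceCounter(&stop);

    printf("elapsed: %.3f ms\n",
           1000.0 * (stop.QuadPart - start.QuadPart) / freq.QuadPart);
    return 0;
}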