I had been using the Windows “QueryPerformanceCounter()” API to measure performance. However, this API has a drawback on multi-core CPUs: two profiling points in your code may actually execute on different cores. This can give spurious performance numbers (I saw negative times on my AMD dual-core while measuring short durations).
To avoid this, it is a good idea to use “SetThreadAffinityMask” to tie the thread to one CPU (this should hold for the main thread as well).
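A minimal sketch of that idea, in case it helps. The helper names (`pin_current_thread_to_cpu0`, `read_ticks`, `ticks_per_second`, `seconds_for`) are made up for illustration, and pinning to CPU 0 is an arbitrary choice. The non-Windows branch is only an assumption added so the sketch builds elsewhere; it falls back to `std::chrono::steady_clock` and does no pinning.

```cpp
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#else
#include <chrono>   // fallback only, so the sketch also builds off Windows
#endif

// Hypothetical helper: pin the calling thread to logical CPU 0.
// Returns false if the affinity call fails. No-op on non-Windows builds.
inline bool pin_current_thread_to_cpu0() {
#ifdef _WIN32
    // SetThreadAffinityMask returns the previous mask, or 0 on failure.
    return SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1)) != 0;
#else
    return true;
#endif
}

// Counter frequency in ticks per second.
inline std::int64_t ticks_per_second() {
#ifdef _WIN32
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);
    return freq.QuadPart;
#else
    return 1000000000;  // the fallback below counts nanoseconds
#endif
}

// Raw counter value.
inline std::int64_t read_ticks() {
#ifdef _WIN32
    LARGE_INTEGER ticks;
    QueryPerformanceCounter(&ticks);
    return ticks.QuadPart;
#else
    using namespace std::chrono;
    return duration_cast<nanoseconds>(
               steady_clock::now().time_since_epoch()).count();
#endif
}

// Time a callable. Subtracting raw ticks before dividing avoids losing
// precision on large counter values.
template <typename F>
double seconds_for(F&& work) {
    const std::int64_t t0 = read_ticks();
    work();
    const std::int64_t t1 = read_ticks();
    return static_cast<double>(t1 - t0) /
           static_cast<double>(ticks_per_second());
}
```

Call `pin_current_thread_to_cpu0()` once at the start of the thread you are profiling, then wrap each region to measure in `seconds_for(...)`.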
I am not sure whether this applies to the “event” record API from CUDA. Someone said these APIs also use “QueryPerformanceCounter()” internally. Any comments from NVIDIA?
Unless your BIOS is broken, QueryPerformanceCounter should work nicely in multicore environments.
It’s RDTSC that causes problems on multicore, because it reads the clock register of whichever core the thread happens to be running on.
If you are going to use thread affinity, you may actually want to look into RDTSC if you sample often. It is a lot cheaper than QueryPerformanceCounter.
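For reference, a rough sketch of reading the counter through the `__rdtsc()` compiler intrinsic. It assumes an x86 or x86-64 target. The names `read_tsc` and `cycles_for` are made up for illustration. The result is in counter ticks, not seconds, so you would still need to calibrate it against a known clock. RDTSC is also not a serializing instruction, so very short regions can be blurred by out-of-order execution.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>      // __rdtsc on MSVC
#else
#include <x86intrin.h>   // __rdtsc on GCC/Clang (x86 targets only)
#endif

// Read the time-stamp counter of the core this thread is currently on.
inline std::uint64_t read_tsc() {
    return __rdtsc();
}

// Counter ticks elapsed while running `work`. Only meaningful if the
// thread stays on one core for the whole measurement, e.g. because it
// was pinned with SetThreadAffinityMask beforehand.
template <typename F>
std::uint64_t cycles_for(F&& work) {
    const std::uint64_t c0 = read_tsc();
    work();
    const std::uint64_t c1 = read_tsc();
    return c1 - c0;
}
```

Usage would look like `auto c = cycles_for([&] { do_work(); });`, where `do_work` stands for whatever you are profiling.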