High cuCtxSynchronize overhead

Hi,

I have two machines:

  1. CPU: Intel® Core™ i5 CPU 750 @ 2.67GHz (8 MB cache)
    GPU: Tesla C2075
  2. CPU: 2 sockets x Intel® Xeon® CPU X5670 @ 2.93GHz (12 MB cache)
    GPU: 3x Tesla M2070

I wrote the following test to measure synchronization overhead (cuCtxSynchronize, cuEventSynchronize, etc.):

for (int i = 0; i < 10000; i++) {
    cuEventRecord(event[i], 0);
    cuCtxSynchronize();
}

Both the events and the context were created in polling mode (spinning rather than blocking) for minimum latency.
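
For completeness, here is a stripped-down sketch of the kind of test I mean (not the exact code; error checking omitted). The CU_CTX_SCHED_SPIN / CU_EVENT_DEFAULT flags are what I mean by polling, and the cuEventElapsedTime at the end is just one way to average the gap between consecutive events:

#include <cuda.h>
#include <stdio.h>

#define N 10000

int main(void)
{
    CUdevice  dev;
    CUcontext ctx;
    CUevent   ev[N];
    float     ms;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, CU_CTX_SCHED_SPIN, dev);    /* polling (spinning) context */

    for (int i = 0; i < N; i++)
        cuEventCreate(&ev[i], CU_EVENT_DEFAULT);  /* no CU_EVENT_BLOCKING_SYNC  */

    /* The measured loop: record an event, then synchronize the context. */
    for (int i = 0; i < N; i++) {
        cuEventRecord(ev[i], 0);
        cuCtxSynchronize();
    }

    /* Average gap between consecutive events, in microseconds. */
    cuEventElapsedTime(&ms, ev[0], ev[N - 1]);
    printf("avg gap: %.2f us\n", ms * 1000.0f / (N - 1));

    for (int i = 0; i < N; i++)
        cuEventDestroy(ev[i]);
    cuCtxDestroy(ctx);
    return 0;
}

This is built with something like gcc sync_test.c -o sync_test -lcuda and run against a single GPU.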

The weird thing is that on machine 1 the elapsed time between events is ~7 µs, while on machine 2 it is ~27 µs!
The driver versions are the same on both machines. I also tried adding another GPU to machine 1 (thinking the difference might be related to having multiple GPUs), but I saw the same results.

Does anyone have an idea what could cause such a large difference?

Thanks.

More information

/usr/bin/time output:
                                              machine 1   machine 2
User time (seconds)                           0.1         0.1
System time (seconds)                         0.14        0.55
Percent of CPU this job got                   93%         98%
Elapsed (wall clock) time (h:mm:ss or m:ss)   0.26        0.67
Average shared text size (kbytes)             0           0
Average unshared data size (kbytes)           0           0
Average stack size (kbytes)                   0           0
Average total size (kbytes)                   0           0
Maximum resident set size (kbytes)            288704      305776
Average resident set size (kbytes)            0           0
Major (requiring I/O) page faults             0           0
Minor (reclaiming a frame) page faults        15784       15844
Voluntary context switches                    54          909
Involuntary context switches                  13          0
Swaps                                         0           0
File system inputs                            0           0
File system outputs                           0           0
Socket messages sent                          0           0
Socket messages received                      0           0
Signals delivered                             0           0
Page size (bytes)                             4096        4096
Exit status                                   0           0