First of all, I am using an 8600 GT, and …
I am completely confused by the results I get from cutCreateTimer.
I wrote a program that performs a "dot multiplication": each thread performs only one multiplication of two elements from two different matrices. When I multiply, for instance,
matrix A [16 x 16] by B [16 x 1],
there is hardly any difference compared to multiplying
matrix A [16384 x 16] by B [16384 x 1].
In the first case the timer shows 0.293241 ms, in the second 0.300312 ms.
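The kernel is essentially this (a simplified sketch; the name dotMul and the parameters rows/cols are just for illustration, my real code is organized a little differently):

__global__ void dotMul(const float *A, const float *B, float *C, int rows, int cols)
{
    // one thread per element of A; each thread does a single multiplication
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < rows * cols) {
        int row = idx / cols;        // which row of A (and which element of B) this thread works on
        C[idx] = A[idx] * B[row];    // multiply one element of A by the matching element of B
    }
}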
When I checked the overhead of the cutCreateTimer function itself, it was about 0.12 ms, so I assumed that with far fewer than 768 threads the actual kernel time on the card is roughly 0.293241 - 0.12 ms. Since (as far as I know) only 768 threads can run at the same time, I expected that using more than 768 threads would make the execution noticeably longer, and I was surprised that there was hardly any difference whatsoever.
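This is roughly how I time the kernel (a sketch; dA, dB, dC, numBlocks and threadsPerBlock stand for my device pointers and launch configuration):

unsigned int timer = 0;
cutCreateTimer(&timer);              // timer from the SDK's cutil library

cutStartTimer(timer);
dotMul<<<numBlocks, threadsPerBlock>>>(dA, dB, dC, rows, cols);
cudaThreadSynchronize();             // I copied this from the SDK samples - see my question below
cutStopTimer(timer);

float elapsedMs = cutGetTimerValue(timer);   // elapsed time in milliseconds
cutDeleteTimer(timer);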
To sum up, I would like to ask:
what kind of timer do you use in your programs to measure execution time on the device?
what does cudaThreadSynchronize() do?
does anyone have an idea why using more than 768 threads had no influence on the execution time in my case?
Thank you very much for your help.