Execution time variation of cudaMemcpy(DeviceToHost) with small data size (256 bytes)

Hello,

I want to know the execution time of cudaMemcpy(dst, src, 256, cudaMemcpyDeviceToHost),
in particular how it depends on the interval time between successive cudaMemcpy() calls,
so I wrote the short code below and ran it.

Then I got these two strange results:
(1) The execution time varies widely, from 14 us to 414 us.
(2) The execution time depends on the interval time between cudaMemcpy() calls.
When the interval is smaller than 10 ms, the average execution time is around 30 us,
and when it is over 10 ms, it increases to around 230 us.

— RESULT ------------------------------------------------------------------------
"interval": interval time between cudaMemcpy() calls.
"t_avr": average execution time of cudaMemcpy().
"t_min": minimum execution time of cudaMemcpy().
"t_max": maximum execution time of cudaMemcpy().

elapsed_time interval= 0 [us]: t_avr= 0.000022 , t_min= 0.000014 , t_max= 0.000414 [s]
elapsed_time interval= 1000 [us]: t_avr= 0.000028 , t_min= 0.000015 , t_max= 0.000364 [s]
elapsed_time interval= 2000 [us]: t_avr= 0.000021 , t_min= 0.000015 , t_max= 0.000133 [s]
elapsed_time interval= 3000 [us]: t_avr= 0.000025 , t_min= 0.000021 , t_max= 0.000127 [s]
elapsed_time interval= 4000 [us]: t_avr= 0.000021 , t_min= 0.000016 , t_max= 0.000084 [s]
elapsed_time interval= 5000 [us]: t_avr= 0.000018 , t_min= 0.000016 , t_max= 0.000049 [s]
elapsed_time interval= 6000 [us]: t_avr= 0.000022 , t_min= 0.000016 , t_max= 0.000134 [s]
elapsed_time interval= 7000 [us]: t_avr= 0.000026 , t_min= 0.000016 , t_max= 0.000145 [s]
elapsed_time interval= 8000 [us]: t_avr= 0.000026 , t_min= 0.000016 , t_max= 0.000157 [s]
elapsed_time interval= 9000 [us]: t_avr= 0.000028 , t_min= 0.000016 , t_max= 0.000137 [s]
elapsed_time interval= 10000 [us]: t_avr= 0.000229 , t_min= 0.000023 , t_max= 0.000368 [s]
elapsed_time interval= 11000 [us]: t_avr= 0.000236 , t_min= 0.000044 , t_max= 0.000360 [s]
elapsed_time interval= 12000 [us]: t_avr= 0.000241 , t_min= 0.000051 , t_max= 0.000367 [s]
elapsed_time interval= 13000 [us]: t_avr= 0.000241 , t_min= 0.000051 , t_max= 0.000381 [s]
elapsed_time interval= 14000 [us]: t_avr= 0.000240 , t_min= 0.000044 , t_max= 0.000614 [s]

I have no trouble with the longer execution time of about 300 us; that's acceptable.
But I want to know why the execution time varies so widely.
(Or is there a mistake in my source code?)

Could you please explain the reason for the wide variation in execution time?

— my GPU environment ----
CUDA version: 5.0
OS: Linux 2.6.32 (64bit)
GPU: Tesla K20X

#define NUM_SAMPLE (10000)
const unsigned int minsize = 256;
const unsigned int maxsize = 1024 * 1024 * 8;

...
/* Returns, via splittime, the elapsed time since the previous call;
 * laptime carries the previous timestamp between calls. */
void getTime(double &laptime, double &splittime)
{
  struct timespec tp;
  double sec, nsec;
  clock_gettime(CLOCK_MONOTONIC, &tp);
  sec  = tp.tv_sec;
  nsec = tp.tv_nsec;
  splittime = sec + nsec*1e-9 - laptime;
  laptime   = sec + nsec*1e-9;
}
...
inline void
recvPerf(char *dst, char *src, unsigned int sleep_us, unsigned int size_u)
{
   size_t size = (size_t)size_u;
   double t_lap, t_split[NUM_SAMPLE];
   double t_avr=0.0, t_max=0.0, t_min=1e3;

   cudaMemcpy(dst, src, size, cudaMemcpyDeviceToHost); // warm-up (not measured)
   for (int i = 0; i < NUM_SAMPLE; i++) {
      cudaDeviceSynchronize();
      usleep( sleep_us ); // interval time
      getTime(t_lap, t_split[i]); // reset lap timer
      /******************/
      /* Measure Target */
      /******************/
      cudaMemcpy(dst, src, size, cudaMemcpyDeviceToHost);
      cudaDeviceSynchronize();
      getTime(t_lap, t_split[i]); // t_split[i] = D2H copy time
   }

   /* check avr/min/max */
   for (int i=0; i<NUM_SAMPLE; i++) {
      t_avr += t_split[i] / (double)NUM_SAMPLE;
      if (t_split[i] > t_max) t_max = t_split[i];
      if (t_split[i] < t_min) t_min = t_split[i];
   }
   for (int i=0; i<NUM_SAMPLE; i++) {
      if (i < 10 || i >= NUM_SAMPLE-10) { // first and last 10 samples
         printf("# t_split[%4d]= %f\n", i, t_split[i]);
      } else if (t_split[i] >= t_avr + 0.000150) {
         printf("# t_split[%4d]= %f *\n", i, t_split[i]);
      }
   }
   printf("elapsed_time interval= %u [us]: t_avr= %f , t_min= %f , t_max= %f [s]\n",
          sleep_us, t_avr, t_min, t_max);
   fflush(stdout);
}
...

Thanks a lot.
Otapyonn

The launch time is of about the same order.