cudaMemcpy timings vary across iterations

I am trying to measure latency and bandwidth on a CARMA board.

On every run, cudaMemcpy takes a different amount of time. I have tried both of the following approaches:

  1. cudaMemcpyAsync timed with CUDA events
    for (int i = 0; i < iterations; i++)
    {
        // record the start event
        cudaEventRecord(start, 0);

        cudaMemcpyAsync();

        // record the stop event
        cudaEventRecord(end, 0);

        // wait for the stop event, then read the elapsed time (in milliseconds)
        cudaEventSynchronize(end);
        cudaEventElapsedTime(&elapsedtime, start, end);
    }

  2. cudaMemcpy timed with a CPU timer
    for (int i = 0; i < iterations; i++)
    {
        // start CPU timer

        cudaMemcpy();

        // stop CPU timer
        // print the elapsed time
    }
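
For reference, the two loops above can be written out in full roughly as follows. This is only a minimal sketch under stated assumptions: the buffer names (`h_buf`, `d_buf`), the iteration count, the `CHECK` macro, the `spread` helper, the use of pageable host memory and the host-to-device direction are all hypothetical choices for illustration and are not taken from the original test program.

```cuda
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: leave main() with an error if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "%s failed: %s\n", #call,              \
                         cudaGetErrorString(err_));                     \
            return 1;                                                   \
        }                                                               \
    } while (0)

// Hypothetical helper: difference between the largest and smallest sample.
static double spread(const std::vector<double> &us)
{
    auto mm = std::minmax_element(us.begin(), us.end());
    return *mm.second - *mm.first;
}

int main()
{
    const int iterations = 100;            // assumed iteration count
    const size_t nbytes = sizeof(float);   // a single element, as in the latency test

    float h_buf = 1.0f;                    // pageable host memory (assumption)
    float *d_buf = nullptr;
    CHECK(cudaMalloc(&d_buf, nbytes));

    cudaEvent_t start, end;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&end));

    std::vector<double> event_us, cpu_us;

    // Approach 1: cudaMemcpyAsync bracketed by events on the default stream.
    for (int i = 0; i < iterations; i++) {
        float elapsed_ms = 0.0f;
        CHECK(cudaEventRecord(start, 0));
        CHECK(cudaMemcpyAsync(d_buf, &h_buf, nbytes, cudaMemcpyHostToDevice, 0));
        CHECK(cudaEventRecord(end, 0));
        CHECK(cudaEventSynchronize(end));
        CHECK(cudaEventElapsedTime(&elapsed_ms, start, end));
        event_us.push_back(elapsed_ms * 1000.0);   // milliseconds to microseconds
    }

    // Approach 2: cudaMemcpy bracketed by a host clock.
    for (int i = 0; i < iterations; i++) {
        auto t0 = std::chrono::steady_clock::now();
        CHECK(cudaMemcpy(d_buf, &h_buf, nbytes, cudaMemcpyHostToDevice));
        // Added for this sketch (not in the original loop) so the host clock
        // covers the whole transfer.
        CHECK(cudaDeviceSynchronize());
        auto t1 = std::chrono::steady_clock::now();
        cpu_us.push_back(
            std::chrono::duration<double, std::micro>(t1 - t0).count());
    }

    std::printf("events:    max - min = %.1f us\n", spread(event_us));
    std::printf("CPU timer: max - min = %.1f us\n", spread(cpu_us));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(end));
    CHECK(cudaFree(d_buf));
    return 0;
}
```

Assuming the file is saved as `memcpy_timing.cu`, it should build with `nvcc memcpy_timing.cu -o memcpy_timing`. Keeping every sample in a vector, rather than printing inside the timed loop, keeps console I/O out of the measured region.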

When I send a single element to measure latency, the measured time differs on each iteration, ranging from about 150 microseconds to 1000 microseconds.

Why does this variation happen?

I understand this might not be the right forum for CARMA queries, but since this is a CUDA question, I would like to know whether anybody has faced a similar issue.