Calculation of Synchronization Cost

Hello!

I am trying to calculate the minimum synchronization cost for 256 threads per block, varying the number of blocks. The kernel does a very simple global memory write and records the cycle count when the memory operation starts, when synchronization starts, and when synchronization ends. I then compute:

max sync cost = (maximum cycle count recorded when synchronization ends) - (minimum cycle count recorded when synchronization starts)

When I do this, the difference I obtain is very large: 8525002 cycles for 10 blocks of 256 threads. I am running this on a Fermi architecture card (compute capability 2.0). I am not sure where I am going wrong.

Here is my kernel code:
__global__ void global_mem_write_kernel(float* data_dev, unsigned int* start_dev, unsigned int* mid_dev, unsigned int* end_dev) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int start_reg, mid_reg, end_reg;

    // Start time for the memory operation
    asm volatile("mov.u32 %0, %%clock;" : "=r"(start_reg));

    // Global memory write
    data_dev[index] = index;

    // Memory-op end time / synchronization start time
    asm volatile("mov.u32 %0, %%clock;" : "=r"(mid_reg));

    __syncthreads();

    // Synchronization end time
    asm volatile("mov.u32 %0, %%clock;" : "=r"(end_reg));

    start_dev[index] = start_reg;
    mid_dev[index] = mid_reg;
    end_dev[index] = end_reg;
}

On the CPU, I calculate the maximum cost as:

max_sync_cost = end_max - mid_min

where end_max is the maximum of the end timestamps over all threads and mid_min is the minimum of the mid timestamps over all threads.
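
For reference, this is roughly what my host code does (a minimal sketch only; the variable names, missing error checking, and fixed launch configuration are just for illustration, and the kernel above is assumed to be in the same file):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int blocks = 10;
    const int threads = 256;
    const int n = blocks * threads;

    float *data_dev;
    unsigned int *start_dev, *mid_dev, *end_dev;
    cudaMalloc(&data_dev, n * sizeof(float));
    cudaMalloc(&start_dev, n * sizeof(unsigned int));
    cudaMalloc(&mid_dev, n * sizeof(unsigned int));
    cudaMalloc(&end_dev, n * sizeof(unsigned int));

    global_mem_write_kernel<<<blocks, threads>>>(data_dev, start_dev, mid_dev, end_dev);
    cudaDeviceSynchronize();

    // Copy the mid (sync start) and end (sync end) timestamps back to the host
    unsigned int *mid_h = new unsigned int[n];
    unsigned int *end_h = new unsigned int[n];
    cudaMemcpy(mid_h, mid_dev, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaMemcpy(end_h, end_dev, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    // max_sync_cost = end_max - mid_min over all threads in all blocks
    unsigned int mid_min = mid_h[0];
    unsigned int end_max = end_h[0];
    for (int i = 1; i < n; ++i) {
        if (mid_h[i] < mid_min) mid_min = mid_h[i];
        if (end_h[i] > end_max) end_max = end_h[i];
    }
    printf("max sync cost = %u cycles\n", end_max - mid_min);

    delete[] mid_h;
    delete[] end_h;
    cudaFree(data_dev);
    cudaFree(start_dev);
    cudaFree(mid_dev);
    cudaFree(end_dev);
    return 0;
}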

Thanks in advance!