Hello!
I am trying to measure the minimum synchronization cost for 256 threads per block while varying the number of blocks. The kernel does a very simple global memory write and records the cycle count at three points: when the memory operation starts, when synchronization starts, and when synchronization ends. Max sync cost = (max cycle count among the values recorded when sync ends) - (min cycle count among the values recorded when sync starts).
When I do this, the difference I obtain is very large: 8525002 cycles for 10 blocks of 256 threads each, on a Fermi card (compute capability 2.0). I am not sure where I am going wrong.
Here is my kernel code:
__global__ void global_mem_write_kernel(float* data_dev, unsigned int* start_dev, unsigned int* mid_dev, unsigned int* end_dev) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
unsigned int start_reg, mid_reg, end_reg;
//Start time for mem-op
asm volatile("mov.u32 %0, %%clock;" : "=r"(start_reg));
//memory write
data_dev[index] = index;
//Synchronization start time and Mem-op end time
asm volatile("mov.u32 %0, %%clock;" : "=r"(mid_reg));
__syncthreads();
//Synchronization end time
asm volatile("mov.u32 %0, %%clock;" : "=r"(end_reg));
start_dev[index] = start_reg;
mid_dev[index] = mid_reg;
end_dev[index] = end_reg;
}
On the CPU, I am calculating the max cost as:
max_sync_cost = end_max - mid_min
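To be concrete, here is a minimal sketch of that host-side calculation, assuming mid_dev and end_dev have already been copied back into host arrays (the names mid_host, end_host, and the element count n are my own for illustration):

```cpp
#include <algorithm>

// max sync cost = (max of all sync-end cycle counts)
//               - (min of all sync-start cycle counts)
unsigned int max_sync_cost(const unsigned int* mid_host,
                           const unsigned int* end_host, int n) {
    unsigned int mid_min = *std::min_element(mid_host, mid_host + n);
    unsigned int end_max = *std::max_element(end_host, end_host + n);
    return end_max - mid_min;
}
```

This takes the global max/min over all threads from all blocks, which is what produces the large number above.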
Thanks in advance!