Hello, I am trying to understand memory barriers, and I found that:
__syncthreads() is compiled as BAR.SYNC;
__threadfence() is compiled as MEMBAR.SC.GPU;
I also collected some metrics to test this.
My first kernel is launched with dimensions (1, 64):
__global__ void LtsTRequestsApertureDeviceOpMembarKernel0(unsigned int *input, unsigned int *output) {
    input[0] = threadIdx.x;
    __syncthreads();
    output[threadIdx.x] = threadIdx.x + 1;
}
The test results are:
lts__t_requests_aperture_device_op_membar.sum request 1
lts__t_requests_aperture_device_op_membar_lookup_hit.sum request 0
lts__t_requests_aperture_device_op_membar_lookup_miss.sum request 1
Then I changed the kernel to use __threadfence() instead, with the same launch dimensions:
__global__ void LtsTRequestsApertureDeviceOpMembarKernel0(unsigned int *input, unsigned int *output) {
    input[0] = threadIdx.x;
    __threadfence();
    output[threadIdx.x] = threadIdx.x + 1;
}
The results are:
---------------------------------------------------------------------- --------------- ------------------------------
lts__t_requests_aperture_device_op_membar.sum request 2
lts__t_requests_aperture_device_op_membar_lookup_hit.sum request 0
lts__t_requests_aperture_device_op_membar_lookup_miss.sum request 2
---------------------------------------------------------------------- --------------- ------------------------------
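For anyone who wants to reproduce the two measurements, here is a minimal sketch of a host-side harness. It assumes that "(1, 64)" means one block of 64 threads. The shortened kernel names and the helper RunBothKernels are hypothetical names chosen for this sketch, not the names from my actual test file.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Variant A: block-wide execution barrier (compiles to BAR.SYNC, as noted above).
__global__ void MembarKernelSyncthreads(unsigned int *input, unsigned int *output) {
    input[0] = threadIdx.x;
    __syncthreads();
    output[threadIdx.x] = threadIdx.x + 1;
}

// Variant B: device-scope memory fence (compiles to MEMBAR.SC.GPU, as noted above).
__global__ void MembarKernelThreadfence(unsigned int *input, unsigned int *output) {
    input[0] = threadIdx.x;
    __threadfence();
    output[threadIdx.x] = threadIdx.x + 1;
}

// Hypothetical helper: launches both variants with one block of `threads`
// threads (assumed meaning of "(1, 64)") and copies the final output to the host.
// Returns 0 on success, 1 on any CUDA error.
int RunBothKernels(unsigned int *host_out, int threads) {
    unsigned int *input = nullptr;
    unsigned int *output = nullptr;
    if (cudaMalloc(&input, sizeof(unsigned int)) != cudaSuccess) return 1;
    if (cudaMalloc(&output, threads * sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(input);
        return 1;
    }

    MembarKernelSyncthreads<<<1, threads>>>(input, output);
    cudaError_t err = cudaDeviceSynchronize();

    if (err == cudaSuccess) {
        MembarKernelThreadfence<<<1, threads>>>(input, output);
        err = cudaDeviceSynchronize();
    }
    if (err == cudaSuccess) {
        err = cudaMemcpy(host_out, output, threads * sizeof(unsigned int),
                         cudaMemcpyDeviceToHost);
    }

    cudaFree(input);
    cudaFree(output);
    return err == cudaSuccess ? 0 : 1;
}

int main() {
    const int kThreads = 64;
    unsigned int host_out[kThreads];
    if (RunBothKernels(host_out, kThreads) != 0) {
        fprintf(stderr, "CUDA error while running the kernels\n");
        return 1;
    }
    printf("output[0] = %u, output[%d] = %u\n",
           host_out[0], kThreads - 1, host_out[kThreads - 1]);
    return 0;
}
```

Built with nvcc, the binary (called ./membar_test here only as a placeholder) can be profiled with something like ncu --metrics lts__t_requests_aperture_device_op_membar.sum ./membar_test, which reports each kernel launch separately.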
My questions are as follows:
- Why does my first kernel, which uses __syncthreads(), still report 1 request? I expected lts__t_requests_aperture_device_op_membar to count only requests from MEMBAR instructions.
- What is the relationship between a memory barrier and barrier synchronization?