What the relationship between __syncthreads() and __threadfence()

hello, i am trying to understand memory barrier, and i found thar
__syncthreads() is compiled as BAR.SYNC;
__threadfence() is compiled as MEMBAR.SC.GPU;
i also use some metrics to do some tests:
my first kernel dim is (1,64):

__global__ void LtsTRequestsApertureDeviceOpMembarKernel0(unsigned int *input, unsigned int *output) {
     input[0] = threadIdx.x;
    __syncthreads();
    output[threadIdx.x] = threadIdx.x + 1;
}

the test results are:

lts__t_requests_aperture_device_op_membar.sum                                  request                              1
    lts__t_requests_aperture_device_op_membar_lookup_hit.sum                       request                              0
    lts__t_requests_aperture_device_op_membar_lookup_miss.sum                      request                              1

then i change my kernel with same dim:

__global__ void LtsTRequestsApertureDeviceOpMembarKernel0(unsigned int *input, unsigned int *output) {
    input[0] = threadIdx.x;
    __threadfence();
    output[threadIdx.x] = threadIdx.x + 1;
}

the results are :

    ---------------------------------------------------------------------- --------------- ------------------------------
    lts__t_requests_aperture_device_op_membar.sum                                  request                              2
    lts__t_requests_aperture_device_op_membar_lookup_hit.sum                       request                              0
    lts__t_requests_aperture_device_op_membar_lookup_miss.sum                      request                              2
    ---------------------------------------------------------------------- --------------- ------------------------------

my question are as follows :

  1. why my first kernel can still get count of 1 request, lts__t_requests_aperture_device_op_membar should only count instrucyion aboue MEMABR ?
  2. what is the relationship of Memory Barrier and Barrier Synchronization?

they are documented.

thanks for reply, after reading the document, i did some test :

the kernel is:

__global__ void LtsTRequestsApertureDeviceOpMembarKernel(unsigned int *input, unsigned int *output) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    input[index] = index;
    __threadfence_block();
    output[index] = input[index] + 1;
}

i run this kernel with different grid dim: (1,32) (1,256) (1,512) (1,768) (1,1024),
i use ncu to watch LTS request for memory barrier, here is the result:
lts__t_requests_aperture_device_op_membar :1 4 8 12 16
lts__t_requests_aperture_device_op_membar_lookup_hit : 0 0 0 0 0
lts__t_requests_aperture_device_op_membar_lookup_miss:1 4 8 12 16

i found that a block with 1024 threads get 16 request for memory barrier just as the document says: each CTA has sixteen barrier.
however i am wondering why they are all miss,
how should i chang my kernel to get a non zero hit of request for memory barrier

i found your answer in stacko verflow,you say that:

__syncthreads() is a (device-wide) memory fence, It forces any thread that has written the value, to make that value visible. This effectively means, since this is a device-wide memory fence, that the value written at least has populated the L2 cache

Note that there is a subtle distinction here. Other threads, in other blocks, at other points in the code, may or may not see the value written by a different block.

so can i say:
__threadfence_block is a block-wide memory fence, this effectively means, since this is a block-wide memory fence, that the value written at least has populated the L1 cache. In other blocks, at other points in the code, may or may not see the value written by a different block.

__threadfence() is device-wide memory fence, this effectively means, since this is a device-wide memory fence, that the value written at least has populated the L2 cache; In other blocks, at other points in the code, may or may not see the value written by a different block.

__threadfence_system() is system-wide memory fence, this effectively means, since this is a sysmem-wide memory fence, that the value written at least has populated the device memory.

i also found that there is some different between your answer and doc

__threadfence() guarantees that the writing thread will eventually push its data to the L2 (i.e. make it visible).

// Thread 0 makes sure that the incrementation
// of the “count” variable is only performed after
// the partial sum has been written to global memory.
__threadfence();

your answer say push data to L2 but doc say write to global memory.

3 Likes

global memory is a logical space.

L2 is a physical resource

It is two different views of the same activity.

1 Like