What is the relationship between __syncthreads() and __threadfence()?

LibAndLab · February 21, 2022, 1:42am

Hello, I am trying to understand memory barriers, and I found that
__syncthreads() is compiled as BAR.SYNC;
__threadfence() is compiled as MEMBAR.SC.GPU;
I also ran some tests using a few metrics.
My first kernel's launch dimensions are (1,64):

__global__ void LtsTRequestsApertureDeviceOpMembarKernel0(unsigned int *input, unsigned int *output) {
    input[0] = threadIdx.x;
    __syncthreads();
    output[threadIdx.x] = threadIdx.x + 1;
}

the test results are:

    lts__t_requests_aperture_device_op_membar.sum                                  request                              1
    lts__t_requests_aperture_device_op_membar_lookup_hit.sum                       request                              0
    lts__t_requests_aperture_device_op_membar_lookup_miss.sum                      request                              1

Then I changed the kernel, keeping the same launch dimensions:

__global__ void LtsTRequestsApertureDeviceOpMembarKernel0(unsigned int *input, unsigned int *output) {
    input[0] = threadIdx.x;
    __threadfence();
    output[threadIdx.x] = threadIdx.x + 1;
}

the results are :

    ---------------------------------------------------------------------- --------------- ------------------------------
    lts__t_requests_aperture_device_op_membar.sum                                  request                              2
    lts__t_requests_aperture_device_op_membar_lookup_hit.sum                       request                              0
    lts__t_requests_aperture_device_op_membar_lookup_miss.sum                      request                              2
    ---------------------------------------------------------------------- --------------- ------------------------------

My questions are as follows:

Why does my first kernel still report 1 request? Shouldn't lts__t_requests_aperture_device_op_membar only count MEMBAR instructions?
What is the relationship between a memory barrier and barrier synchronization?
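
As a point of reference for the second question, here is a minimal sketch of a case where a barrier is needed and a fence alone would not be enough. __syncthreads() makes every thread of the block wait at the same point and makes prior shared and global writes by those threads visible within the block; __threadfence() only orders the calling thread's own memory accesses and does not make any thread wait. The kernel name reverseBlock, the tile size, and the host harness are hypothetical and chosen only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: reverses 64 values through shared memory.
// Each thread reads a slot written by a different thread, so the block
// needs an execution barrier, not just a memory fence.
__global__ void reverseBlock(const unsigned int *in, unsigned int *out) {
    __shared__ unsigned int tile[64];
    unsigned int t = threadIdx.x;
    tile[t] = in[t];
    __syncthreads();  // wait for all threads; their writes to tile are now visible block-wide
    out[t] = tile[blockDim.x - 1 - t];
}

int main() {
    const int n = 64;
    unsigned int h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = i;

    unsigned int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(unsigned int));
    cudaMalloc(&d_out, n * sizeof(unsigned int));
    cudaMemcpy(d_in, h_in, n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    reverseBlock<<<1, n>>>(d_in, d_out);  // one block of 64 threads, as in the (1,64) test
    cudaMemcpy(h_out, d_out, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (h_out[i] == (unsigned int)(n - 1 - i));
    printf("%s\n", ok ? "reversed correctly" : "mismatch");

    cudaFree(d_in);
    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

Replacing the __syncthreads() call with __threadfence() in this sketch would leave threads free to read tile slots that other warps have not written yet, which is the practical difference between the two.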

Robert_Crovella · February 23, 2022, 8:18pm

they are documented.

LibAndLab · February 24, 2022, 2:06am

Thanks for the reply. After reading the documentation, I ran some tests:

the kernel is:

__global__ void LtsTRequestsApertureDeviceOpMembarKernel(unsigned int *input, unsigned int *output) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    input[index] = index;
    __threadfence_block();
    output[index] = input[index] + 1;
}

I ran this kernel with different launch dimensions: (1,32), (1,256), (1,512), (1,768), (1,1024).
I used ncu to watch the LTS requests for memory barriers. Here are the results, one column per launch:
lts__t_requests_aperture_device_op_membar:             1 4 8 12 16
lts__t_requests_aperture_device_op_membar_lookup_hit:  0 0 0 0 0
lts__t_requests_aperture_device_op_membar_lookup_miss: 1 4 8 12 16

I found that a block with 1024 threads gets 16 requests for memory barriers, just as the documentation says: each CTA has sixteen barriers.
However, I am wondering why they are all misses.
How should I change my kernel to get a non-zero hit count for memory barrier requests?

LibAndLab · February 24, 2022, 3:00am

I found your answer on Stack Overflow, where you say:

__syncthreads() is a (device-wide) memory fence. It forces any thread that has written the value, to make that value visible. This effectively means, since this is a device-wide memory fence, that the value written at least has populated the L2 cache.

Note that there is a subtle distinction here. Other threads, in other blocks, at other points in the code, may or may not see the value written by a different block.

So can I say:
__threadfence_block() is a block-wide memory fence. Since it is block-wide, the value written has at least populated the L1 cache. Other threads, in other blocks, at other points in the code, may or may not see the value written by a different block.

__threadfence() is a device-wide memory fence. Since it is device-wide, the value written has at least populated the L2 cache. Other threads, in other blocks, at other points in the code, may or may not see the value written by a different block.

__threadfence_system() is a system-wide memory fence. Since it is system-wide, the value written has at least populated device memory.
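
To make the device-wide case concrete, here is a minimal sketch of one block handing a value to another block through global memory, with __threadfence() ordering the payload write before the flag write. The names handOff, data, flag and result are hypothetical. The spin loop assumes both blocks are resident on the GPU at the same time, which should hold for a two-block launch but is not something the programming model guarantees in general. The sketch says nothing about which cache level holds the data; it only illustrates the ordering.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical producer/consumer pair: block 0 publishes, block 1 consumes.
// volatile keeps the compiler from caching the loads in registers.
__global__ void handOff(volatile unsigned int *data,
                        volatile unsigned int *flag,
                        unsigned int *result) {
    if (blockIdx.x == 0) {
        *data = 42u;       // payload
        __threadfence();   // payload write is ordered before the flag write, device-wide
        *flag = 1u;        // publish
    } else {
        while (*flag == 0u) { }  // spin until published (assumes block 0 is running)
        __threadfence();         // flag read is ordered before the payload read
        *result = *data;
    }
}

int main() {
    unsigned int *d_buf = nullptr;  // [0] = data, [1] = flag, [2] = result
    cudaMalloc(&d_buf, 3 * sizeof(unsigned int));
    cudaMemset(d_buf, 0, 3 * sizeof(unsigned int));

    handOff<<<2, 1>>>(d_buf, d_buf + 1, d_buf + 2);  // two blocks, one thread each

    unsigned int h_result = 0;
    cudaMemcpy(&h_result, d_buf + 2, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    printf("%s\n", h_result == 42u ? "consumer saw the payload" : "consumer saw stale data");

    cudaFree(d_buf);
    return h_result == 42u ? 0 : 1;
}
```

With __threadfence_block() in place of __threadfence(), the ordering would only be promised to threads of the same block, so this cross-block hand-off would no longer be covered.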

I also found a difference between your answer and the documentation:

__threadfence() guarantees that the writing thread will eventually push its data to the L2 (i.e. make it visible).

// Thread 0 makes sure that the incrementation
// of the "count" variable is only performed after
// the partial sum has been written to global memory.
__threadfence();

Your answer says the data is pushed to L2, but the documentation says it is written to global memory.
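
For context, the documentation comment quoted above comes from a reduction example in the memory fence section of the programming guide. Below is a simplified sketch of that pattern, not the guide's code: each block here has a single thread, and the names sumChunks, partial, total and chunk are hypothetical. Each block writes its partial sum, fences, then takes a ticket with atomicInc; the block that draws the last ticket adds up all the partial sums.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ unsigned int count = 0;  // how many blocks have published a partial sum

// Hypothetical simplified kernel: one thread per block sums one chunk.
__global__ void sumChunks(const float *in, volatile float *partial,
                          float *total, int chunk) {
    float s = 0.0f;
    for (int i = 0; i < chunk; ++i) s += in[blockIdx.x * chunk + i];
    partial[blockIdx.x] = s;

    // The partial sum must be visible device-wide before the ticket is taken.
    __threadfence();

    unsigned int ticket = atomicInc(&count, gridDim.x);
    if (ticket == gridDim.x - 1) {  // this block is the last one to finish
        float t = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; ++b) t += partial[b];  // volatile reads
        *total = t;
        count = 0;  // reset so the kernel can be launched again
    }
}

int main() {
    const int blocks = 8, chunk = 4, n = blocks * chunk;
    float h_in[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in = nullptr, *d_partial = nullptr, *d_total = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_total, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    sumChunks<<<blocks, 1>>>(d_in, d_partial, d_total, chunk);

    float h_total = 0.0f;
    cudaMemcpy(&h_total, d_total, sizeof(float), cudaMemcpyDeviceToHost);
    printf("%s\n", h_total == (float)n ? "total matches" : "total is wrong");

    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_total);
    return h_total == (float)n ? 0 : 1;
}
```

Without the fence, the counter increment could become visible before the partial sum does, and the last block might then read a stale entry of partial.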

Robert_Crovella · February 28, 2022, 2:55pm

Global memory is a logical space.

L2 is a physical resource.

It is two different views of the same activity.
