Difference between __syncthreads() and __threadfence()

hello, i am trying to understand NCU Metrics : lts__t_requests_aperture_device_op_membar, lts__t_requests_aperture_device_op_membar_lookup_hit, lts__t_requests_aperture_device_op_membar_lookup_miss.
my first kernel dim is (1,64):

__global__ void LtsTRequestsApertureDeviceOpMembarKernel0(unsigned int *input, unsigned int *output) {
     input[0] = threadIdx.x;
    __syncthreads();
    output[threadIdx.x] = threadIdx.x + 1;
}

the results are:

    lts__t_requests_aperture_device_op_membar.sum                                  request                              1
    lts__t_requests_aperture_device_op_membar_lookup_hit.sum                       request                              0
    lts__t_requests_aperture_device_op_membar_lookup_miss.sum                      request                              1
    ---------------------------------------------------------------------- --------------- ------------------------------

when i use cuobjdump -sass to get the sass , the sass instruction is BAR.SYNC ;

then i change my kernel with same dim:

__global__ void LtsTRequestsApertureDeviceOpMembarKernel0(unsigned int *input, unsigned int *output) {
    input[0] = threadIdx.x;
    __threadfence();
    output[threadIdx.x] = threadIdx.x + 1;
}

the results are :

    ---------------------------------------------------------------------- --------------- ------------------------------
    lts__t_requests_aperture_device_op_membar.sum                                  request                              2
    lts__t_requests_aperture_device_op_membar_lookup_hit.sum                       request                              0
    lts__t_requests_aperture_device_op_membar_lookup_miss.sum                      request                              2
    ---------------------------------------------------------------------- --------------- ------------------------------

the sass instruction of __threadfence is MEMBAR.SC.GPU ;

the Nvidia document says:
MEMBAR : Memory Barrier
BAR: Barrier Synchronization.

my question are as follows:

  1. why my first kernel can still get count of 1 request?
  2. what is the difference of Memory Barrier and Barrier Synchronization?

Synchronization barriers such as __syncthreads or __syncwarp synchronize the execution of all participating threads, meaning that threads will wait until all participants have reached the synchronization point. They also guarantee that memory accesses made by these threads prior to invoking the barrier are visible to all participants.

Memory barriers do not imply thread execution synchronization, but are solely concerned with ensuring that writes are visible in the scope implied in the instruction (e.g. “GPU” or “SYS”).

Regarding why the metric reports one membar even for the first sample, I would expect this to be the final membar that is usually generated for the end of the kernel, to ensure that everything is written to memory. You can confirm this by checking the entire kernel SASS in the Nsight Compute UI’s Source page.

thanks for reply! your answer is quite helpful

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.