How can I test to see the usefulness of `__syncthreads()`?

The following experimental listing is written to test the usefulness of `__syncthreads()`.

I am not finding any difference with or without calling that function.

Why is that? What modifications should I make to observe the difference properly?

#include <stdio.h>
#include <cuda.h>

#define N 10

__global__ void addKernel(int *a, int *b, int *c)
{
  int threadId = threadIdx.x;

  // Load input values
  int va = a[threadId];
  int vb = b[threadId];

  // Add the values
  int sum = va + vb;

  // Synchronize threads before writing output
  __syncthreads();

  // Write output
  c[threadId] = sum;
}

int main()
{
  int ha[N], hb[N], hc[N];

  // Initialize input data on host
  for(int i=0; i<N; i++) {
    ha[i] = i;
    hb[i] = i;
  }

  int *da, *db, *dc;

  // Allocate device memory
  cudaMalloc((void**)&da, N*sizeof(int));
  cudaMalloc((void**)&db, N*sizeof(int));
  cudaMalloc((void**)&dc, N*sizeof(int));

  // Copy inputs to device
  cudaMemcpy(da, ha, N*sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(db, hb, N*sizeof(int), cudaMemcpyHostToDevice);

  // Launch kernel
  dim3 block(N);
  addKernel<<<1, block>>>(da, db, dc);

  // Synchronize and copy result to host
  cudaMemcpy(hc, dc, N*sizeof(int), cudaMemcpyDeviceToHost);

  // Print results
  for(int i=0; i<N; i++)
    printf("hc[%d] = %d\n", i, hc[i]);

  // Free device memory
  cudaFree(da);
  cudaFree(db);
  cudaFree(dc);

  return 0;
}

The work that you have here is independent among threads. You will get the right answer (each thread will produce the right answer) independent of the order of thread execution, because no thread’s work depends on the work done by any other thread.

Therefore, `__syncthreads()`, which is an execution barrier, will have no effect on your code. Roughly speaking, `__syncthreads()` says "all threads in the block must reach this point before any thread is allowed to proceed beyond this point". Having that in your code (or not) won't affect the output, because thread ordering does not affect the output of your code.

A typical place that `__syncthreads()` is useful is when there is shared memory communication: some threads writing to shared memory, and other threads reading those values. When one thread writes a value to shared memory and another thread reads that value from shared memory, that is a form of inter-thread communication. To make such communication work, it's usually necessary to enforce an ordering, so that the thread doing the writing actually does the writing before the other thread is allowed to read the value. Hopefully it is obvious why this is needed for correctness. Since CUDA doesn't provide any intrinsic thread ordering, you must impose an ordering yourself to make this kind of communication work.

Here is a trivial example: 256 threads that are reversing a vector. If the input is:

0 1 2 3 4 … 255

then the desired output is:

255 254 253 … 0

Try commenting out the `__syncthreads()` statement:

#include <iostream>

const int N=256;  // must be 1024 or less
__global__ void reverse(int *d){

  __shared__ int s[N];
  s[threadIdx.x] = d[threadIdx.x];
  __syncthreads();   // all writes to s must finish before any thread reads it
  d[threadIdx.x] = s[N-threadIdx.x-1];
}

int main(){

  int *d;
  cudaMallocManaged(&d, N*sizeof(d[0]));
  for (int i = 0; i < N; i++) d[i] = i;
  for (int i = 0; i < N; i++) std::cout << d[i] << " ";
  std::cout << std::endl;
  reverse<<<1, N>>>(d);
  cudaDeviceSynchronize();
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) std::cout << cudaGetErrorString(err);
  else for (int i = 0; i < N; i++) std::cout << d[i] << " ";
  std::cout << std::endl;
  cudaFree(d);
  return 0;
}

As an additional experiment, try commenting out the __syncthreads(), but reduce N to 32. See if you can explain the behavior. Note that lockstep execution is a possible behavior for a CUDA GPU, not a guaranteed one. Therefore the code is not technically correct without the __syncthreads() regardless of the value of N. Remember, CUDA doesn’t guarantee an order of thread execution.
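If you want an example where the dependence between threads is even more aggressive than in the reverse kernel, a shared-memory sum reduction works well. The sketch below is my own illustration (not from the original post): every step reads partial sums written by other threads in the previous step, so deleting either `__syncthreads()` will typically (though, per the lockstep caveat above, not necessarily) produce a wrong total once the block is larger than a warp.

```cuda
#include <cstdio>

const int N = 256;  // block size; a power of two, at most 1024

__global__ void blockSum(const int *in, int *out){
  __shared__ int s[N];
  int t = threadIdx.x;
  s[t] = in[t];
  __syncthreads();                   // all loads complete before any reads
  for (int stride = N/2; stride > 0; stride >>= 1){
    if (t < stride) s[t] += s[t + stride];
    __syncthreads();                 // partial sums visible before next step
  }
  if (t == 0) *out = s[0];
}

int main(){
  int *in, *out;
  cudaMallocManaged(&in, N*sizeof(int));
  cudaMallocManaged(&out, sizeof(int));
  for (int i = 0; i < N; i++) in[i] = 1;
  blockSum<<<1, N>>>(in, out);
  cudaDeviceSynchronize();
  printf("sum = %d (expected %d)\n", *out, N);
  cudaFree(in);
  cudaFree(out);
  return 0;
}
```

Note that the barriers sit outside the `if (t < stride)` body: `__syncthreads()` must be reached by every thread in the block, so calling it inside divergent conditional code is itself an error.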
