How can I test to see the usefulness of `__syncthreads()`?

user366312 · August 12, 2023, 5:47pm

The following experimental listing is written to test the usefulness of __syncthreads().

I am not finding any difference with or without calling that function.

Why is that? What modifications should I make to observe the difference properly?

```cuda
#include <stdio.h>
#include <cuda.h>

#define N 10

__global__ void addKernel(int *a, int *b, int *c)
{
  int threadId = threadIdx.x;

  // Load input values
  int va = a[threadId];
  int vb = b[threadId];

  // Add the values
  int sum = va + vb;

  // Synchronize threads before writing output
  __syncthreads();

  // Write output
  c[threadId] = sum;
}


int main()
{
  int ha[N], hb[N], hc[N];

  // Initialize input data on host
  for(int i=0; i<N; i++) {
    ha[i] = i;
    hb[i] = i;
  }

  int *da, *db, *dc;

  // Allocate device memory
  cudaMalloc((void**)&da, N*sizeof(int));
  cudaMalloc((void**)&db, N*sizeof(int));
  cudaMalloc((void**)&dc, N*sizeof(int));

  // Copy inputs to device
  cudaMemcpy(da, ha, N*sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(db, hb, N*sizeof(int), cudaMemcpyHostToDevice);

  // Launch kernel
  dim3 block(N);
  addKernel<<<1, block>>>(da, db, dc);

  // Synchronize and copy result to host
  cudaDeviceSynchronize();
  cudaMemcpy(hc, dc, N*sizeof(int), cudaMemcpyDeviceToHost);

  // Print results
  for(int i=0; i<N; i++)
    printf("hc[%d] = %d\n", i, hc[i]);

  // Free device memory
  cudaFree(da);
  cudaFree(db);
  cudaFree(dc);

  return 0;
}
```

Robert_Crovella · August 12, 2023, 6:09pm

The work that you have here is independent among threads. You will get the right answer (each thread will produce the right answer) independent of the order of thread execution, because no thread’s work depends on the work done by any other thread.

Therefore, __syncthreads(), which is an execution barrier, will have no effect on the output of your code. Roughly speaking, __syncthreads() says “all threads in the block must reach this point before any thread is allowed to proceed beyond this point”. Having that in your code (or not) won’t affect the output, because thread ordering does not affect the output of your code.

A typical place where __syncthreads() is useful is when there is shared memory communication: some threads writing to shared memory, and other threads reading those values. When one thread writes a value to shared memory and another thread reads that value from shared memory, that is a form of inter-thread communication. To make such communication work, it’s usually necessary to enforce an ordering such that the thread that is doing the writing actually does the writing before the other thread is allowed to read the value. Without that ordering, the reader may pick up whatever happened to be in shared memory before the write landed. Since CUDA doesn’t provide any intrinsic thread ordering, you must impose thread ordering to make this kind of communication work.
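As a rough illustration of that write-then-read pattern, here is a minimal sketch in which every thread publishes one value to shared memory and thread 0 then reads all of them. The names `kThreads`, `blockSum`, and `runBlockSum` are made up for this sketch, and the block size of 128 is an arbitrary assumption.

```cuda
#include <cuda_runtime.h>

constexpr int kThreads = 128;  // hypothetical block size for this sketch

// Each thread writes one value to shared memory; thread 0 then reads them all.
__global__ void blockSum(const int *in, int *out)
{
  __shared__ int s[kThreads];
  s[threadIdx.x] = in[threadIdx.x];  // write phase
  __syncthreads();                   // every write must land before any read below
  if (threadIdx.x == 0) {            // read phase: thread 0 consumes other threads' values
    int total = 0;
    for (int i = 0; i < kThreads; i++) total += s[i];
    *out = total;
  }
}

// Hypothetical host helper: fills the input with 1..kThreads and returns the sum
// computed by the block, or -1 if a CUDA call fails.
int runBlockSum()
{
  int *in = nullptr, *out = nullptr;
  if (cudaMallocManaged(&in, kThreads * sizeof(int)) != cudaSuccess) return -1;
  if (cudaMallocManaged(&out, sizeof(int)) != cudaSuccess) { cudaFree(in); return -1; }
  for (int i = 0; i < kThreads; i++) in[i] = i + 1;
  *out = 0;
  blockSum<<<1, kThreads>>>(in, out);
  int result = (cudaDeviceSynchronize() == cudaSuccess) ? *out : -1;
  cudaFree(in);
  cudaFree(out);
  return result;
}
```

With the barrier in place, `runBlockSum()` should return 8256 (the sum of 1 through 128). Without it, thread 0 could start summing before the other warps have written their values, so the result would not be reliable.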

Here is a trivial example: 256 threads that are reversing a vector. If the input is:

0 1 2 3 4 … 255

then the desired output is:

255 254 253 … 0

Try commenting out the syncthreads statement:

```cuda
#include <iostream>

const int N=256;  // must be 1024 or less
__global__ void reverse(int *d){

  __shared__ int s[N];
  s[threadIdx.x] = d[threadIdx.x];
  __syncthreads();
  d[threadIdx.x] = s[N-threadIdx.x-1];
}

int main(){

  int *d;
  cudaMallocManaged(&d, N*sizeof(d[0]));
  for (int i = 0; i < N; i++) d[i] = i;
  for (int i = 0; i < N; i++) std::cout << d[i] << " ";
  std::cout << std::endl;
  reverse<<<1,N>>>(d);
  cudaDeviceSynchronize();
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) std::cout << cudaGetErrorString(err);
  else for (int i = 0; i < N; i++) std::cout << d[i] << " ";
  std::cout << std::endl;
}
```

As an additional experiment, try commenting out the __syncthreads(), but reduce N to 32. See if you can explain the behavior. Note that lockstep execution is a possible behavior for a CUDA GPU, not a guaranteed one. Therefore the code is not technically correct without the __syncthreads() regardless of the value of N. Remember, CUDA doesn’t guarantee an order of thread execution.
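If reading 256 numbers by eye gets tedious, one possible way to run these experiments is to count the wrong elements on the host. The sketch below is an assumption-laden variant of the reversal code: the template flag `UseBarrier` and the names `kN`, `reverseT`, and `countMismatches` are hypothetical, introduced only to toggle the barrier without editing the kernel.

```cuda
#include <cuda_runtime.h>

constexpr int kN = 256;  // block size for this sketch; must be 1024 or less

// Same reversal as above; the (hypothetical) template flag switches the barrier on or off.
template <bool UseBarrier>
__global__ void reverseT(int *d)
{
  __shared__ int s[kN];
  s[threadIdx.x] = d[threadIdx.x];
  if (UseBarrier) __syncthreads();  // compile-time constant, so all threads take the same path
  d[threadIdx.x] = s[kN - threadIdx.x - 1];
}

// Returns the number of elements that are not in reversed position,
// or -1 if a CUDA call fails.
template <bool UseBarrier>
int countMismatches()
{
  int *d = nullptr;
  if (cudaMallocManaged(&d, kN * sizeof(int)) != cudaSuccess) return -1;
  for (int i = 0; i < kN; i++) d[i] = i;
  reverseT<UseBarrier><<<1, kN>>>(d);
  int bad = 0;
  if (cudaDeviceSynchronize() != cudaSuccess) {
    bad = -1;
  } else {
    for (int i = 0; i < kN; i++)
      if (d[i] != kN - 1 - i) bad++;  // expected reversed value at position i
  }
  cudaFree(d);
  return bad;
}
```

Calling `countMismatches<true>()` should return 0. Calling `countMismatches<false>()` may return a nonzero count, but because the behavior without the barrier is undefined, it may also return 0 on some GPUs or on some runs. A clean result there does not make the barrier-free version correct.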
