Why is this CUDA program ignoring the if-else guard?

The following experimental program is supposed to print the block and grid dimensions only once.

Once the variables isBlockDimPrinted and isGridDimPrinted are set to true by one thread, the other threads should no longer find them set to false.

If each thread executes its own copy of the kernel code, why are the threads not honoring the if conditions?

#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdio.h>


// flags in global (device) memory, shared by every thread in the grid
__device__ bool isBlockDimPrinted = false;
__device__ bool isGridDimPrinted = false;


__global__ void MatMulKernel()
{
  if(!isBlockDimPrinted){
     printf("Block Dim (%d, %d, %d), ", blockDim.x, blockDim.y, blockDim.z);
     //printf("\n");
     isBlockDimPrinted = true;
  }

  if(!isGridDimPrinted){
     printf("Grid Dim (%d, %d, %d), ", gridDim.x, gridDim.y, gridDim.z);
     //printf("\n");
     isGridDimPrinted = true;
  }

  printf("BlockIdx (%d, %d, %d)\t", blockIdx.x, blockIdx.y, blockIdx.z);
  //printf("\n");
  printf("threadIdx (%d, %d, %d)\t", threadIdx.x, threadIdx.y, threadIdx.z);
  //printf("\n");
}


int main()
{
    dim3 dimBlock(1,2,3);    // 6 threads per block
    dim3 dimGrid(3, 2, 1);   // 6 blocks, 36 threads in total

    MatMulKernel<<<dimGrid, dimBlock>>>();

    cudaDeviceSynchronize();   // wait for the kernel to finish so its printf output appears

    return 0;
}

However, the output is unexpected: blockDim and gridDim are printed multiple times:

user_name@192:~/CUDA_lab$ nano thread_id_test_matrix.cu
user_name@192:~/CUDA_lab$ nvcc thread_id_test_matrix.cu -o exe
user_name@192:~/CUDA_lab$ ./exe
Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Block Dim (1, 2, 3), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), Grid Dim (3, 2, 1), BlockIdx (1, 1, 0)  BlockIdx (1, 1, 0)      BlockIdx (1, 1, 0)      BlockIdx (1, 1, 0)      BlockIdx (1, 1, 0)      BlockIdx (1, 1, 0)      BlockIdx (2, 0, 0)      BlockIdx (2, 0, 0)      BlockIdx (2, 0, 0)      BlockIdx (2, 0, 0)      BlockIdx (2, 0, 0)      BlockIdx (2, 0, 0)      BlockIdx (0, 0, 0)      BlockIdx (0, 0, 0)      BlockIdx (0, 0, 0)      BlockIdx (0, 0, 0)      BlockIdx (0, 0, 0)      BlockIdx (0, 0, 0)      BlockIdx (1, 0, 0)      BlockIdx (1, 0, 0)      BlockIdx (1, 0, 0)      BlockIdx (1, 0, 0)      BlockIdx (1, 0, 0)      BlockIdx (1, 0, 0)      BlockIdx (0, 1, 0)      BlockIdx (0, 1, 0)      BlockIdx (0, 1, 0)      BlockIdx (0, 1, 0)      BlockIdx (0, 1, 0)      BlockIdx (0, 1, 0)      BlockIdx (2, 1, 0)      BlockIdx (2, 1, 0)      BlockIdx (2, 1, 0)      BlockIdx (2, 1, 0)      BlockIdx (2, 1, 0)      BlockIdx (2, 1, 0)      threadIdx (0, 0, 0)     threadIdx (0, 1, 0)     threadIdx (0, 0, 1)     threadIdx (0, 1, 1)     threadIdx (0, 0, 2)     threadIdx (0, 1, 2)     threadIdx (0, 0, 0)     threadIdx (0, 1, 0)     threadIdx (0, 0, 1)     threadIdx (0, 1, 1)     threadIdx (0, 0, 2)     threadIdx (0, 1, 2)     threadIdx (0, 0, 0)     threadIdx (0, 1, 0)     threadIdx (0, 0, 1)     threadIdx (0, 1, 1)     threadIdx (0, 0, 2)     threadIdx (0, 1, 2)     threadIdx (0, 0, 0)     threadIdx (0, 1, 0)     threadIdx (0, 0, 1)     threadIdx (0, 1, 1)     threadIdx (0, 0, 2)     threadIdx (0, 1, 2)     threadIdx (0, 0, 0)     threadIdx (0, 1, 0)     threadIdx (0, 0, 1)     threadIdx (0, 1, 1)     threadIdx (0, 0, 2)     threadIdx (0, 1, 2)     threadIdx (0, 0, 0)     threadIdx (0, 1, 0)     threadIdx (0, 0, 1)     threadIdx (0, 1, 1)     threadIdx (0, 0, 2)     threadIdx (0, 1, 2)     

user_name@192:~/CUDA_lab$

Why is that happening?

How can I fix this?

What happens when thread M prints the dimensions, then gets switched out before being able to set the flag, allowing thread N, which has just started executing, to run?

One of the fundamental architectural features of GPUs is zero-overhead switching between threads. As soon as a thread stalls for any reason (e.g., a memory access), execution switches to an unblocked thread.

Is it possible to do it without using atomic operations?

A possible execution mode for CUDA threads is lockstep.

That means you don’t just have 1 thread executing this line of code:

  if(!isBlockDimPrinted){

You have 32 threads doing it at the same time, in lockstep, cycle by cycle. If we decompose that if-statement at the machine level (which is where lockstep execution is actually defined: at the SASS level), the first thing that will happen is that all threads in the warp will read the global location indicated by isBlockDimPrinted. Then all threads in the warp will test the value against zero or one (or however the machine encodes the boolean state) and set another variable to that boolean result. Then those threads will decide whether or not to run the body of the if-statement based on that boolean outcome. All of this happens in lockstep for the 32 threads in a warp.

So if one thread in the warp reads that boolean value as false, they all will, because they all read that variable at precisely the same clock cycle. In that clock cycle the value was either true or false, and that state was observed by all 32 threads (in that warp).
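For what it's worth, you can observe this warp-wide agreement from CUDA code itself. The following small kernel is my own illustration (the kernel name and printed wording are mine, not part of the program above): each lane reports what it read from isBlockDimPrinted through a warp vote, and the resulting bitmask shows whether the active lanes of the warp agreed.

__global__ void ObserveFlag()
{
  // every lane reads the flag; __ballot_sync collects one bit per participating lane
  bool seen = isBlockDimPrinted;
  unsigned ballot = __ballot_sync(__activemask(), seen);

  // linear thread index within the block, used to pick one reporter per warp
  int tid = threadIdx.x
          + threadIdx.y * blockDim.x
          + threadIdx.z * blockDim.x * blockDim.y;

  if (tid % warpSize == 0)   // one report per warp
     printf("lanes that observed the flag as true: 0x%08x\n", ballot);
}

With the launch configuration above each block has only 6 threads, so at most the low 6 bits can be set; the point is that you would typically see either none or all of the active lanes set, not a mixture.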

This is the nature of lockstep execution. It is not the most evident part of the CUDA programming model, formally speaking, but it certainly exists.

When you use atomics, you break this pattern. An atomic forces a read-modify-write operation to serialize all the way down to the thread level. It disrupts typical warp execution in this way.
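To make that concrete, here is a sketch of how the flag could be claimed with an atomic so that exactly one thread in the whole grid performs the print. The kernel and variable names (PrintDimsOnce, printLock) are mine, and the flag is an int because atomicCAS operates on integer types; the per-thread BlockIdx/threadIdx prints from the original kernel could be added back unchanged.

#include <cstdio>

__device__ int printLock = 0;   // 0 = not printed yet, 1 = already claimed

__global__ void PrintDimsOnce()
{
  // atomicCAS returns the value that was in printLock before the swap,
  // so exactly one thread in the grid observes 0 and takes the branch
  if (atomicCAS(&printLock, 0, 1) == 0) {
     printf("Block Dim (%d, %d, %d), Grid Dim (%d, %d, %d)\n",
            blockDim.x, blockDim.y, blockDim.z,
            gridDim.x, gridDim.y, gridDim.z);
  }
}

int main()
{
    dim3 dimBlock(1,2,3);
    dim3 dimGrid(3, 2, 1);

    PrintDimsOnce<<<dimGrid, dimBlock>>>();

    cudaDeviceSynchronize();

    return 0;
}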

Are there ways to do it without atomics? At the warp level, yes. You could have the threads in a warp read the variable, and then have a scheme to allow one thread to “win” in the event of the same value being read by all threads. There are various warp vote, warp shuffle, and cooperative groups primitives that might allow you to do something at the warp level.
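As a sketch of the warp-level idea (my own illustration; the kernel name is made up): let the lowest-numbered active lane of each warp "win" and do the printing, with no atomics involved.

__global__ void PrintOncePerWarp()
{
  // linear thread index within the block; threads are packed into warps
  // in this order, so (tid % warpSize) is the lane index within the warp
  int tid  = threadIdx.x
           + threadIdx.y * blockDim.x
           + threadIdx.z * blockDim.x * blockDim.y;
  int lane = tid % warpSize;

  unsigned mask = __activemask();          // lanes executing this code together
  int leader    = __ffs((int)mask) - 1;    // lowest-numbered active lane wins

  if (lane == leader)
     printf("Block Dim (%d, %d, %d), Grid Dim (%d, %d, %d)\n",
            blockDim.x, blockDim.y, blockDim.z,
            gridDim.x, gridDim.y, gridDim.z);
}

Note the scope: with the launch configuration above, each block is a single partially-filled warp, so this prints once per block (6 times), not once for the whole grid. That is exactly why going beyond warp granularity needs additional machinery.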

Breaking the pattern without atomics requires communication of some sort. So at a granularity above the warp level, you would need either something like shared memory to communicate amongst the threads in a threadblock, which gets you up to block level, or, again, some cooperative groups primitives, to enable this perhaps all the way up to the device-wide level.
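Here is a sketch of the shared-memory variant (again my own illustration, not from the original program): every thread in the block writes its id to the same shared location; one of those writes survives (which one is unspecified), and after the barrier the whole block agrees on the winner, so exactly one thread per block prints, still without atomics.

__global__ void PrintOncePerBlock()
{
  __shared__ int winner;   // the block communicates through this location

  int tid = threadIdx.x
          + threadIdx.y * blockDim.x
          + threadIdx.z * blockDim.x * blockDim.y;

  winner = tid;            // every thread volunteers; exactly one write survives
  __syncthreads();         // after the barrier all threads see the same winner

  if (tid == winner)       // only the winning thread of this block prints
     printf("Block Dim (%d, %d, %d), Grid Dim (%d, %d, %d)\n",
            blockDim.x, blockDim.y, blockDim.z,
            gridDim.x, gridDim.y, gridDim.z);
}

Extending this to a single print for the entire grid would require grid-wide communication (for example a cooperative groups grid synchronization), which is where the cost grows with granularity, as noted below.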

I think if you did any of these, you would discover that they are “expensive” mechanisms, and the expense would increase as you increase the granularity of communication/cooperation. But atomics could be expensive as well, at some level of granularity.

You might wish to get an orderly introduction to CUDA; this is one series I usually recommend. The programming guide also covers some of these execution concepts in some detail. In fact, the first 5-7 sections of the programming guide constitute an orderly introduction to basic CUDA concepts, presented in a fairly readable way.

Thanks for the answer.

I accepted it.

However, this answer is too technical for me to understand.

In that case, you might wish to go back to njuffa’s answer. It is sufficient and, because it is short, probably more elegant. At a detailed level there are multiple reasons why two threads could read the same value from the same location given your code, and njuffa’s answer covers cases that mine does not. Furthermore, if we don’t demand a detailed explanation, it already provides a complete one.

CUDA specifies no particular order of thread execution, and this is a good thing for all CUDA programmers to internalize. That means that two threads, M and N, can execute in such a way that M executes fully before N, fully after N, or with any sort of overlap that you can imagine (including the lockstep case I suggested). If you apply that mental model, you should be able to reason why your code could produce the output you have described. There is no guarantee in CUDA that an if-statement (including its body) will be fully executed by a thread, before any other thread begins to execute that same if-statement.
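If the goal is simply to have the dimensions appear exactly once, the simplest escape from all of this (a minimal sketch of my own, not something discussed above) is to decide up front which thread prints, instead of letting the threads race to discover it at run time, for example thread (0,0,0) of block (0,0,0):

__global__ void MatMulKernel()
{
  // one thread is chosen statically, so no flags and no race are involved
  if (blockIdx.x == 0 && blockIdx.y == 0 && blockIdx.z == 0 &&
      threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0)
  {
     printf("Block Dim (%d, %d, %d), ", blockDim.x, blockDim.y, blockDim.z);
     printf("Grid Dim (%d, %d, %d)\n", gridDim.x, gridDim.y, gridDim.z);
  }

  printf("BlockIdx (%d, %d, %d)\t", blockIdx.x, blockIdx.y, blockIdx.z);
  printf("threadIdx (%d, %d, %d)\t", threadIdx.x, threadIdx.y, threadIdx.z);
}

No thread has to observe what another thread did, so the ordering question never arises.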
