A question about the calculatePartialSum sample code in the CUDA C Programming Guide

There is a sample code showing how to use a thread fence to calculate the sum of an array. I have a question about the following code: since no implementation is provided for this function, any further explanation is appreciated.

__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;
__global__ void sum(const float* array, unsigned int N,
volatile float* result)
{
   // Each block sums a subset of the input array.
   float partialSum = calculatePartialSum(array, N);
   ....

Here, I think partialSum is a per-thread variable that resides in a register. So how do the threads in a block compute the array’s sum and write it into a register? Shouldn’t it be a variable defined in shared memory?
I can’t find sample code for this function. Can anybody provide its implementation?

This type of technique is sometimes referred to as a threadFence reduction or a block-draining reduction. A corresponding (complete) sample code is here.
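
For context, here is a from-memory sketch of how that complete sample proceeds (see the link for the exact code; calculateTotalSum() is, like calculatePartialSum(), a helper left undefined): thread 0 of each block publishes its block's partial sum to global memory, __threadfence() makes that write visible to other blocks before the atomic counter is bumped, and whichever block increments the counter last sums the partial results:

// Continues the snippet quoted above; count and isLastBlockDone are the
// variables declared before the kernel.
__global__ void sum(const float* array, unsigned int N, volatile float* result)
{
    // Each block sums a subset of the input array.
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {
        // Thread 0 of each block stores the block's partial sum.
        result[blockIdx.x] = partialSum;

        // Ensure the partial-sum write is visible to all blocks before
        // signalling completion through the counter.
        __threadfence();

        // atomicInc() returns the old value; the block that observes
        // gridDim.x - 1 is the last one to finish.
        unsigned int value = atomicInc(&count, gridDim.x);
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Make isLastBlockDone visible to every thread in the block.
    __syncthreads();

    if (isLastBlockDone) {
        // The last block sums the per-block partial results.
        float totalSum = calculateTotalSum(result);
        if (threadIdx.x == 0) {
            result[0] = totalSum;
            count = 0;  // reset for the next launch
        }
    }
}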

The general intra-block reduction strategy (what calculatePartialSum() does) need not be directly connected to the threadFence method, which is why a fully fleshed-out example is not provided in the programming guide. You can quickly learn how to write the code for it yourself using canonical material such as here.

However, if you prefer, we can connect the two examples by saying that calculatePartialSum() is approximately equivalent to the reduceBlock() code here. To increase the similarity, we could also posit modifying the reduceBlock() prototype as follows:

__device__ float reduceBlock(volatile float *sdata, float mySum,
                        const unsigned int tid, cg::thread_block cta) {

and at the end of that function we would include the following statement:

return sdata[0];
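
With that modification in mind, here is a minimal sketch of what calculatePartialSum() could look like, assuming a classic shared-memory tree reduction; the grid-stride accumulation loop, the fixed 1024-element shared array, and the power-of-two blockDim.x are my own assumptions, not something taken from the programming guide:

__device__ float calculatePartialSum(const float* array, unsigned int N)
{
    // Shared scratch space for the block-wide reduction (assumes blockDim.x <= 1024).
    __shared__ float sdata[1024];
    unsigned int tid = threadIdx.x;

    // Each thread accumulates a grid-stride portion of the input.
    float mySum = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < N;
         i += gridDim.x * blockDim.x)
        mySum += array[i];

    sdata[tid] = mySum;
    __syncthreads();

    // Tree reduction in shared memory (assumes blockDim.x is a power of two).
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // After the final __syncthreads(), every thread can return the block-wide sum.
    return sdata[0];
}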

Thank you for providing this helpful material. I have a question about the unroll optimization in this material, as follows:

__device__ void warpReduce(volatile int* sdata, int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

// later…
for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}
if (tid < 32) warpReduce(sdata, tid);

In the step of unrolling the last 6 iterations (stride <= 32), the material says that because this part is executed by only one warp, we don’t need the "if (tid < s)" check.
My question is: although all threads in one warp execute simultaneously (they share a common PC), consider the following code:

sdata[tid] += sdata[tid + 1];

When tid #0 and tid #1 execute this code, I think "sdata[0] += sdata[1]" and "sdata[1] += sdata[2]" will happen simultaneously. Is it then possible that we get a wrong value for sdata[0], because sdata[1] is being changed at the same time?

The material I linked is pretty old. The current recommendation is to not use warp-synchronous programming methods. If you look at the current CUDA samples you will find equivalent reduction material that does not employ warp-synchronous methods.
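
As a point of reference, here is a minimal sketch of how the last-warp reduction is typically written today, using the explicitly synchronizing shuffle intrinsics instead of implicit lockstep behavior. This is my own illustrative variant rather than the exact code from the CUDA samples; tid and sdata refer to the variables in the snippet you quoted, and it assumes blockDim.x >= 64 with all 32 lanes of the first warp active:

if (tid < 32) {
    // Fold the second half-warp's values in, then reduce within the warp.
    int mySum = sdata[tid] + sdata[tid + 32];

    // Warp-level tree reduction; __shfl_down_sync() names its participants
    // explicitly, so no lockstep assumption is needed.
    for (int offset = 16; offset > 0; offset >>= 1)
        mySum += __shfl_down_sync(0xffffffff, mySum, offset);

    if (tid == 0) sdata[0] = mySum;
}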

So why was it suggested in the past? How would it have worked?

To unpack your question, we need to start by understanding that C++ is generally a compiled language. The machine does not execute C++ source code directly. To show what the machine would be executing for a single step in the “unrolled warp-synchronous area”:

sdata[tid] += sdata[tid + 1];

It would look something like this:

LD R0, tid  // it would actually be from another register or sequence of special registers
LDS  R1, [R0]   
LD R2, tid+1 // again this is shorthand for what actually happens
LDS R2, [R2]
FADD R3, R1, R2
STS  [R0], R3

Typically, a CUDA warp is executing in lockstep. That means when one thread in the warp is executing LDS R1, [R0] they all are. And likewise for each subsequent instruction. Therefore, to address your question, there is no ambiguity. First, every thread in the warp reads the value from sdata[tid] into a register. Then every thread in the warp reads the value from sdata[tid+1] into a register. Then the two values are added. The register that contains the add result is then stored back to the sdata[tid] location. So the lockstep execution behavior would resolve any ambiguity, when expressed instruction-by-instruction at the machine code level.

We generally don’t teach this type of thinking anymore, especially with the Volta execution model in view. So the above should not be considered proper programming style, any longer.

The CUDA compiler team was always consistent in recommending against implicitly warp-synchronous programming methods, as these were never officially sanctioned by the CUDA programming model as specified in the CUDA documentation; they simply exploited an implementation artifact.

FWIW, I was always in full agreement with the compiler team’s recommendation, as relying on implementation artifacts is generally bound to break; it is just a question of when that happens. But since the warp-synchronous approach offered a performance advantage, its use became widespread in the early days of CUDA, including in some materials produced by NVIDIA themselves.

In this case, the changes NVIDIA made to the GPU hardware execution model with the Volta architecture invalidated assumptions made by implicitly warp-synchronous programming methods.

I read this blog, and now I am confused about warp-synchronous programming.

Q1: In this blog, the author says "Such implicit warp-synchronous programming is unsafe and may not work correctly" and gives the following example code:

  1 // Assuming all 32 threads in a warp execute line 1 together.
  2 assert(__ballot(1) == FULL_MASK);
  3 int result;
  4 if (thread_id % 2) {
  5     result = foo();
  6 }
  7 else {
  8     result = bar();
  9 }
 10 unsigned ballot_result = __ballot(result);

It said that calling the new __syncwarp() primitive at line 10 before __ballot() does not fix the problem either. The reason it gave is: "It assumes that threads in the same warp that are once synchronized will stay synchronized until the next thread-divergent branch. Although it is often true, it is not guaranteed in the CUDA programming model".
Then how should I understand this statement, that the CUDA programming model does not guarantee a synchronized state will persist, once established, until the next thread-divergent branch? For this case, if I insert a __syncwarp(), I think all threads in the warp will be convergent; is it possible for them to be divergent inside the __ballot() function?

Q2: There is another sample code as follows:

unsigned tid = threadIdx.x;

// Incorrect use of __syncwarp()
shmem[tid] += shmem[tid+16]; __syncwarp();
shmem[tid] += shmem[tid+8];  __syncwarp();
shmem[tid] += shmem[tid+4];  __syncwarp();
shmem[tid] += shmem[tid+2];  __syncwarp();
shmem[tid] += shmem[tid+1];  __syncwarp();

It said: "There is a shared memory read followed by a shared memory write between every two __syncwarp() calls. The CUDA programming model does not guarantee that all the reads will be performed before all the writes, so there is a race condition."
How can a write happen before a read? I am confused now; any further explanation is very much appreciated!

It’s admittedly difficult to understand in practice. In my view, the way to think about it is not as a practical experiment, trying to understand how a warp may diverge when there is no reason to do so. Instead, it’s best to accept it as a principle, because it promotes the right way to think about how to program in CUDA. The CUDA programming model does not guarantee it, and CUDA can be made to work in an acceptable fashion without that expectation. As has already been indicated, we are dealing with a difference between the programming model and the implementation. In the implementations we have today, it’s not possible (for me) to propose a situation in which a warp diverges for no reason.

Yes, it’s possible to be divergent in a ballot function, and the __ballot_sync() primitive was introduced to remove the ambiguity or uncertainty that a proper treatment of the programming model would create with the “old” __ballot() intrinsic.
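
Regarding Q1, here is a minimal sketch of the recommended fix as I read it (hedged: foo(), bar() and FULL_MASK are the placeholders from the blog's snippet, defined here only to make the example self-contained): instead of synchronizing and then hoping the warp stays converged, __ballot_sync() names its participants explicitly and performs the synchronization and the ballot as one operation:

#define FULL_MASK 0xffffffffu

__device__ int foo() { return 1; }  // placeholder, as in the blog's example
__device__ int bar() { return 0; }  // placeholder, as in the blog's example

__global__ void ballot_example()
{
    int thread_id = threadIdx.x;
    int result;
    if (thread_id % 2) {
        result = foo();
    } else {
        result = bar();
    }
    // __ballot_sync() both converges the lanes named in the mask and
    // evaluates the ballot, so no assumption about reconvergence is needed.
    unsigned ballot_result = __ballot_sync(FULL_MASK, result);
    (void)ballot_result;  // silence unused-variable warnings in this sketch
}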

To interpret this, we would go back to the example I already gave, and posit the opposite, that the warp is not in lockstep. It is exactly the concern you asked about here:

Let’s go back to my example:

To show what the machine would be executing for a single step in the “unrolled warp-synchronous area”:

sdata[tid] += sdata[tid + 1];

It would look something like this:

LD R0, tid  // it would actually be from another register or sequence of special registers
LDS  R1, [R0]   
LD R2, tid+1 // again this is shorthand for what actually happens
LDS R2, [R2]
FADD R3, R1, R2
STS  [R0], R3

Now, we will suppose that the warp is not perfectly in lockstep: considering any two threads in the warp, the execution order can be anything we want it to be. Let’s take thread 0, which reads shared locations 0 and 1, adds them together, and stores the result at location 0, and thread 1, which reads locations 1 and 2, adds them together, and stores the result at location 1. So thread 0 is reading location 1, and thread 1 is both reading and writing location 1. If the order of execution between thread 0 and thread 1 can be anything (a fundamental premise of the CUDA programming model), then thread 1 could execute completely before thread 0. In that case, the addition of locations 1 and 2, and the store to location 1, have already happened by the time thread 0 begins to execute, so thread 0 reads a value from location 1 that is potentially not the same as what it would have read in the lockstep case, or in other orderings we could posit (such as thread 0 executing completely before thread 1, rather than completely after it). In that ordering, the write to location 1 (by thread 1) happened before the read of location 1 (by thread 0). That is an outcome that probably nobody would intend when writing a sweep-style parallel reduction.
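
For completeness, here is a sketch of the corresponding race-free pattern (written from memory, not copied from the blog): each step reads the partner value into a register, synchronizes, and only then writes it back, so that no thread’s read can race with another thread’s write:

unsigned tid = threadIdx.x;
int v = shmem[tid];

// The read phase and write phase of every step are separated by __syncwarp(),
// which is what prevents the read/write race described above.
v += shmem[tid + 16]; __syncwarp();
shmem[tid] = v;       __syncwarp();
v += shmem[tid + 8];  __syncwarp();
shmem[tid] = v;       __syncwarp();
v += shmem[tid + 4];  __syncwarp();
shmem[tid] = v;       __syncwarp();
v += shmem[tid + 2];  __syncwarp();
shmem[tid] = v;       __syncwarp();
v += shmem[tid + 1];  __syncwarp();
shmem[tid] = v;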

I know that from the Volta architecture on, every thread in a warp has its own PC. Here, do you mean that even if I do not write any branch code (if/else statement) explicitly, I should not assume that all 32 threads in a warp execute in lockstep? For example, thread #0~#15 might execute in lockstep and then thread #16~#31 might be scheduled afterwards, right?
If so, why is it designed like that? Things become more sophisticated. Is it designed this way for better performance, or because there are not enough hardware resources to execute all 32 threads of a warp simultaneously?

Yes, that is what I am saying. You are not supposed to make that assumption, based on the CUDA programming model.

I’ve tried to address this already and apparently failed. You’ll need to separate in your thought process the idea of an abstract model from a physical implementation. The abstract model does not claim that threads will remain converged. But based on the physical implementation, I am not able to offer specific reasons why they would diverge when there is no reason to do so. I won’t be able to go any farther than that, and won’t be able to respond to further rephrasings of that question.

Later: If you are asking why the Volta execution model was introduced, it is well described here.

Here are some additional tangentially related items: 1 2 3

Thanks for your patient and helpful explanation. To summarize: the CUDA programming model guarantees that all active threads in a warp execute in lockstep (the SIMT principle), but it does not guarantee that all threads (typically 32) execute in lockstep; the order of execution of each thread is not guaranteed, so we should use warp-level sync primitives if we want to synchronize within a warp. Correct me if I am wrong.

No, I don’t think it makes that statement either (that active threads in a warp are guaranteed to execute in lockstep).

The last part (use warp-level sync primitives when you want to synchronize within a warp) is the main idea.

Embarrassing, :), but I still want to make it clear. Do you mean that active threads in a warp may be executing different instructions? For example, at a specific point in time, if there are 4 active threads, we might find that t0 and t1 are executing instruction A while t2 and t3 are executing instruction B, and all of them still obey the SIMT rule, right?

No I don’t. Once again, we need to separate a discussion of the CUDA programming model from physical implementations. I don’t think the programming model makes a statement about whether two threads in the same warp may be executing different instructions in the same clock cycle. But I don’t know of any physical implementation (a currently available CUDA GPU) that would issue two separate instructions, to two separate threads, in the same warp, in the same cycle.

I see, there are two different views: hardware implementation and programming model.

But back to what you said:

It is still curious to me what situation could make that divergence happen, even when there is no explicit branch code, :)