CUDA warp-synchronous programming question

Hello everybody.

I have the question concerning the warp-synchronous programming in the presence of the global-to-shared memory transfers. Let’s consider the following piece of code:

struct KDTreeNode {
    int splitAxis;
    float splitPos;
    int trIndLo, trIndHi;
    int left, right;
    int reserved[10];
};

...

int tid = threadIdx.x;
int wid = tid >> 5;
volatile __shared__ int NTraversed[NUMBER_OF_WARPS];
volatile __shared__ KDTreeNode nodes[NUMBER_OF_WARPS];

...

if ((tid & 31) < 16)
    ((int *)&nodes)[(wid << 4) + (tid & 31)] = dev_kdTreeGlobal[(NTraversed[wid] << 4) + (tid & 31)];

// __threadfence();
// __threadfence_block();
// __syncthreads();

while (nodes[wid].splitAxis != LEAF) {
    ...
}

The above mentioned code snippset is obviously an example of the warp-synchronous programming. As we read in the “NVIDIA CUDA Programming Guide”, warps are executing physically in parallel. My question is: can I make an assumption, that the coalesced read performed by all threads of the half-warp in the lines 18, 19 is visible from the point of view of the condition in the line 25, where we refer to the field “splitAxis” or maybe one have to uncomment one of the synchronization functions (lines: 21, 22, 23) to ensure the code correctness?

Thanks in advance.

In line 19, nodes is already a pointer. Why would you take the address of nodes and turn that into an (int *) pointer? When you dereference that once, as you are doing (), as an example, warp 0 thread 0 will attempt to overwrite the nodes pointer itself (treating it as an int quantity).

Yes, line 25 for a given warp would be able to “see” the results of lines 18/19 executed by that same warp, without any additional synchronization. Line 25 of a different warp will not necessarily see the results of lines 18/19. However since lines 18/19 are (somehow??) updating some object of nodes that is not nodes[wid], I don’t see the point of your question.

Your code just looks broken to me, in particular line 19. In fact, assuming dev_kdTreeGlobal is not a pointer to int, I would expect the compiler to throw a warning or error on that line. Have you compiled some variant of this code?

Thank You for Your reply. As for the line 19, You’re right - it’s my mistake. I hurried up when I was writing this post.

As far as the code is concerned, yes, it’s is the part of the ray tracer beeing the part of my MA thesis. I use such kind a pointers for convenience - (int *) to perform the coalesced global memory access and (KDTreeNode *) to manipulate with the field of the structure, that has been just read from the global memory.

By the way, it seems that I have discovered the very strange error, that appears during the transfer from the texture to the shared memory with the volatile modifier. In the spite of the warp-synchronous programming, there was an infinite loop until I put the proper __threadfence_block() instruction.