Warp synchronous programming

If dynamically allocated shared memory of a block is partitioned and assigned to warps of that block,

  1. can a member thread of a warp correctly read data written by another thread of the same warp without synchronization?


It is not guaranteed, unless the pointer through which the shared memory is accessed is declared with the "volatile" qualifier. Without volatile, the compiler is free to keep the value in a register, so the reading thread may see stale data even within the same warp.


regarding this:

“If dynamically allocated shared memory of a block is partitioned and assigned to warps of that block,”

there is no such concept. There is no mechanism to explicitly “partition” shared memory, and shared memory cannot be “assigned” to warps. It is shared by all threads within a block.

Thanks for your kind reply. I used the volatile qualifier for accesses to both global and shared memory that are modified by threads (possibly threads from the same warp, the same block, or any thread in the grid); I even used the volatile qualifier for memory locations that are only modified with atomics (CAS, Exch).

My program is completely warp-centric, so no synchronization among warps is necessary, but I still get errors that occur rarely and non-deterministically.

Is there any clue ?
Please rescue me :(

try running your code with cuda-memcheck

study the documentation for cuda-memcheck


and also use the racecheck tool: cuda-memcheck --tool racecheck ./my_app

I just partition the shared memory assigned to the block logically: each warp within the block accesses a specific region of shared memory determined by the warp's id (local_thread_id / 32), which is the same for all threads of the warp.

    int thread_id = (threadIdx.x + blockIdx.x * blockDim.x)
                  + (threadIdx.y + blockIdx.y * blockDim.y) * blockDim.x * gridDim.x;

    int local_thread_id = threadIdx.x + threadIdx.y * blockDim.x;

    int space_for_each_warp = 128;
    int w_OFF = local_thread_id % 32; // lane of this thread within its warp
                                      // (note: local_thread_id, not the global
                                      // thread_id, gives the lane for any block shape)

    extern __shared__ int SMEM[]; // dynamically allocated shared memory of the block

    int *my_part = SMEM + (local_thread_id / 32) * space_for_each_warp;
    volatile int *memory_of_current_warp = my_part;

Say each warp is assigned 128 tasks; its 32 threads process them in parallel and collect the ids of failed tasks in the region of shared memory conceptually assigned to that warp.

    template<int W_SZ>
    __device__ void collectUnsucc(int W_OFF, int isfinished, int task_to_repeat,
                                  volatile int *memory_of_current_warp,
                                  int *num_unsuccessful_task_original,
                                  int space_for_each_warp)
    {
        // isfinished is either 0 or 1
        volatile int *num_unsuccessful_case = num_unsuccessful_task_original; // volatile reference

        // warp-wide inclusive prefix sum of (1 - isfinished)
        int prefixsum = 1 - isfinished;
        for (int i = 1; i <= W_SZ / 2; i *= 2) {
            int lowerLaneValue = __shfl_up(prefixsum, i, W_SZ);
            if (W_OFF >= i)
                prefixsum += lowerLaneValue;
        }

        // read the running count once, before any lane updates it
        int already_collected = *num_unsuccessful_case;
        if (isfinished == FALSE) {
            // pack failed task ids from the top of the warp's region downwards
            memory_of_current_warp[(space_for_each_warp - 1) - already_collected - (prefixsum - 1)] = task_to_repeat;
        }
        // let a single lane add the warp total; lane W_SZ-1 already holds it in
        // prefixsum. Having every failed lane do this read-modify-write, as the
        // original code did, is a race and a likely source of the rare errors.
        if (W_OFF == W_SZ - 1)
            *num_unsuccessful_case += prefixsum;
    }