Problem with accessing device variables from the GPU cores of multiple GPU blocks in compute capability 2.1

In the following code segment, I was trying to make all the threads exit the infinite loop, where only thread 0 of block 0 updates *end to 32 (*end initially gets its value from blockDim.x, i.e. *end=16). But the program gets stuck in the infinite loop. I think the threads of block 1 are not seeing the updated value, so they stay stuck in the loop. But the CUDA documentation clearly states that anything in device memory is visible to all threads of all blocks.

I am not sure whether there are multiple copies of *end for the two blocks. Or is there any other solution so that all the threads of the different blocks can exit this infinite loop when *end is updated by a thread of only one block?

I also tried this program using __device__ variables, where the __device__ qualifier was used for start and end, but the same thing happens.

Below is the description of my devices.

There are 2 devices supporting CUDA

Device 0: “GeForce GTX 560 Ti”

CUDA Driver Version: 4.0

CUDA Capability Major/Minor version number: 2.1

Device 1: “Quadro 600”

CUDA Driver Version: 4.0

CUDA Capability Major/Minor version number: 2.1

But this program ran fine on the following device and did not get stuck in the infinite loop:

Device 0: “GeForce 320M”

CUDA Driver Version: 4.0

CUDA Capability Major/Minor version number: 1.2

Here is the code. Can anyone please give me a solution?

#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>

#define N 32

__global__ void test_program(int *start, int *end, int *first_element, int *last_element)
{
    *start = 0;
    *end = blockDim.x;   // *end gets the initial value 16

    int idx;
    int u_idx;

    while (1) {
        u_idx = blockIdx.x * blockDim.x + threadIdx.x;

        // only threads 0 to 15 of blockId 0 can satisfy this condition
        if (u_idx >= *start && u_idx < *end) {
            idx = threadIdx.x;
            // only thread 0 of blockId 0 updates *end to 32
            if (idx == 0) {
                *end = 32;
            }
        }

        __syncthreads();

        // exit condition
        if (*end == 32) {
            break;
        }
    }
}

int main(void)
{
    float elapsed_time_ms = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int *start_d;
    int *end_d;
    int *first_element;
    int *last_element;

    cudaMalloc((void**)&start_d, sizeof(int));
    cudaMalloc((void**)&end_d, sizeof(int));
    cudaMalloc((void**)&first_element, sizeof(int));
    cudaMalloc((void**)&last_element, sizeof(int));

    int nBlocks = 2;
    int blockSize = 16;

    cudaEventRecord(start, 0);
    // execution configuration: 2 blocks of threads, each block has 16 threads
    test_program<<<nBlocks, blockSize>>>(start_d, end_d, first_element, last_element);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_ms, start, stop);
    printf("time %f\n", elapsed_time_ms);

    cudaFree(start_d);
    cudaFree(end_d);
    cudaFree(first_element);
    cudaFree(last_element);

    return 0;
}

It would seem to me that it’s possible, since so little work is being done here, that all the threads in block 0 could finish before block 1 is ever scheduled. If that’s true, then *end would have a value of 32 when block 1 is scheduled, but block 1’s threads would set it back to 16. Then they would be stuck in an infinite loop because thread 0 in block 0 has already exited. On the GeForce 320M, the scheduling is probably done differently, so you see a different result. The __syncthreads() call is supposed to sync threads within a block, not between blocks, so that lends some credence to the idea that block 0 exits before block 1 is scheduled. I certainly don’t know the internals of scheduling, just offering this as a possible explanation.

Thanks Bunny Fair for your reply.

I want the threads of block 0 to set *end=32 and then the threads of block 1 to see the updated value and exit the loop. I just need something so that the update (*end=32) made by one block of threads is visible to the threads of the other blocks, so that they can exit the loop.

The only suggestions I have are to initialize *end to blockDim.x outside of the kernel, so that the threads aren’t all clobbering it, and, if the compiler allows it, to declare *end as volatile. Volatile might not be needed since *end is a pointer which doesn’t change. But I can’t tell whether the dereferenced value is copied into a local register or not, so if it is, volatile could help you avoid that.
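Roughly, something along these lines (untested, just a sketch of that idea: I dropped the unused first_element/last_element parameters, reused the start_d/end_d allocations from your main(), and added a __threadfence() after the write, which may or may not be necessary):

__global__ void test_program(volatile int *start, volatile int *end)
{
    int u_idx = blockIdx.x * blockDim.x + threadIdx.x;

    while (1) {
        // *start and *end are NOT initialized here; the host sets them before launch,
        // so a block that starts late cannot reset *end back to 16
        if (u_idx >= *start && u_idx < *end) {
            // with *start=0 and *end=16 only block 0 reaches this point,
            // so this branch is taken by thread 0 of block 0
            if (threadIdx.x == 0) {
                *end = 32;          // publish the new bound
                __threadfence();    // make the write visible to the other block
            }
        }
        __syncthreads();
        if (*end == 32) {           // volatile: re-read from memory every iteration
            break;
        }
    }
}

// host side, before the launch
int h_start = 0, h_end = 16;        // 16 = blockSize
cudaMemcpy(start_d, &h_start, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(end_d, &h_end, sizeof(int), cudaMemcpyHostToDevice);
test_program<<<2, 16>>>(start_d, end_d);

The point is that the kernel no longer writes *start and *end at the top, and the volatile qualifier forces the loop to actually go back to memory instead of reusing a value held in a register.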

Your solution will not work. There is no guarantee on the scheduling order.
If you write code that assumes a specific order of block execution, you may end up with deadlocks.
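For example, here is a contrived sketch (hypothetical code, not from this thread, and it assumes flags[] is zero-initialized with cudaMemset before the launch) of the kind of inter-block dependency that deadlocks: each block spins until a later block raises a flag, so if more blocks are launched than can be resident on the device at once, the resident blocks wait forever for blocks that cannot be scheduled until they themselves finish.

__global__ void wait_for_next_block(volatile int *flags, int nBlocks)
{
    if (threadIdx.x == 0) {
        if (blockIdx.x + 1 < nBlocks) {
            // spin until the NEXT block signals -- this assumes that block
            // will eventually run concurrently with this one
            while (flags[blockIdx.x + 1] == 0) { }
        }
        flags[blockIdx.x] = 1;   // signal that this block has passed its wait
    }
}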

I think I don’t need to worry about the scheduling because of the “if” condition.

u_idx = blockIdx.x * blockDim.x + threadIdx.x;

// only threads 0 to 15 of blockId 0 can satisfy this condition
if (u_idx >= *start && u_idx < *end) {

This if condition only allows threads 0 to 15 of block 0 to enter the “if”. Threads 0 to 15 of block 1 can only enter it once *start equals the old *end (16) and *end equals 32.

    idx = threadIdx.x;
    // only thread 0 of blockId 0 updates *start and *end
    if (idx == 0) {
        *start = *end;
        *end = 32;
    }

This code runs fine and does not get stuck in the infinite loop on the following device:

Device 0: “GeForce 320M”

CUDA Driver Version: 4.0

CUDA Capability Major/Minor version number: 1.2

But it does get stuck in the infinite loop with the following devices.

Below is the description of my devices.

There are 2 devices supporting CUDA

Device 0: “GeForce GTX 560 Ti”

CUDA Driver Version: 4.0

CUDA Capability Major/Minor version number: 2.1

Device 1: “Quadro 600”

CUDA Driver Version: 4.0

CUDA Capability Major/Minor version number: 2.1

I also tried to implement the same thing using __device__ variables, and it works fine on the following device; it did not get stuck in the infinite loop.

Device 0: “GeForce 320M”

CUDA Driver Version: 4.0

CUDA Capability Major/Minor version number: 1.2

Below is the kernel code using the __device__ qualifier.

#include <cuda.h>
#include <cuda_runtime.h>

#define N 32

__device__ int start = 0;
__device__ int end;

__global__ void parallel_scan()
{
    start = 0;
    end = blockDim.x;

    int idx;
    int u_idx;

    while (1) {
        u_idx = blockIdx.x * blockDim.x + threadIdx.x;

        if (u_idx >= start && u_idx < end) {
            idx = threadIdx.x;
            if (idx == 0) {
                start = end;
                end = 32;
            }
        }

        __syncthreads();

        if (end == 32) {
            break;
        }
    }
}

But again, it does not work and gets stuck in the infinite loop on the following devices:

Device 0: “GeForce GTX 560 Ti”

CUDA Driver Version: 4.0

CUDA Capability Major/Minor version number: 2.1

Device 1: “Quadro 600”

CUDA Driver Version: 4.0

CUDA Capability Major/Minor version number: 2.1

My question is: why does this code get stuck in an infinite loop on compute capability 2.1? Is there only one copy of "start" and "end" for the entire grid, or are there multiple copies of "start" and "end" for each individual block of GPU threads? Is there any alternative solution to do this?
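One alternative I have been considering (an assumption on my part, not a confirmed fix): qualify the __device__ variables as volatile, so that every read goes back to global memory instead of a register or the L1 cache that compute capability 2.x devices use for global loads, and add __threadfence() after the writes. As far as I understand there is still only one copy of start and end for the entire grid; the problem would be that without volatile the compiler is free to keep the loaded value in a register, so the spinning threads never re-read it. Even so, spinning on a value written by another block is outside what the CUDA execution model guarantees, so this can only work if both blocks happen to be resident at the same time.

#include <cuda.h>
#include <cuda_runtime.h>

#define N 32

// one copy in global memory for the whole grid; volatile forces every access
// to go to memory instead of being kept in a register
__device__ volatile int start = 0;
__device__ volatile int end = 0;

__global__ void parallel_scan()
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        start = 0;
        end = blockDim.x;     // initialize once, from a single thread
        __threadfence();      // make the initialization visible to the other block
    }

    int u_idx = blockIdx.x * blockDim.x + threadIdx.x;

    while (1) {
        if (u_idx >= start && u_idx < end) {
            if (threadIdx.x == 0) {
                start = end;
                end = 32;
                __threadfence();   // make the update visible to the other block
            }
        }
        __syncthreads();
        if (end == 32) {
            break;
        }
    }
}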