Threads go missing inside a loop

I have my threads defined such that they are sequential over multiple blocks.

My problem is that only the first 512 threads enter a loop (for/while). The others seem to vanish.

THIS IS A HUGE PROBLEM FOR ME BECAUSE I NEED ALL THE THREADS TO ENTER A LOOP

I will show you what I mean. I have this defined in my kernel to make the threads sequential:

        /* Thread indexes within the block */
        int x = threadIdx.x;
        int y = threadIdx.y;
        int z = threadIdx.z;

        /* Block dimensions (how many threads per block) */
        int bdx = blockDim.x;
        int bdy = blockDim.y;
        int bdz = blockDim.z;

        /* Block indexes within the grid */
        int bx = blockIdx.x;
        int by = blockIdx.y;

        /* Grid dimension (how many blocks in x) */
        int gdx = gridDim.x;

        /* Sequential index for threads spanning multiple blocks */
        int tid = by*gdx*(bdx*bdy*bdz) + bx*(bdx*bdy*bdz) + z*(bdx*bdy) + bdx*y + x;

Now this works perfectly! All threads have sequential numbering. I can prove that this works by inserting the following code into the kernel (I'm using emulation mode to debug):

        if (tid == 0) printf("-------HERE ARE THE THREADS-----\n");
        printf("|%i", tid);
        __syncthreads();
        if (tid == 0) printf("\n");

If the kernel is given the configuration <<<(2,2),512>>>, so that we can expect 2048 threads to be created (2*2*512 = 2048),

then the output is:

|0|1|2|3|4|5|6|7|8|9|10|11…all the way to…2042|2043|2044|2045

|2046|2047

GREAT!! This is what it is supposed to do.

Now put that same code snippet into the insides of a loop (within the kernel). It seems as though only the first 512 threads enter the loop. The output is:

|0|1|2|3|4|5|6|7|8|9|10|11…all the way to…506|507|508|509|510|511

Can anyone help me? This is a HUGE problem for me.

You did not make them truly sequential; you gave each thread a unique number. Which is good, of course: this way you can, for example, index a unique position in an array or matrix. Which point in that array/matrix gets used first is of course undefined.

Could you post the entire code?

FULL TEST CASE CODE

Basically, look at the kernel: ‘mykernel()’.

To simulate this you need to either comment out the first call to show() (to show the threads in the loop) OR you need to comment out the entire loop and uncomment the show() above it.

You will fail to see what I am talking about if you uncomment both!

(Run this in emulation mode due to the printf statements)

DON'T FORGET TO ADD THE ‘-arch sm_11’ FLAG WHEN COMPILING, DUE TO THE CALL TO atomicAdd().

__device__ void show(){
        /* Thread indexes within the block */
        int x = threadIdx.x;
        int y = threadIdx.y;
        int z = threadIdx.z;

        /* Block dimensions (how many threads per block) */
        int bdx = blockDim.x;
        int bdy = blockDim.y;
        int bdz = blockDim.z;

        /* Block indexes within the grid */
        int bx = blockIdx.x;
        int by = blockIdx.y;

        /* Grid dimension (how many blocks in x) */
        int gdx = gridDim.x;

        /* Sequential index for threads spanning multiple blocks */
        int tid = by*gdx*(bdx*bdy*bdz) + bx*(bdx*bdy*bdz) + z*(bdx*bdy) + bdx*y + x;

        printf("%i|", tid);
        __syncthreads();
}

__global__ void mykernel(){
        int x = threadIdx.x;
        int y = threadIdx.y;
        int z = threadIdx.z;
        int bx = blockIdx.x;
        int by = blockIdx.y;

        int eq = 0;

        /* This will print 0|1|2|3|.......|2046|2047| */
        //show(); // UNCOMMENT THIS TO SHOW THE FULL THREAD COUNT

        // COMMENT THE ENTIRE WHILE LOOP OUT /* */ AND UNCOMMENT
        // THE CALL TO show() (ABOVE) TO SEE THE FULL THREAD COUNT
        while(eq < 4){
                // if tid == 0
                if(x==0 && y==0 && z==0 && bx==0 && by==0){
                        atomicAdd((int*)&eq, 1);
                        printf("\n========LOOP NUMBER: %i===========\n", eq);
                }
                // For some reason this only gives 0|1|2|....|509|510|511|
                show();
        }
}

int main(){
        // Total of 2048 threads
        dim3 dimBlock(8,8,8);
        dim3 dimGrid(2,2);

        // Launch kernel with 2*2*8*8*8 = 2048 threads
        mykernel<<<dimGrid,dimBlock>>>();

        // Wait for all threads to finish the kernel
        cudaThreadSynchronize();

        printf("\n");
}

(DELETED POST)

I’m surprised this works. You do an atomic add on a local variable; why? And you do this only for the thread with id (0,0,0) in block (0,0), so the other threads will never reach eq=4. Or is atomicAdd((int*)&eq,1) automagically mapped to some shared/global memory space?

This is wrong in a lot of places, and it is a good example of how device emulation is not actually doing what the card is doing (I think; I’m making assumptions about how device emulation works given undefined behavior--in future versions, it might just start displaying the Amiga bouncing ball demo)…

  1. atomicAdd operates on a global memory location or a shared memory location (as of Compute 1.2). You’re using it on a register? That’s not really meaningful and is certainly undefined! (as a result, when running this on the card, it can display the bouncing ball demo! that is legitimate behavior!)

  2. So, if eq is a register, operating on it atomically doesn’t make sense. But then, your condition doesn’t make any sense, either! It’s true for exactly one thread in your grid. I think you’ve confused device emulation, and it’s assuming eq is actually a shared variable–incorrect, but hey, behavior is undefined!

  3. And if you try to do some form of global synchronization by making a single thread perform some operation and having every other block and thread wait on that, it won’t work as you expect. Either that first block will execute and do what it needs to do and basically everything won’t synchronize, or more likely you’ll deadlock the card. (I need to make a FAQ, and this should be question number one because it occurs to everyone and then everyone realizes or is told that it is a bad idea.)

  4. Man, I wish device emulation would randomly display that old demo. That would make using it a lot more fun. I love that demo.

  5. When I ran this on Vista, it deadlocked the card and the driver reset. (yeah, see why the bouncing ball demo would have been better?)

edit: I hope this didn’t come across as harsh, I was just thinking about “things that would be awesome if caused by undefined behavior.” number two on that list? ponies. but yeah, just review the hardware execution model and check where you can use the intrinsics a bit more carefully, and hopefully things will make sense then.

OK. I’ll try again, this time using a for-loop and a global variable.

It still errors

Try this:

#include <stdio.h>

#include <stdlib.h>

#include <cutil.h>

__device__ void show(){
        /* Thread indexes within the block */
        int x = threadIdx.x;
        int y = threadIdx.y;
        int z = threadIdx.z;

        /* Block dimensions (how many threads per block) */
        int bdx = blockDim.x;
        int bdy = blockDim.y;
        int bdz = blockDim.z;

        /* Block indexes within the grid */
        int bx = blockIdx.x;
        int by = blockIdx.y;

        /* Grid dimension (how many blocks in x) */
        int gdx = gridDim.x;

        /* Sequential index for threads spanning multiple blocks */
        int tid = by*gdx*(bdx*bdy*bdz) + bx*(bdx*bdy*bdz) + z*(bdx*bdy) + bdx*y + x;

        printf("%i|", tid);
        __syncthreads();
}

__global__ void mykernel(int *gvar){
        //show(); // UNCOMMENT THIS TO SHOW THE FULL THREAD COUNT (IF THE LOOP BELOW IS COMMENTED OUT)

        for(*gvar = 0; *gvar < 5; (*gvar)++){
                if(threadIdx.x==0 && threadIdx.y==0 && threadIdx.z==0 && blockIdx.x==0 && blockIdx.y==0)
                        printf("\n========LOOP NUMBER===========\n");
                show();
        }
}

int main(){
        int *gvar;

        // Total of 2048 threads
        dim3 dimBlock(8,8,8);
        dim3 dimGrid(2,2);

        cudaMalloc((void**)&gvar, sizeof(int));

        // Launch kernel with 2*2*8*8*8 = 2048 threads
        mykernel<<<dimGrid,dimBlock>>>(gvar);

        // Wait for all threads to finish the kernel
        cudaThreadSynchronize();

        printf("\n");

        cudaFree(gvar);
}

Consider what happens if multiple threads access that (gvar) concurrently–some threads will read old values and try to store an incorrect value. In other words, race condition!

OK, that aside. Pretending that I did have a loop without problems: what the hell is happening to the threads? Why do they only reach 512?

Side note: how do you do a loop without race conditions?

I agree with tmurray: please (re-)read the programming guide and try to understand the different algorithms in the SDK. This will help you solve your problems, as there is no magic trick to ‘do a loop without race conditions’ etc.

OK, fine. Put the question about race-conditioned loops aside.

Why are the threads only reaching 512 if put inside a loop?

It is something the programming guide doesn't specify.

So, just guessing--it outputs five times. Device emulation runs the threads of a block in order, then switches to a different block when a __syncthreads() is reached. So what happens is:

block 0:

  • enter loop: gvar = 0

  • thread 0 prints loop number 0, show, increment gvar

  • next loop: gvar = 1, and so on up through gvar = 4

  • after the fifth pass: gvar = 5, exit loop

  • block 0 finishes

block 1:

  • enter loop: gvar = 5, whoops we never see threads 512 through 1023

block 2:

  • enter loop: gvar = 5, whoops we never see threads 1024 through 1535

block 3:

  • enter loop: gvar = 5, whoops we never see threads 1536 through 2047

On the device, all four blocks will start at once, and the behavior will be completely random (totally dependent on what gets scheduled where and when, which can change from execution to execution).

Does that make sense?

It sounds like this is your first real exposure to concurrent programming; you probably need a stronger background before CUDA will make a lot of sense to you. Not sure what the best resource is for that, but a generic introduction to multithreaded programming and synchronization would probably contain information on deadlock and race conditions.

Yes that makes total sense now.

Thanks heaps!

To show you that it’s your code that’s broken, this uses a loop counter that’s local to each thread instead of a global counter or anything like that, so no race conditions.

#include <stdlib.h>

#include <stdio.h>

__device__ void show(){
        /* Thread indexes within the block */
        int x = threadIdx.x;
        int y = threadIdx.y;
        int z = threadIdx.z;

        /* Block dimensions (how many threads per block) */
        int bdx = blockDim.x;
        int bdy = blockDim.y;
        int bdz = blockDim.z;

        /* Block indexes within the grid */
        int bx = blockIdx.x;
        int by = blockIdx.y;

        /* Grid dimension (how many blocks in x) */
        int gdx = gridDim.x;

        /* Sequential index for threads spanning multiple blocks */
        int tid = by*gdx*(bdx*bdy*bdz) + bx*(bdx*bdy*bdz) + z*(bdx*bdy) + bdx*y + x;

        printf("%i|", tid);
        __syncthreads();
}

__global__ void mykernel(){
        //show(); // UNCOMMENT THIS TO SHOW THE FULL THREAD COUNT (IF THE LOOP BELOW IS COMMENTED OUT)

        // Loop counter is local to each thread: nothing shared, no race
        int gvar = 0;
        for(gvar = 0; gvar < 5; gvar++){
                if(threadIdx.x==0 && threadIdx.y==0 && threadIdx.z==0 && blockIdx.x==0 && blockIdx.y==0)
                        printf("\n========LOOP NUMBER===========\n");
                show();
        }
}

int main(){
        // Total of 2048 threads
        dim3 dimBlock(8,8,8);
        dim3 dimGrid(2,2);

        // Launch kernel with 2*2*8*8*8 = 2048 threads
        mykernel<<<dimGrid,dimBlock>>>();

        // Wait for all threads to finish the kernel
        cudaThreadSynchronize();

        printf("\n");
}

It outputs a whole lot of stuff, but you’ll see that it does end with …|2047.