I have my threads defined such that they are sequential over multiple blocks.
My problem is that only the first 512 threads enter a loop (for/while). The others seem to vanish.
THIS IS A HUGE PROBLEM FOR ME BECAUSE I NEED ALL THE THREADS TO ENTER A LOOP
I will show you what I mean. I have this defined for my threads to make them sequential:
/* Init threads */
int x = threadIdx.x;
int y = threadIdx.y;
int z = threadIdx.z;
/* Init dimension of block (how many threads) */
int bdx = blockDim.x;
int bdy = blockDim.y;
int bdz = blockDim.z;
/* Init block indexes */
int bx = blockIdx.x;
int by = blockIdx.y;
/* Init dimension of grid (how many blocks) */
int gdx = gridDim.x;
int tid = by*gdx*(bdx*bdy*bdz) + bx*(bdx*bdy*bdz) + z*(bdx*bdy) + bdx*y + x;
Now this works perfectly! All threads have sequential numbering. I can prove that this works by inserting the following code into the kernel (I'm using emu mode to debug):
if(tid==0)printf("-------HERE ARE THE THREADS-----\n");
printf("|%i",tid);
__syncthreads();
if(tid==0)printf("\n");
If the kernel is given the configuration <<<(2,2),512>>>, so that we can expect 2048 threads to be created (2*2*512 = 2048),
then the output is:
|0|1|2|3|4|5|6|7|8|9|10|11…all the way to…2042|2043|2044|2045
|2046|2047
GREAT!! This is what it is supposed to do.
Now put that same code snippet inside a loop (within the kernel). It seems as though only the first 512 threads enter the loop. The output is:
|0|1|2|3|4|5|6|7|8|9|10|11…all the way to…506|507|508|509|510|511
You did not really make them sequential, but gave each thread a unique number. Which is good, of course: this way you can, for example, index a unique position in an array or matrix. Which point in this array/matrix is used first is, of course, undefined.
To simulate this you need to either comment out the first call to show() (to show the threads in the loop) OR comment out the entire loop and uncomment the show() above it.
You will fail to see what I am talking about if you uncomment both!
(Run this in emulation mode due to the printf statements)
DON'T FORGET TO PUT THE '-arch sm_11' FLAG WHEN COMPILING, DUE TO THE CALL TO atomicAdd().
__device__ void show(){
/* Init threads */
int x = threadIdx.x;
int y = threadIdx.y;
int z = threadIdx.z;
/* Init dimension of block (how many threads) */
int bdx = blockDim.x;
int bdy = blockDim.y;
int bdz = blockDim.z;
/* Init block indexes */
int bx = blockIdx.x;
int by = blockIdx.y;
/* Init dimension of grid (how many blocks) */
int gdx = gridDim.x;
/* Get sequential index for threads spanning multiple blocks */
int tid = by*gdx*(bdx*bdy*bdz) + bx*(bdx*bdy*bdz) + z*(bdx*bdy) + bdx*y + x;
printf("%i|",tid);
__syncthreads();
}
__global__ void mykernel(){
int x = threadIdx.x;
int y = threadIdx.y;
int z = threadIdx.z;
int bx = blockIdx.x;
int by = blockIdx.y;
int eq=0;
/* This will print 0|1|2|3|.......|2046|2047| */
//show(); // UNCOMMENT THIS TO SHOW THE FULL THREAD COUNT
// COMMENT THE ENTIRE WHILE LOOP OUT AND UNCOMMENT
// THE CALL TO show() (ABOVE) TO SEE THE FULL THREAD COUNT
while(eq<4){
// if tid == 0
if(x==0 && y==0 && z==0 && bx==0 && by==0){
atomicAdd((int*)&eq,1);
printf("\n========LOOP NUMBER: %i===========\n",eq);
}
// For some reason this only gives 0|1|2|....|509|510|511|
show();
}
}
int main(){
// Total of 2048 threads
dim3 dimBlock(8,8,8);
dim3 dimGrid(2,2);
// Launch kernel with 2*2*8*8*8 = 2048 threads
mykernel<<<dimGrid,dimBlock>>>();
// Wait for threads to come out of the kernel
cudaThreadSynchronize();
printf("\n");
}
I'm surprised this works. You do an atomic add on a local variable, why? And you do this only for the thread with id (0,0,0) in block (0,0), so the other threads will never reach eq=4. Or is atomicAdd((int*)&eq,1); automagically mapped to some shared/global memory space?
This is wrong in a lot of places, and it is a good example of how device emulation is not actually doing what the card is doing (I think; I'm making assumptions about how device emulation works given undefined behavior. In future versions, it might just start displaying the Amiga bouncing ball demo)...
atomicAdd operates on a global memory location or a shared memory location (as of Compute 1.2). You’re using it on a register? That’s not really meaningful and is certainly undefined! (as a result, when running this on the card, it can display the bouncing ball demo! that is legitimate behavior!)
So, if eq is a register, operating on it atomically doesn't make sense. But then, your condition doesn't make any sense, either! It's true for exactly one thread in your grid. I think you've confused device emulation, and it's assuming eq is actually a shared variable. Incorrect, but hey, behavior is undefined!
And if you try to do some form of global synchronization by making a single thread perform some operation and having every other block and thread wait on that, it won’t work as you expect. Either that first block will execute and do what it needs to do and basically everything won’t synchronize, or more likely you’ll deadlock the card. (I need to make a FAQ, and this should be question number one because it occurs to everyone and then everyone realizes or is told that it is a bad idea.)
Man, I wish device emulation would randomly display that old demo. That would make using it a lot more fun. I love that demo.
When I ran this on Vista, it deadlocked the card and the driver reset. (yeah, see why the bouncing ball demo would have been better?)
edit: I hope this didn’t come across as harsh, I was just thinking about “things that would be awesome if caused by undefined behavior.” number two on that list? ponies. but yeah, just review the hardware execution model and check where you can use the intrinsics a bit more carefully, and hopefully things will make sense then.
OK. I'll try again, this time using a for-loop and a global variable.
It still fails.
Try this:
#include <stdio.h>
#include <stdlib.h>
#include <cutil.h>
__device__ void show(){
/* Init threads */
int x = threadIdx.x;
int y = threadIdx.y;
int z = threadIdx.z;
/* Init dimension of block (how many threads) */
int bdx = blockDim.x;
int bdy = blockDim.y;
int bdz = blockDim.z;
/* Init block indexes */
int bx = blockIdx.x;
int by = blockIdx.y;
/* Init dimension of grid (how many blocks) */
int gdx = gridDim.x;
/* Get sequential index for threads spanning multiple blocks */
int tid = by*gdx*(bdx*bdy*bdz) + bx*(bdx*bdy*bdz) + z*(bdx*bdy) + bdx*y + x;
printf("%i|",tid);
__syncthreads();
}
__global__ void mykernel(int *gvar){
//show(); // UNCOMMENT THIS TO SHOW THE FULL THREAD COUNT (IF THE LOOP BELOW IS COMMENTED OUT)
for(*gvar=0;*gvar<5;*gvar++){
if(threadIdx.x==0&&threadIdx.y==0&&threadIdx.z==0&&blockIdx.x==0&&blockIdx.y==0)
printf("\n========LOOP NUMBER===========\n");
show();
}
}
int main(){
int *gvar;
// Total of 2048 threads
dim3 dimBlock(8,8,8);
dim3 dimGrid(2,2);
cudaMalloc((void**)&gvar,sizeof(int));
// Launch kernel with 2*2*8*8*8 = 2048 threads
mykernel<<<dimGrid,dimBlock>>>(gvar);
// Wait for threads to come out of the kernel
cudaThreadSynchronize();
printf("\n");
cudaFree(gvar);
}
Consider what happens if multiple threads access that (gvar) concurrently: some threads will read old values and try to store an incorrect value. In other words, a race condition!
I agree with tmurray: please (re-)read the programming guide and try to understand the different algorithms in the SDK. This will help you solve your problems, as there is no magic trick to ‘do a loop without race conditions’ etc.
So, just guessing: it outputs five times. Device emu runs threads in a block in order, then switches to a different block when a __syncthreads() is reached. So what happens is:
block 0:
enter loop: gvar = 0
thread 0 prints loop number 0, show, increment gvar
next loop: gvar = 1, repeat
…
next loop: gvar = 5, exit loop
block 0 finishes
block 1:
enter loop: gvar = 5, whoops we never see threads 512 through 1023
block 2:
enter loop: gvar = 5, whoops we never see threads 1024 through 1535
block 3:
enter loop: gvar = 5, whoops we never see threads 1536 through 2047
On the device, all four blocks will start at once, and the behavior will be completely random (totally dependent on what gets scheduled where and when, which can change from execution to execution).
Does that make sense?
It sounds like this is your first real exposure to concurrent programming; you probably need a stronger background before CUDA will make a lot of sense to you. Not sure what the best resource is for that, but a generic introduction to multithreaded programming and synchronization would probably contain information on deadlock and race conditions.
To show you that it's your code that's broken: this version uses a loop counter that's local to each thread instead of a global counter or anything like that, so there are no race conditions.
#include <stdlib.h>
#include <stdio.h>
__device__ void show(){
   /* Init threads */
   int x = threadIdx.x;
   int y = threadIdx.y;
   int z = threadIdx.z;
   /* Init dimension of block (how many threads) */
   int bdx = blockDim.x;
   int bdy = blockDim.y;
   int bdz = blockDim.z;
   /* Init block indexes */
   int bx = blockIdx.x;
   int by = blockIdx.y;
   /* Init dimension of grid (how many blocks) */
   int gdx = gridDim.x;
   /* Get sequential index for threads spanning multiple blocks */
   int tid = by*gdx*(bdx*bdy*bdz) + bx*(bdx*bdy*bdz) + z*(bdx*bdy) + bdx*y + x;
   printf("%i|",tid);
   __syncthreads();
}
__global__ void mykernel(){
   //show(); // UNCOMMENT THIS TO SHOW THE FULL THREAD COUNT (IF THE LOOP BELOW IS COMMENTED OUT)
 int gvar = 0;
   for(gvar=0;gvar<5;gvar++){
       if(threadIdx.x==0&&threadIdx.y==0&&threadIdx.z==0&&blockIdx.x==0&&blockIdx.y==0)
           printf("\n========LOOP NUMBER===========\n");
       show();
   }
}
int main(){
   // Total of 2048 threads
   dim3 dimBlock(8,8,8);
   dim3 dimGrid(2,2);
   // Launch kernel with 2*2*8*8*8 = 2048 threads
   mykernel<<<dimGrid,dimBlock>>>();
   // Wait for threads to come out of the kernel
   cudaThreadSynchronize();
   printf("\n");
}
It outputs a whole lot of stuff, but you'll see that it does end with ...|2047.