this code resets my computer

Hi

Can someone tell me why this code freezes and resets my computer?
I am using 4 blocks to execute it, and my GPU has 4 multiprocessors.

__global__ static void P7Viterbi_cuda_device_loop_two(int *imx, int *mmx, int *dmx, int *no_elements_Y_device, int *no_elements_X_device, int *xmx, unsigned char *dsq, int *xsc, int *tsc, int *bsc, int *msc, int *isc, int *esc, int *lock){

int tx = (blockDim.x * blockIdx.x) + threadIdx.x;

int k;
int count = 0;

.
.
.

	for (k = 1; k <= *no_elements_Y_device; k++) {
		  
		  count = count + 4;
	.
        .
            .
            .
	  
		  __syncthreads();
		  
		  if(threadIdx.x == 1)
			atomicAdd(lock, 1);
			
		  while(count != *lock);
		  
	}

}

You have a while with a ; behind it, inside a for loop?
What is inside the while loop?

It looks like you have lots of possibilities for deadlocks, as far as I can see.

This is a very interesting topic, because I have also had some resets of my computer for no apparent reason. So maybe this is a bug in CUDA… I still don’t know why my code sometimes resets my computer. It does not reach the 5-second watchdog but resets before that.

BTW I’m not using any while or for loops in my kernel / code

Please tell us.

Hi, thank you for your response.

The purpose of that while is to stop the threads until all 4 blocks have reached the same point.

Actually, if I remove the for loop, the while does not cause problems and the code executes perfectly. However, I need that for loop.

Is this valid?

Please, does anyone know what is going on?
Why is this getting stuck and freezing my computer?

Can you post an indented version between CODE tags? That will be more readable, so it might be easier to see what you are trying to do.

Actually, I removed all the rest of the code and it still freezes. I am trying to run it just as follows.

int tx = (blockDim.x * blockIdx.x) + threadIdx.x;

int k;
int count = 0;

for (k = 1; k <= *no_elements_Y_device; k++) {

	count = count + 4;

	__syncthreads();

	if(threadIdx.x == 1)
		atomicAdd(lock, 1);

	while(count != *lock);

}
}

The full code is supposed to look like this:

__global__ static void P7Viterbi_cuda_device_loop_two(int *imx, int *mmx, int *dmx, int *no_elements_Y_device, int *no_elements_X_device, int *xmx, unsigned char *dsq, int *xsc, int *tsc, int *bsc, int *msc, int *isc, int *esc, int *lock){

		int tx = (blockDim.x * blockIdx.x) + threadIdx.x;
		
		int i = tx * (*no_elements_Y_device + 2);
		int i_minus_one = (tx-1) * (*no_elements_Y_device + 2);
		
		int i_xmx = tx * 4;
		int i_xmx_minus_one = (tx-1) * 4;
		
		int k, count = 0;
	
	  int sc;
	  int  *mc, *dc, *ic;        /* pointers to rows of mmx, dmx, imx */
	  int  *ms, *is;             /* pointers to msc[i], isc[i] */
	  int  *mpp, *mpc, *ip;      /* ptrs to mmx[i-1], mmx[i], imx[i-1] */
	  int  *bp;		     /* ptr into bsc[] */
	  int  *ep;                  /* ptr into esc[] */
	  int   xmb;		     /* value of xmx[i-1][XMB] */
	  int   xme;                 /* max for xmx[i][XME] */
	  int  *dpp;                 /* ptr into dmx[i-1] (previous row) */
	  int  *tpmm, *tpmi, *tpmd, *tpim, *tpii, *tpdm, *tpdd; /* ptrs into tsc */

if((i > 0) && (i <= ((*no_elements_Y_device + 2)* (*no_elements_X_device)))){		
		
		tpmm  = &tsc[TMM];
		tpim  = &tsc[TIM];
		tpdm  = &tsc[TDM];
		tpmd  = &tsc[TMD];
		tpdd  = &tsc[TDD];
		tpmi  = &tsc[TMI];
		tpii  = &tsc[TII];
		bp    = bsc;
		
		mc    = &mmx[i];    
		dc    = &dmx[i];
		ic    = &imx[i];
		mpp   = &mmx[i_minus_one];
		dpp   = &dmx[i_minus_one];
		ip    = &imx[i_minus_one];
		xmb   = xmx[i_xmx + XMB];
		ms    = &msc[dsq[tx]];
		is    = &isc[dsq[tx]];
		mc[0] = -INFTY;
		dc[0] = -INFTY;
		ic[0] = -INFTY;

	for (k = 1; k <= *no_elements_Y_device; k++) {
		  
		  count = count + 4;
		  
		  mc[k] = mpp[k-1]   + tpmm[k-1];
		  if ((sc = ip[k-1]  + tpim[k-1]) > mc[k])  mc[k] = sc;
		  if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k])  mc[k] = sc;
		  if ((sc = xmb  + bp[k])         > mc[k])  mc[k] = sc; 
		  mc[k] += ms[k];
		  if (mc[k] < -INFTY) mc[k] = -INFTY;  

		  dc[k] = dc[k-1] + tpdd[k-1];
		  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
		  if (dc[k] < -INFTY) dc[k] = -INFTY;  

		  if (k < (*no_elements_Y_device)) {
		  ic[k] = mpp[k] + tpmi[k];
		  if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; 
		  ic[k] += is[k];
		  if (ic[k] < -INFTY) ic[k] = -INFTY; 
		  }
		  
		  __syncthreads();
		  
		  if(threadIdx.x == 1)
			atomicAdd(lock, 1);
			
		  while(count != *lock);
		  
	}
	}
}

Basically, [i][k] depends on [i-1][k-1], so I am trying to use the 4 multiprocessors (using 4 blocks) in a synchronized manner to compute [all i’s][k], then [all i’s][k+1], then [all i’s][k+2], and so on.

Basically, when a block finishes all i’s for a given k, it adds 1 to lock (we have 4 blocks, so count will be 4 on the first loop). Then, when count == lock, all blocks have computed all i’s for that k and can go compute all i’s for the next k+1, and again each block adds 1 to lock to make 8…
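In code, the scheme I mean is something like this (just a sketch, with made-up names; it assumes *lock starts at 0 and that all 4 blocks are resident on the GPU at the same time, otherwise a running block would spin forever on a block that never gets scheduled):

```cuda
// Sketch of the counter-based block barrier described above (hypothetical
// kernel name and parameters, not the real code).
__global__ void block_barrier_sketch(int *lock, int n_k)
{
    int count = 0;
    for (int k = 1; k <= n_k; k++) {
        count += gridDim.x;              // expected value of *lock after this round

        /* ... compute all i's for this k ... */

        __syncthreads();                 // everyone in this block finished this k
        if (threadIdx.x == 0)
            atomicAdd(lock, 1);          // one vote per block
        while (*(volatile int *)lock < count)
            ;                            // spin until every block has voted
    }
}
```

The volatile cast in the spin loop matters: without it the compiler may cache *lock in a register and never see the other blocks’ increments.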

get it??

Can you post the host code that calls the simplified kernel?

Use the CODE button above your textfield, it really helps ;)

__global__ static void P7Viterbi_cuda_device_loop_two(int *imx, int *mmx, int *dmx, int *no_elements_Y_device, int *no_elements_X_device, int *xmx, unsigned char *dsq, int *xsc, int *tsc, int *bsc, int *msc, int *isc, int *esc, int *lock){

	int tx = (blockDim.x * blockIdx.x) + threadIdx.x;

	int i = tx * (*no_elements_Y_device + 2);
	int i_minus_one = (tx-1) * (*no_elements_Y_device + 2);

	int i_xmx = tx * 4;
	int i_xmx_minus_one = (tx-1) * 4;

	int k, count = 0;

	int sc;
	int *mc, *dc, *ic;        /* pointers to rows of mmx, dmx, imx */
	int *ms, *is;             /* pointers to msc[i], isc[i] */
	int *mpp, *mpc, *ip;      /* ptrs to mmx[i-1], mmx[i], imx[i-1] */
	int *bp;                  /* ptr into bsc[] */
	int *ep;                  /* ptr into esc[] */
	int xmb;                  /* value of xmx[i-1][XMB] */
	int xme;                  /* max for xmx[i][XME] */
	int *dpp;                 /* ptr into dmx[i-1] (previous row) */
	int *tpmm, *tpmi, *tpmd, *tpim, *tpii, *tpdm, *tpdd; /* ptrs into tsc */

	if((i > 0) && (i <= ((*no_elements_Y_device + 2) * (*no_elements_X_device)))){

		tpmm  = &tsc[TMM];
		tpim  = &tsc[TIM];
		tpdm  = &tsc[TDM];
		tpmd  = &tsc[TMD];
		tpdd  = &tsc[TDD];
		tpmi  = &tsc[TMI];
		tpii  = &tsc[TII];
		bp    = bsc;

		mc    = &mmx[i];
		dc    = &dmx[i];
		ic    = &imx[i];
		mpp   = &mmx[i_minus_one];
		dpp   = &dmx[i_minus_one];
		ip    = &imx[i_minus_one];
		xmb   = xmx[i_xmx + XMB];
		ms    = &msc[dsq[tx]];
		is    = &isc[dsq[tx]];
		mc[0] = -INFTY;
		dc[0] = -INFTY;
		ic[0] = -INFTY;

		for (k = 1; k <= *no_elements_Y_device; k++) {

			count = count + 4;

			mc[k] = mpp[k-1] + tpmm[k-1];
			if ((sc = ip[k-1]  + tpim[k-1]) > mc[k]) mc[k] = sc;
			if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
			if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
			mc[k] += ms[k];
			if (mc[k] < -INFTY) mc[k] = -INFTY;

			dc[k] = dc[k-1] + tpdd[k-1];
			if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
			if (dc[k] < -INFTY) dc[k] = -INFTY;

			if (k < (*no_elements_Y_device)) {
				ic[k] = mpp[k] + tpmi[k];
				if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
				ic[k] += is[k];
				if (ic[k] < -INFTY) ic[k] = -INFTY;
			}

			__syncthreads();

			if(threadIdx.x == 1)
				atomicAdd(lock, 1);

			while(count != *lock);

		}
	}
}

I think it is wiser to find another way to implement your algorithm, because performance will suffer anyway when using only the number of blocks that the GPU can run at a single time.

That being said, maybe a __syncthreads() before the while might help?

Thanks Denis.

Now that I can actually read the code, it appears you are calling __syncthreads() from inside a branching block. This is a no-no. Think about what happens if some of the threads take the branch and some don’t. I’m pretty sure this is mentioned somewhere in the programming guide.

This applies to both if statements and loops if all threads in a warp don’t execute the same number of iterations.

This may or may not be causing the specific problem you mention, but it is definitely a problem.
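To make the point concrete (the threadIdx.x < 16 condition here is just a made-up example, not from your code):

```cuda
// BAD: the condition differs across threads of the block, so the threads
// that skip the branch never reach this barrier, and the ones inside it
// wait forever -> hang / undefined behavior.
if (threadIdx.x < 16) {
    /* ... work ... */
    __syncthreads();
}

// OK: the branch may diverge, but every thread in the block reaches the
// same barrier afterwards.
if (threadIdx.x < 16) {
    /* ... work ... */
}
__syncthreads();
```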

I see. Is there any chance I can use locks (like I’m doing for the blocks) to synchronize threads inside a loop, instead of using __syncthreads()?

Thanks a lot.

I simply do not understand why it is still getting stuck (it’s not freezing my computer anymore); it simply stops working.

I changed the code to this:

	int tx = (blockDim.x * blockIdx.x) + threadIdx.x;

	int k = 1;
	int count = 0;

	NOT_DONE:
	
	count = count + (gridDim.x * blockDim.x);

	atomicAdd(lock, 1);

	while(count != *lock);
	
	k++;
	if(k <= 3) goto NOT_DONE;

Now basically all threads should stop at while(count != *lock); and wait until every thread has added one to make up the total number of threads, so each thread knows it can continue execution…

If I remove the looping part, the code runs well… but it is only when I try to loop it that the system basically crashes…

Can it be that a loop is performed atomically on each thread?

Does anyone have any idea about the inner workings of thread scheduling that would make my technique not work…

The funny thing is that if I cut and paste the code like the following, it works, but if I loop it 3 times instead, it won’t:

	count = count + (gridDim.x * blockDim.x);

	atomicAdd(lock, 1);

	while(count != *lock);

	count = count + (gridDim.x * blockDim.x);

	atomicAdd(lock, 1);

	while(count != *lock);

	count = count + (gridDim.x * blockDim.x);

	atomicAdd(lock, 1);

	while(count != *lock);

My code makes sense, doesn’t it?

(By the way, I don’t see any special option to post code; how do you post it?)

The CUDA programming guide has plenty of details on the threading model: warps of 32 threads are executed all in one instruction and blocks of many warps are executed by interleaving memory reads and execution.

Blocks are completely independent and no inter-block synchronization can be done except by allowing the kernel to finish. You might search the forums for many other unsuccessful attempts to synchronize blocks. In this light, your code doesn’t make any sense. It will never run efficiently on the data-parallel GPU architecture.

Anyways, to solve your specific crashing issue, you may want to check the ptx assembly output of nvcc. It may be reading *lock once from global memory and caching that value in a register, thus it never checks for a change in the value of *lock in the loop and loops forever. The volatile keyword might work here, but I’m not sure.
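What I mean is something along these lines (a sketch; either form should force a fresh load of *lock on every iteration of the spin loop instead of spinning on a stale register value):

```cuda
// Option 1: cast to volatile at the point of the read, so the compiler
// cannot cache *lock in a register across iterations of the spin loop.
while (count != *(volatile int *)lock)
    ;

// Option 2: declare the kernel parameter itself volatile, e.g.
// __global__ void kernel(..., volatile int *lock) { ... }
```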

Errant pointers (walking off the edge of the PDC etc.) on my dual-G92 box cause very odd behavior. Sometimes a reboot, sometimes a hang (GPU hang, no mouse movement). The oddest one: every time I ran the CUDA executable, the audio-jack auto-insert detection dialog box popped up. 100% repeatable. Odd stuff, but always related to pointers walking off into the weeds.

“In this light, your code doesn’t make any sense. It will never run efficiently on the data-parallel GPU architecture.”

Then I’m ****ed. My prof wants me to somehow synchronize the blocks and extract some performance out of it. hehe :’(

What you said about the register makes sense. I added the volatile int type with no luck.

any other idea…

thanks a lot, by the way

I don’t know how often people use this, but it usually helps me when my program is having some mysterious errors. This line should go into the code before and after calling your kernel (if you can see something before the computer resets):

printf( "%s\n", cudaGetErrorString( cudaGetLastError() ) );
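One caveat with that (as far as I know, kernel launches are asynchronous): to catch errors from the kernel itself, synchronize before reading the error. Something like this (the launch line is just a placeholder):

```cuda
my_kernel<<<grid, threads>>>(/* args */);   // placeholder launch
cudaThreadSynchronize();                    // wait, so kernel errors surface
printf("%s\n", cudaGetErrorString(cudaGetLastError()));
```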

As for posting code in the forum: on the reply page, just click the CODE square, and something like “[ CODE ]” will appear in your reply area. Code should be enclosed between the two tags [ CODE ] [ /CODE ] (without the spaces inside the square brackets).

Well, it means your prof does not understand how it works, otherwise he would not ask ;)

When replying to a post, you see above a light blue bar some things like [B], [I], FONT, SIZE and COLOR.

Below that line you see buttons to insert links, images, quotes and CODE :)

The data-parallel architecture of the GPU is built to process large chunks of data completely independently. By large, I mean 10,000 independent threads large. Just imagine the kinds of stalls you add to the pipeline by adding any sync operation whatsoever… Not to mention that you can’t add syncs, because blocks run to completion before freeing up resources, so a block could be waiting on another block that has yet to run.

Anyways, your code is a little dense, so it is hard to understand exactly what is going on. But I do see that you have order-of-operation dependencies like mc[k] = mpp[k-1] + tpmm[k-1];. These types of operations seem inherently non-parallel (thus the need for your block syncs), but they actually can be parallelized nicely on the GPU. Check out the scan example in the SDK; it performs a simpler operation than you have here, but perhaps the scan algorithm can be adapted to your needs. And read the PDF whitepaper on the scan; it includes all the theory behind it.

thanks…

In that example they do have a __syncthreads() inside a loop:

for (int d = n/2; d > 0; d >>= 1)
{
    __syncthreads();
    /* ... */
}
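I guess the difference is that the loop bounds there depend only on values that are identical for every thread in the block, so all threads execute the same number of iterations and reach each __syncthreads() together. Roughly like this (a simplified sketch in the style of the SDK reduction, with made-up names, not the actual SDK code):

```cuda
// Hypothetical block-wide sum; assumes one block and blockDim.x elements,
// with blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void block_sum(float *data)
{
    extern __shared__ float s[];
    s[threadIdx.x] = data[threadIdx.x];
    __syncthreads();

    // d depends only on blockDim.x, which is the same for every thread,
    // so every thread runs this loop the same number of times.
    for (int d = blockDim.x / 2; d > 0; d >>= 1) {
        if (threadIdx.x < d)                  // divergent branch is fine...
            s[threadIdx.x] += s[threadIdx.x + d];
        __syncthreads();                      // ...because the barrier is outside it
    }

    if (threadIdx.x == 0)
        data[0] = s[0];
}
```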

Just to point out that, yes, in fact making *lock volatile made a difference.

Assembly with lock as a non-volatile int*:

	setp.eq.s32 $p1, $r5, $r8;    // LOCK == COUNT ?
	@$p1 bra $Lt_2_12;            // if LOCK == COUNT, jump to Lt_2_12
$Lt_2_14:
	bra.uni $Lt_2_14;             // the while(LOCK != COUNT) loop

…

Code with lock as a volatile int*:

…

$Lt_2_14:
	ld.global.s32 $r9, [$r2+0];   // reload LOCK from global memory
	setp.ne.s32 $p2, $r9, $r6;    // LOCK != COUNT ?
	@$p2 bra $Lt_2_14;            // if LOCK != COUNT, jump back to Lt_2_14: the while loop

So in fact, without the volatile it never refreshes *lock from memory and would be stuck there forever…

Still, it loops forever though :argh: