this code resets my computer

Hi

Can someone tell me why this code freezes and resets my computer?
I am using 4 blocks to execute it, and my GPU has 4 multiprocessors.

__global__ static void P7Viterbi_cuda_device_loop_two(int *imx, int *mmx, int *dmx, int *no_elements_Y_device, int *no_elements_X_device, int *xmx, unsigned char *dsq, int *xsc, int *tsc, int *bsc, int *msc, int *isc, int *esc, int *lock){

int tx = (blockDim.x * blockIdx.x) + threadIdx.x;

int k;
int count = 0;

.
.
.

	for (k = 1; k <= *no_elements_Y_device; k++) {
		  
		  count = count + 4;
	.
        .
            .
            .
	  
		  __syncthreads();
		  
		  if(threadIdx.x == 1)
			atomicAdd(lock, 1);
			
		  while(count != *lock);
		  
	}

}

You have a while with a ; behind it, inside a for loop?
What is inside the while loop?

It looks like you have lots of possibilities for deadlocks, as far as I can see.

This is a very interesting topic, because I have also had some resets of my computer for no apparent reason. So maybe this is a bug in CUDA… I still don’t know why my code sometimes resets my computer. It does not reach the 5-second watchdog but resets before that.

BTW I’m not using any while or for loops in my kernel / code

Please tell us.

Hi, thank you for your response.

The purpose of that while is to stop the threads until all 4 blocks have reached the same point.

Actually, if I remove the for loop, the while does not cause problems and the code executes perfectly. However, I need that for loop.

Is this valid?

Please, does anyone know what is going on?
Why is this getting stuck and freezing my computer?

Can you post an indented version between CODE tags? That will be more readable, so it might be easier to see what you are trying to do.

Actually, I removed all the rest of the code and it still freezes. I am trying to run it just as follows.

int tx = (blockDim.x * blockIdx.x) + threadIdx.x;

int k;
int count = 0;

for (k = 1; k <= *no_elements_Y_device; k++) {

	count = count + 4;

	__syncthreads();

	if(threadIdx.x == 1)
		atomicAdd(lock, 1);

	while(count != *lock);

}
}

The full code is supposed to look like this:

__global__ static void P7Viterbi_cuda_device_loop_two(int *imx, int *mmx, int *dmx, int *no_elements_Y_device, int *no_elements_X_device, int *xmx, unsigned char *dsq, int *xsc, int *tsc, int *bsc, int *msc, int *isc, int *esc, int *lock){

		int tx = (blockDim.x * blockIdx.x) + threadIdx.x;
		
		int i = tx * (*no_elements_Y_device + 2);
		int i_minus_one = (tx-1) * (*no_elements_Y_device + 2);
		
		int i_xmx = tx * 4;
		int i_xmx_minus_one = (tx-1) * 4;
		
		int k, count = 0;
	
	  int sc;
	  int  *mc, *dc, *ic;        /* pointers to rows of mmx, dmx, imx */
	  int  *ms, *is;             /* pointers to msc[i], isc[i] */
	  int  *mpp, *mpc, *ip;      /* ptrs to mmx[i-1], mmx[i], imx[i-1] */
	  int  *bp;		     /* ptr into bsc[] */
	  int  *ep;                  /* ptr into esc[] */
	  int   xmb;		     /* value of xmx[i-1][XMB] */
	  int   xme;                 /* max for xmx[i][XME] */
	  int  *dpp;                 /* ptr into dmx[i-1] (previous row) */
	  int  *tpmm, *tpmi, *tpmd, *tpim, *tpii, *tpdm, *tpdd; /* ptrs into tsc */

if((i > 0) && (i <= ((*no_elements_Y_device + 2)* (*no_elements_X_device)))){		
		
		tpmm  = &tsc[TMM];
		tpim  = &tsc[TIM];
		tpdm  = &tsc[TDM];
		tpmd  = &tsc[TMD];
		tpdd  = &tsc[TDD];
		tpmi  = &tsc[TMI];
		tpii  = &tsc[TII];
		bp    = bsc;
		
		mc    = &mmx[i];    
		dc    = &dmx[i];
		ic    = &imx[i];
		mpp   = &mmx[i_minus_one];
		dpp   = &dmx[i_minus_one];
		ip    = &imx[i_minus_one];
		xmb   = xmx[i_xmx + XMB];
		ms    = &msc[dsq[tx]];
		is    = &isc[dsq[tx]];
		mc[0] = -INFTY;
		dc[0] = -INFTY;
		ic[0] = -INFTY;

	for (k = 1; k <= *no_elements_Y_device; k++) {
		  
		  count = count + 4;
		  
		  mc[k] = mpp[k-1]   + tpmm[k-1];
		  if ((sc = ip[k-1]  + tpim[k-1]) > mc[k])  mc[k] = sc;
		  if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k])  mc[k] = sc;
		  if ((sc = xmb  + bp[k])         > mc[k])  mc[k] = sc; 
		  mc[k] += ms[k];
		  if (mc[k] < -INFTY) mc[k] = -INFTY;  

		  dc[k] = dc[k-1] + tpdd[k-1];
		  if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
		  if (dc[k] < -INFTY) dc[k] = -INFTY;  

		  if (k < (*no_elements_Y_device)) {
		  ic[k] = mpp[k] + tpmi[k];
		  if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc; 
		  ic[k] += is[k];
		  if (ic[k] < -INFTY) ic[k] = -INFTY; 
		  }
		  
		  __syncthreads();
		  
		  if(threadIdx.x == 1)
			atomicAdd(lock, 1);
			
		  while(count != *lock);
		  
	}
	}
}

Basically, [i][k] depends on [i-1][k-1], so I am trying to use the 4 multiprocessors (using 4 blocks) in a synchronized manner to compute [all i’s][k], then [all i’s][k+1], then [all i’s][k+2], and so on.

Basically, when a block finishes all i’s for a given k, it adds 1 to lock (we have 4 blocks, so count will be 4 on the first loop). Then, when count == lock, all blocks have computed all i’s for that k and can go compute all i’s for the next k+1, and again each block adds 1 to lock to make 8…
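In code, the scheme I mean is something like this (just a sketch, with made-up names; it assumes *lock starts at 0 and that all 4 blocks are resident on the GPU at the same time, otherwise a running block would spin forever on a block that never gets scheduled):

```cuda
// Sketch of the counter-based block barrier described above (hypothetical
// kernel name and parameters, not the real code).
__global__ void block_barrier_sketch(int *lock, int n_k)
{
    int count = 0;
    for (int k = 1; k <= n_k; k++) {
        count += gridDim.x;              // expected value of *lock after this round

        /* ... compute all i's for this k ... */

        __syncthreads();                 // everyone in this block finished this k
        if (threadIdx.x == 0)
            atomicAdd(lock, 1);          // one vote per block
        while (*(volatile int *)lock < count)
            ;                            // spin until every block has voted
    }
}
```

The volatile cast in the spin loop matters: without it the compiler may cache *lock in a register and never see the other blocks’ increments.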

get it??

Can you post the host code that calls the simplified kernel?

Use the CODE button above your textfield, it really helps ;)

__global__ static void P7Viterbi_cuda_device_loop_two(int *imx, int *mmx, int *dmx, int *no_elements_Y_device, int *no_elements_X_device, int *xmx, unsigned char *dsq, int *xsc, int *tsc, int *bsc, int *msc, int *isc, int *esc, int *lock){

	int tx = (blockDim.x * blockIdx.x) + threadIdx.x;

	int i = tx * (*no_elements_Y_device + 2);
	int i_minus_one = (tx-1) * (*no_elements_Y_device + 2);

	int i_xmx = tx * 4;
	int i_xmx_minus_one = (tx-1) * 4;

	int k, count = 0;

	int sc;
	int *mc, *dc, *ic;        /* pointers to rows of mmx, dmx, imx */
	int *ms, *is;             /* pointers to msc[i], isc[i] */
	int *mpp, *mpc, *ip;      /* ptrs to mmx[i-1], mmx[i], imx[i-1] */
	int *bp;                  /* ptr into bsc[] */
	int *ep;                  /* ptr into esc[] */
	int xmb;                  /* value of xmx[i-1][XMB] */
	int xme;                  /* max for xmx[i][XME] */
	int *dpp;                 /* ptr into dmx[i-1] (previous row) */
	int *tpmm, *tpmi, *tpmd, *tpim, *tpii, *tpdm, *tpdd; /* ptrs into tsc */

	if((i > 0) && (i <= ((*no_elements_Y_device + 2) * (*no_elements_X_device)))){

		tpmm  = &tsc[TMM];
		tpim  = &tsc[TIM];
		tpdm  = &tsc[TDM];
		tpmd  = &tsc[TMD];
		tpdd  = &tsc[TDD];
		tpmi  = &tsc[TMI];
		tpii  = &tsc[TII];
		bp    = bsc;

		mc    = &mmx[i];
		dc    = &dmx[i];
		ic    = &imx[i];
		mpp   = &mmx[i_minus_one];
		dpp   = &dmx[i_minus_one];
		ip    = &imx[i_minus_one];
		xmb   = xmx[i_xmx + XMB];
		ms    = &msc[dsq[tx]];
		is    = &isc[dsq[tx]];
		mc[0] = -INFTY;
		dc[0] = -INFTY;
		ic[0] = -INFTY;

		for (k = 1; k <= *no_elements_Y_device; k++) {

			count = count + 4;

			mc[k] = mpp[k-1] + tpmm[k-1];
			if ((sc = ip[k-1]  + tpim[k-1]) > mc[k]) mc[k] = sc;
			if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k]) mc[k] = sc;
			if ((sc = xmb + bp[k]) > mc[k]) mc[k] = sc;
			mc[k] += ms[k];
			if (mc[k] < -INFTY) mc[k] = -INFTY;

			dc[k] = dc[k-1] + tpdd[k-1];
			if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
			if (dc[k] < -INFTY) dc[k] = -INFTY;

			if (k < (*no_elements_Y_device)) {
				ic[k] = mpp[k] + tpmi[k];
				if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
				ic[k] += is[k];
				if (ic[k] < -INFTY) ic[k] = -INFTY;
			}

			__syncthreads();

			if(threadIdx.x == 1)
				atomicAdd(lock, 1);

			while(count != *lock);

		}
	}
}

I think it is wiser to find another way to implement your algorithm, because performance will suffer anyway when using only the number of blocks that the GPU can run at a single time.

That being said, maybe a __syncthreads() before the while might help?

Thanks Denis.

Now that I can actually read the code, it appears you are calling __syncthreads() from inside a branching block. This is a no-no. Think about what happens if some of the threads take the branch and some don’t. I’m pretty sure this is mentioned somewhere in the programming guide.

This applies to both if statements and loops if all threads in a warp don’t execute the same number of iterations.

This may or may not be causing the specific problem you mention, but it is definitely a problem.
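To make the point concrete (the threadIdx.x < 16 condition here is just a made-up example, not from your code):

```cuda
// BAD: the condition differs across threads of the block, so the threads
// that skip the branch never reach this barrier, and the ones inside it
// wait forever -> hang / undefined behavior.
if (threadIdx.x < 16) {
    /* ... work ... */
    __syncthreads();
}

// OK: the branch may diverge, but every thread in the block reaches the
// same barrier afterwards.
if (threadIdx.x < 16) {
    /* ... work ... */
}
__syncthreads();
```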

I see. Is there any chance I can use locks (like I’m doing for the blocks) to synchronize threads inside a loop, instead of using __syncthreads()?

Thanks a lot.

I simply do not understand why it is still getting stuck (it’s not freezing my computer anymore); it simply stops working.

I changed the code to this:

	int tx = (blockDim.x * blockIdx.x) + threadIdx.x;

	int k = 1;
	int count = 0;

	NOT_DONE:
	
	count = count + (gridDim.x * blockDim.x);

	atomicAdd(lock, 1);

	while(count != *lock);
	
	k++;
	if(k <= 3) goto NOT_DONE;

Now basically all threads should stop at while(count != *lock); and wait until every thread has added one to make up the total number of threads, so each thread knows it can continue execution…

If I remove the looping part, the code runs well… but it is only when I try to loop it that the system basically crashes…

Can it be that a loop is performed atomically on each thread?

Does anyone have any idea about the inner workings of thread scheduling that would make my technique not work…

The funny thing is that if I cut and paste the code like the following, it works, but if I loop it 3 times instead, it won’t:

	count = count + (gridDim.x * blockDim.x);

	atomicAdd(lock, 1);

	while(count != *lock);

	count = count + (gridDim.x * blockDim.x);

	atomicAdd(lock, 1);

	while(count != *lock);

	count = count + (gridDim.x * blockDim.x);

	atomicAdd(lock, 1);

	while(count != *lock);

My code makes sense, doesn’t it?

(By the way, I don’t see any special option to post code; how do you post it?)

The CUDA programming guide has plenty of details on the threading model: warps of 32 threads are executed all in one instruction and blocks of many warps are executed by interleaving memory reads and execution.

Blocks are completely independent and no inter-block synchronization can be done except by allowing the kernel to finish. You might search the forums for many other unsuccessful attempts to synchronize blocks. In this light, your code doesn’t make any sense. It will never run efficiently on the data-parallel GPU architecture.

Anyways, to solve your specific crashing issue, you may want to check the ptx assembly output of nvcc. It may be reading *lock once from global memory and caching that value in a register, thus it never checks for a change in the value of *lock in the loop and loops forever. The volatile keyword might work here, but I’m not sure.
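What I mean is something along these lines (a sketch; either form should force a fresh load of *lock on every iteration of the spin loop instead of spinning on a stale register value):

```cuda
// Option 1: cast to volatile at the point of the read, so the compiler
// cannot cache *lock in a register across iterations of the spin loop.
while (count != *(volatile int *)lock)
    ;

// Option 2: declare the kernel parameter itself volatile, e.g.
// __global__ void kernel(..., volatile int *lock) { ... }
```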

Errant pointers (walking off the edge of the PDC etc.) on my dual-G92 box cause very odd behavior. Sometimes a reboot, sometimes a hang (GPU hang, no mouse movement). The oddest one: every time I ran the CUDA executable, the audio-jack auto-insert detection dialog box popped up. 100% repeatable. Odd stuff, but always related to pointers walking off into the weeds.

“In this light, your code doesn’t make any sense. It will never run efficiently on the data-parallel GPU architecture.”

Then I’m ****ed. My prof wants me to somehow synchronize the blocks and extract some performance out of it. hehe :’(

What you said about the register makes sense. I added the volatile int type with no luck.

any other idea…

thanks a lot, by the way

I don’t know how often people use this, but it usually helps me when my program is having some mysterious errors. This line should go into the code before and after calling your kernel (if you can see something before the computer resets):

printf( "%s\n", cudaGetErrorString( cudaGetLastError() ) );
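One caveat with that (as far as I know, kernel launches are asynchronous): to catch errors from the kernel itself, synchronize before reading the error. Something like this (the launch line is just a placeholder):

```cuda
my_kernel<<<grid, threads>>>(/* args */);   // placeholder launch
cudaThreadSynchronize();                    // wait, so kernel errors surface
printf("%s\n", cudaGetErrorString(cudaGetLastError()));
```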

As for posting code in the forum: on the reply page, just click the CODE square, and something like “[ CODE ]” will appear in your reply area. Code should be enclosed between the two tags [ CODE ] [ /CODE ] (without the spaces inside the square brackets).

Well, it means your prof does not understand how it works, otherwise he would not ask ;)

When replying to a post, you see above a light blue bar some things like [B], [I], FONT, SIZE and COLOR.

Below that line you see buttons to insert links, images, quotes and CODE :)

The data-parallel architecture of the GPU is built to process large chunks of data completely independently. By large, I mean 10,000 independent threads large. Just imagine the kinds of stalls you add to the pipeline by adding any sync operation whatsoever… Not to mention that you can’t add syncs, because blocks run to completion before freeing up resources, so a block could be waiting on another block that has yet to run.

Anyways, your code is a little dense, so it is hard to understand exactly what is going on. But I do see that you have order-of-operation dependencies like mc[k] = mpp[k-1] + tpmm[k-1];. These types of operations seem inherently non-parallel (thus the need for your block syncs), but they actually can be parallelized nicely on the GPU. Check out the scan example in the SDK; it performs a simpler operation than you have here, but perhaps the scan algorithm can be adapted to your needs. And read the PDF whitepaper on the scan; it includes all the theory behind it.

thanks…

In that example they do have a __syncthreads() inside a loop:

for (int d = n/2; d > 0; d >>= 1)
{
    __syncthreads();
    /* ... */
}
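I guess the difference is that the loop bounds there depend only on values that are identical for every thread in the block, so all threads execute the same number of iterations and reach each __syncthreads() together. Roughly like this (a simplified sketch in the style of the SDK reduction, with made-up names, not the actual SDK code):

```cuda
// Hypothetical block-wide sum; assumes one block and blockDim.x elements,
// with blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void block_sum(float *data)
{
    extern __shared__ float s[];
    s[threadIdx.x] = data[threadIdx.x];
    __syncthreads();

    // d depends only on blockDim.x, which is the same for every thread,
    // so every thread runs this loop the same number of times.
    for (int d = blockDim.x / 2; d > 0; d >>= 1) {
        if (threadIdx.x < d)                  // divergent branch is fine...
            s[threadIdx.x] += s[threadIdx.x + d];
        __syncthreads();                      // ...because the barrier is outside it
    }

    if (threadIdx.x == 0)
        data[0] = s[0];
}
```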

Just to point out that, yes, in fact making *lock volatile made a difference.

Assembly with lock as a non-volatile int*:

	setp.eq.s32 $p1, $r5, $r8;    // LOCK == COUNT ?
	@$p1 bra $Lt_2_12;            // if LOCK == COUNT, jump to Lt_2_12
$Lt_2_14:
	bra.uni $Lt_2_14;             // the while(LOCK != COUNT) loop

…

Code with lock as a volatile int*:

…

$Lt_2_14:
	ld.global.s32 $r9, [$r2+0];   // reload LOCK from global memory
	setp.ne.s32 $p2, $r9, $r6;    // LOCK != COUNT ?
	@$p2 bra $Lt_2_14;            // if LOCK != COUNT, jump back to Lt_2_14: the while loop

So in fact, without the volatile it never refreshes *lock from memory and would be stuck there forever…

Still, it loops forever though :argh: