syncthreads error?

casybaby · May 18, 2008, 1:19am

Hi hi,

  I encountered a __syncthreads() error, and I am wondering when and where I should use it.

  In my device code, I have 4 blocks, (0,0), (0,1), (1,0), (1,1), each with 8*8 threads. Each thread first reads data from global memory into its block's shared memory. I was able to print out info until execution of the 1st block (0,0) finished. Then the program just stops there and does nothing. I suspect that I am doing something wrong with __syncthreads().


  Could anyone tell me when and where I should use __syncthreads()?

  Appreciate!!!

bashflyng · May 18, 2008, 10:38am

Please, add some code so we can tell whether your use of syncthreads is wrong or what is happening.

As to how to use __syncthreads(), it’s explained in the programming guide. The rest will come from your experience developing multi-threaded applications.

Neeraj · May 18, 2008, 1:58pm

If you put __syncthreads() inside a conditional block that not every thread of the block enters, the results are weird (the kernel can hang or produce wrong values). Also check that you are not writing to unallocated global memory in the device code. That locked up my machine twice! I hope this helps.
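To make the point about conditional barriers concrete, here is a minimal sketch. It is not the original poster's kernel: the kernel names, `TILE`, and the host helper `runNeighbourSum` are made up for illustration. The first kernel puts the barrier inside a bounds check, so in a partially filled last block some threads never reach it. The second keeps the barrier unconditional and guards only the memory accesses.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define TILE 64  // threads per block in this sketch (assumed value)

// Risky, shown for contrast only: in a partially filled last block, threads
// with i >= n skip the barrier, which is undefined behaviour (may hang).
__global__ void neighbourSumRisky(const int *in, int *out, int n) {
    __shared__ int tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];
        __syncthreads();  // only reached by some threads of the block
        int left = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : 0;
        out[i] = tile[threadIdx.x] + left;
    }
}

// Safer: every thread of the block reaches the barrier.
__global__ void neighbourSumSafe(const int *in, int *out, int n) {
    __shared__ int tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0;  // guard the load, not the barrier
    __syncthreads();
    if (i < n) {
        int left = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : 0;
        out[i] = tile[threadIdx.x] + left;
    }
}

// Hypothetical host helper: runs the safe kernel and returns the output.
std::vector<int> runNeighbourSum(const std::vector<int> &hIn) {
    int n = (int)hIn.size();
    assert(n > 0);
    std::vector<int> hOut(n);
    int *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    int blocks = (n + TILE - 1) / TILE;
    neighbourSumSafe<<<blocks, TILE>>>(dIn, dOut, n);
    cudaMemcpy(hOut.data(), dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return hOut;
}
```

Each output element is the input plus its left neighbour within the same block, so the first thread of every block has no neighbour to add.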

Cheers,

Neeraj

casybaby · May 18, 2008, 3:43pm

Here’s part of the code:

   

__global__ void SWKernel(char *g1, char *g2, char *g3,
                         int *gMaxValue12, int *gMaxValue13, int *gMaxValue23,
                         int seq1Size, int seq2Size, int seq3Size,
                         int *distance12, int *trace12,
                         int *distance13, int *trace13,
                         int *distance23, int *trace23){
    int diagonalId,dig,l,u,maxValue,tmp,tmpDist,tmpMaxI,tmpMaxJ,tmpDir,x,y,lx,ly,tx,ty,bx,by,i,caseId;

    tmp = tmpDist = tmpMaxI = tmpMaxJ = 0;
    tmpDir = STOP;
    dig = l = u = -1;

    tx = threadIdx.x;
    ty = threadIdx.y;
    bx = blockIdx.x;
    by = blockIdx.y;

    lx = x = bx*blockDim.x+tx;
    ly = y = by*blockDim.y+ty;

    caseId = 0;
    diagonalId = x+y;

    if(x<seq1Size && y<seq2Size){
        caseId = 1;
    }else if(x>=seq1Size && x<(seq1Size+seq3Size) && y<seq2Size){
        caseId = 2;
        lx = x-seq1Size;
    }else if(x<seq1Size && y>=seq2Size && y<(seq2Size+seq3Size)){
        caseId = 3;
        ly = y-seq2Size;
    }

    printf("tx~%d ty~%d bx~%d by~%d x~%d y~%d diagonalId~%d caseId~%d lx~%d ly~%d\n",tx,ty,bx,by,x,y,diagonalId,caseId,lx,ly);

    __shared__ char sgX[BLOCK_SIZE];
    __shared__ char sgY[BLOCK_SIZE];

    if(y==0 && x<seq1Size){
        sgX[tx] = g1[x];
    }else if(x==0 && y<seq2Size){
        sgY[ty] = g2[y];
    }else if(y==0 && x>=seq1Size && x<(seq1Size+seq3Size)){
        sgX[tx] = g3[x-seq1Size];
    }else if(x==seq1Size && y<seq2Size){
        sgY[ty] = g2[y];
    }else if(y==seq2Size && x<seq1Size){
        sgX[tx] = g1[x];
    }else if(x==0 && y>=seq2Size && y<(seq2Size+seq3Size)){
        sgY[ty] = g3[y-seq2Size];
    }

    __syncthreads();

    ......
}
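One thing worth checking in the loads above, offered as an observation rather than a confirmed diagnosis: `__shared__` arrays are private to each block, and conditions such as `y==0` or `x==0` test global coordinates, so they can only be true in the first block row or column. Other blocks may reach the barrier with `sgX` and `sgY` mostly unwritten. A minimal sketch of a per-block load, simplified to two sequences and keyed on `threadIdx` instead, might look like this (the kernel `loadTileEdges`, the helper `matchMatrix`, and the value of `BLOCK_SIZE` are assumptions for illustration):

```cuda
#include <cassert>
#include <string>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 8  // assumed to match the 8*8 blocks mentioned in the thread

// Each block loads its own slice of both sequences, then compares characters.
__global__ void loadTileEdges(const char *seqX, const char *seqY,
                              int sizeX, int sizeY, int *matches) {
    __shared__ char sgX[BLOCK_SIZE];
    __shared__ char sgY[BLOCK_SIZE];
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // The first row of threads in every block loads the X slice,
    // the first column loads the Y slice.
    if (threadIdx.y == 0) sgX[threadIdx.x] = (x < sizeX) ? seqX[x] : 0;
    if (threadIdx.x == 0) sgY[threadIdx.y] = (y < sizeY) ? seqY[y] : 0;
    __syncthreads();  // unconditional: all threads of the block arrive

    if (x < sizeX && y < sizeY)
        matches[y * sizeX + x] = (sgX[threadIdx.x] == sgY[threadIdx.y]) ? 1 : 0;
}

// Hypothetical host helper returning the sizeY x sizeX match matrix.
std::vector<int> matchMatrix(const std::string &sx, const std::string &sy) {
    int nx = (int)sx.size(), ny = (int)sy.size();
    assert(nx > 0 && ny > 0);
    char *dX = nullptr, *dY = nullptr;
    int *dM = nullptr;
    cudaMalloc(&dX, nx);
    cudaMalloc(&dY, ny);
    cudaMalloc(&dM, (size_t)nx * ny * sizeof(int));
    cudaMemcpy(dX, sx.data(), nx, cudaMemcpyHostToDevice);
    cudaMemcpy(dY, sy.data(), ny, cudaMemcpyHostToDevice);
    dim3 block(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid((nx + BLOCK_SIZE - 1) / BLOCK_SIZE, (ny + BLOCK_SIZE - 1) / BLOCK_SIZE);
    loadTileEdges<<<grid, block>>>(dX, dY, nx, ny, dM);
    std::vector<int> m((size_t)nx * ny);
    cudaMemcpy(m.data(), dM, m.size() * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dX);
    cudaFree(dY);
    cudaFree(dM);
    return m;
}
```

With two 10-character sequences this launches a 2x2 grid of 8*8 blocks, the same shape described at the top of the thread.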

eirik · May 18, 2008, 4:37pm

Don’t know if this is the problem, but a printf statement in a kernel does not make much sense, does it?

casybaby · May 18, 2008, 5:42pm

Well, there’s a breakpoint on that line. I was just tracing the index values of each thread.

bashflyng · May 18, 2008, 8:57pm

At first sight I don’t see anything wrong with the code; the usage of __syncthreads() seems correct.

Have you run it in emulation mode? It will very likely report any problematic usage of syncthreads.

casybaby · May 18, 2008, 9:12pm

Yes, I ran it under emu mode. Otherwise I cannot see the print out.

Neeraj · May 19, 2008, 5:00am

Smith-Waterman… Yeah, I built a gene/protein studio that uses the GPU, via CUDA, to align molecular subsequences.

I don’t understand your partitioning strategy, though it does seem to follow the minor diagonal.

Are you using __syncthreads() as a barrier for the previous diagonal to complete?

If your diagonal is distributed across several blocks, then there is no way of synchronizing them with __syncthreads(), which only synchronizes the threads within one block. Also try reducing the conditional code in your kernel; it just drains performance (unless, of course, all the threads in a half-warp take the same path).
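One common way around the lack of a cross-block barrier is to launch one kernel per anti-diagonal: kernels issued to the same stream run in order, so the launch boundary acts as a grid-wide barrier. The sketch below is not the poster's algorithm. It is a simplified two-sequence Smith-Waterman fill with a linear gap penalty, and the scoring constants, the kernel `swDiagonal`, and the helper `swScore` are all assumptions for illustration.

```cuda
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>
#include <cuda_runtime.h>

#define MATCH 2      // assumed scoring scheme for this sketch
#define MISMATCH -1
#define GAP -1

// Fills the cells (i, j) with i + j == d of an (m+1) x (n+1) score table H.
// Every dependency lies on diagonals d-1 and d-2, finished by earlier launches.
__global__ void swDiagonal(const char *a, const char *b, int m, int n,
                           int d, int *H) {
    int iMin = max(1, d - n);
    int iMax = min(m, d - 1);
    int i = iMin + blockIdx.x * blockDim.x + threadIdx.x;
    if (i > iMax) return;  // no barrier follows, so an early return is safe
    int j = d - i;
    int w = n + 1;
    int s = (a[i - 1] == b[j - 1]) ? MATCH : MISMATCH;
    int best = H[(i - 1) * w + (j - 1)] + s;
    best = max(best, H[(i - 1) * w + j] + GAP);
    best = max(best, H[i * w + (j - 1)] + GAP);
    H[i * w + j] = max(best, 0);
}

// Hypothetical host helper: returns the best local alignment score.
int swScore(const std::string &a, const std::string &b) {
    int m = (int)a.size(), n = (int)b.size();
    assert(m > 0 && n > 0);
    size_t cells = (size_t)(m + 1) * (n + 1);
    char *dA = nullptr, *dB = nullptr;
    int *dH = nullptr;
    cudaMalloc(&dA, m);
    cudaMalloc(&dB, n);
    cudaMalloc(&dH, cells * sizeof(int));
    cudaMemcpy(dA, a.data(), m, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(dH, 0, cells * sizeof(int));  // row 0 and column 0 stay zero

    const int threads = 64;
    for (int d = 2; d <= m + n; ++d) {  // one launch per anti-diagonal
        int count = std::min(m, d - 1) - std::max(1, d - n) + 1;
        swDiagonal<<<(count + threads - 1) / threads, threads>>>(dA, dB, m, n, d, dH);
    }

    std::vector<int> H(cells);
    cudaMemcpy(H.data(), dH, cells * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dH);
    return *std::max_element(H.begin(), H.end());
}
```

The trade-off is launch overhead: short diagonals near the corners keep few threads busy, so this is a correctness-first layout rather than a tuned one.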

In any case, I will be releasing the entire Gene Studio App to the public in a few days.

That includes a Visual Gene Editor + CUDA library for alignment algorithms.

Cheers,

Neeraj

bashflyng · May 21, 2008, 12:01am

So I guess it didn’t print any syncthreads usage error and it just stopped working?

In that case my guess is your problem is after the syncthreads.

casybaby · May 22, 2008, 12:18am

Yes, you are right, Smith Waterman.

If I comment out the __syncthreads(), the program stops after it prints out threads (0,0) to (3,1) in block (0,0). Otherwise, it stops after printing out all the threads from block (0,0). There are actually 4 blocks in total.

Then I ran it in debug mode, and I was lucky to get a screen print when it crashed. The error I got is “CUDA error: Kernel execution failed in file ‘xxx’ in line yyy: the launch timed out and was terminated”.
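For context, "the launch timed out and was terminated" is the message for `cudaErrorLaunchTimeout`, which is typically reported when a kernel runs longer than the watchdog on a display-attached GPU allows, for example because of an unbounded loop or a barrier that never completes. A minimal sketch of surfacing such errors right at the launch site follows. The kernel `fillKernel` and the helper `launchAndCheck` are made up for illustration, and `cudaDeviceSynchronize` is the current runtime call (toolkits from 2008 used `cudaThreadSynchronize` for the same purpose).

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Trivial stand-in kernel; any kernel could be checked the same way.
__global__ void fillKernel(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

// Hypothetical helper: launches the kernel and returns the first error seen.
cudaError_t launchAndCheck(int n) {
    assert(n > 0);
    int *d = nullptr;
    cudaError_t err = cudaMalloc(&d, n * sizeof(int));
    if (err != cudaSuccess) return err;

    fillKernel<<<(n + 255) / 256, 256>>>(d, n);
    err = cudaGetLastError();           // bad launch configuration shows up here
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();  // errors raised while the kernel ran

    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    cudaFree(d);
    return err;
}
```

Checking after every launch narrows a failure down to one kernel instead of the next unrelated API call.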

I really have no idea what’s going on.

casybaby · May 24, 2008, 1:49pm

No more posts? I am still looking for your help :'(

Neeraj · May 27, 2008, 4:43pm

Hmmm, it’s difficult to trace the problem. There is no way of testing your kernel as a black box without valid inputs. (Are g1, g2, g3 the 3 sequences being aligned pairwise?)

Anyway try the following,

Do not run your kernel in emulation mode.

Remove the printf statement.

Write all data back to global memory in a single pass.

Check the results from a single block (without __syncthreads(), writing to separate areas whose expected contents you know in advance).

If the kernel still fails, you are most likely writing to unallocated regions of global memory.

Next, reintroduce __syncthreads() for 1 block of N threads (1B, NT).

Check the results and post them here!
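The steps above could be sketched as follows: instead of calling printf in the kernel, each thread writes its index bookkeeping to a bounds-checked slot in global memory, and the host inspects the buffer afterwards. This is only an illustration. The struct `ThreadTrace`, the kernel `traceIndices`, and the helper `runTrace` are hypothetical names, and only the caseId logic is copied from the kernel posted earlier in the thread.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

struct ThreadTrace { int x, y, caseId; };

// Records each thread's global coordinates and caseId instead of printing.
__global__ void traceIndices(ThreadTrace *trace, int width, int height,
                             int seq1Size, int seq2Size, int seq3Size) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int caseId = 0;
    if (x < seq1Size && y < seq2Size) {
        caseId = 1;
    } else if (x >= seq1Size && x < seq1Size + seq3Size && y < seq2Size) {
        caseId = 2;
    } else if (x < seq1Size && y >= seq2Size && y < seq2Size + seq3Size) {
        caseId = 3;
    }
    if (x < width && y < height) {  // never write outside the allocation
        ThreadTrace t = {x, y, caseId};
        trace[y * width + x] = t;
    }
}

// Hypothetical host helper: gridSide x gridSide blocks of blockSide x blockSide
// threads. Use gridSide = 1 to test a single block first.
std::vector<ThreadTrace> runTrace(int gridSide, int blockSide,
                                  int seq1Size, int seq2Size, int seq3Size) {
    assert(gridSide > 0 && blockSide > 0);
    int width = gridSide * blockSide, height = width;
    size_t count = (size_t)width * height;
    ThreadTrace *dTrace = nullptr;
    cudaMalloc(&dTrace, count * sizeof(ThreadTrace));
    dim3 block(blockSide, blockSide);
    dim3 grid(gridSide, gridSide);
    traceIndices<<<grid, block>>>(dTrace, width, height,
                                  seq1Size, seq2Size, seq3Size);
    std::vector<ThreadTrace> h(count);
    cudaMemcpy(h.data(), dTrace, count * sizeof(ThreadTrace),
               cudaMemcpyDeviceToHost);
    cudaFree(dTrace);
    return h;
}
```

Because the buffer is sized from the launch dimensions, every thread has exactly one slot, which makes out-of-range writes easy to rule out.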

Cheers,

Neeraj

casybaby · May 31, 2008, 5:30am

Thanks for your reply, Neeraj.

I have tried running the program in both emu mode and plain debug mode. The result in emu mode is as described previously: after block (0,0) the program stops. In debug mode, the program just crashes with the timeout error. If I remove the __syncthreads() and run in emu mode, I get printouts for less than one block.

:(

Neeraj · May 31, 2008, 1:27pm

Post your complete kernel with valid inputs here.

casybaby · May 31, 2008, 3:41pm

Neeraj, I have sent you a message with my code and input. Thanks!

BvdVeen · June 2, 2008, 12:04pm

Hi Casybaby,

If you’re still having problems with the algorithm, I’ll be glad to help. I could use the expertise :)

And if you got it working already, can you post the performance of your version of the SW algorithm here?

Cheers,
Bernd
