syncthreads error?

Hi hi,

  I encountered a __syncthreads() error, and I am wondering when and where I should use it.

  In my device code, I have 4 blocks, (0,0), (0,1), (1,0) and (1,1), each with 8*8 threads. Each thread first reads data from global memory into shared memory. I was able to print out info until execution of the first block (0,0) finished. Then the program just stops there and does nothing. I suspect that I am doing something wrong with __syncthreads().

  Could anyone tell me when and where I should use __syncthreads()?


Please add some code so we can tell whether your use of __syncthreads() is wrong or what else is happening.

As to how to use __syncthreads(), it’s explained in the programming guide. The rest will come from your experience developing multi-threaded applications.

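If you write __syncthreads() in a conditional block that only some threads of the block execute, the results are weird: the programming guide only allows it when the condition evaluates identically for every thread in the block, otherwise the kernel is likely to hang. You should also check that you are not writing to unallocated global memory in the device code. It locked up my machine twice! I hope this helps.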
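As a minimal sketch of the conditional-barrier point (this is not the original poster's kernel; `TILE`, `stageAndReverse` and `runStageAndReverse` are names made up for the illustration): the load into shared memory sits inside a bounds check, while `__syncthreads()` sits outside it so that every thread of the block reaches the barrier. The sketch assumes the kernel is launched with exactly `TILE` threads per block.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

#define TILE 64  // hypothetical block size; the launch below uses the same value

// Each block reverses its own TILE-sized chunk of `in` into `out`.
__global__ void stageAndReverse(const int *in, int *out, int n)
{
    __shared__ int tile[TILE];
    int base = blockIdx.x * blockDim.x;
    int gid  = base + threadIdx.x;

    // Conditional load: only in-range threads touch global memory.
    if (gid < n) {
        tile[threadIdx.x] = in[gid];
    }

    // The barrier is OUTSIDE the `if`. Putting it inside would mean the
    // out-of-range threads of the last block never reach it.
    __syncthreads();

    int valid = min(TILE, n - base);  // valid elements in this block's chunk
    if (threadIdx.x < valid) {
        out[gid] = tile[valid - 1 - threadIdx.x];
    }
}

// Host helper: returns true if every chunk came back reversed.
bool runStageAndReverse(int n)
{
    std::vector<int> h_in(n), h_out(n, -1);
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int blocks = (n + TILE - 1) / TILE;
    stageAndReverse<<<blocks, TILE>>>(d_in, d_out, n);

    // This copy waits for the kernel and reports any execution error.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        int base  = (i / TILE) * TILE;
        int valid = std::min(TILE, n - base);
        if (h_out[i] != h_in[base + valid - 1 - (i - base)]) return false;
    }
    return true;
}
```

Calling `runStageAndReverse(100)` from `main` exercises a partially filled last block, which is exactly the case where a barrier inside the `if` would go wrong.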



Here’s part of the code:


__global__ void SWKernel(char *g1, char *g2, char *g3,
                         int *gMaxValue12, int *gMaxValue13, int *gMaxValue23,
                         int seq1Size, int seq2Size, int seq3Size,
                         int *distance12, int *trace12,
                         int *distance13, int *trace13,
                         int *distance23, int *trace23){

    int diagonalId,dig,l,u,maxValue,tmp,tmpDist,tmpMaxI,tmpMaxJ,tmpDir,x,y,lx,ly,tx,ty,bx,by,i,caseId;

    tmp = tmpDist = tmpMaxI = tmpMaxJ = 0;
    tmpDir = STOP;
    dig = l = u = -1;

    // Thread and block coordinates.
    tx = threadIdx.x;
    ty = threadIdx.y;
    bx = blockIdx.x;
    by = blockIdx.y;

    // Global coordinates; lx/ly start out equal to x/y and are shifted below.
    lx = x = bx*blockDim.x+tx;
    ly = y = by*blockDim.y+ty;
    caseId = 0;
    diagonalId = x+y;

    // Pick the region of the grid this thread falls into.
    if(x<seq1Size && y<seq2Size){
        caseId = 1;
    }else if(x>=seq1Size && x<(seq1Size+seq3Size) && y<seq2Size){
        caseId = 2;
        lx = x-seq1Size;
    }else if(x<seq1Size && y>=seq2Size && y<(seq2Size+seq3Size)){
        caseId = 3;
        ly = y-seq2Size;
    }

    printf("tx~%d ty~%d bx~%d by~%d x~%d y~%d diagonalId~%d caseId~%d lx~%d ly~%d\n",tx,ty,bx,by,x,y,diagonalId,caseId,lx,ly);

    __shared__ char sgX[BLOCK_SIZE];
    __shared__ char sgY[BLOCK_SIZE];

    // Threads on the region borders copy sequence characters into shared memory.
    if(y==0 && x<seq1Size){
        sgX[tx] = g1[x];
    }else if(x==0 && y<seq2Size){
        sgY[ty] = g2[y];
    }else if(y==0 && x>=seq1Size && x<(seq1Size+seq3Size)){
        sgX[tx] = g3[x-seq1Size];
    }else if(x==seq1Size && y<seq2Size){
        sgY[ty] = g2[y];
    }else if(y==seq2Size && x<seq1Size){
        sgX[tx] = g1[x];
    }else if(x==0 && y>=seq2Size && y<(seq2Size+seq3Size)){
        sgY[ty] = g3[y-seq2Size];
    }

    // (only part of the kernel was posted; the rest is not shown here)






Don’t know if this is the problem, but a printf statement in a kernel does not make much sense does it?

Well, there’s a breakpoint in that line. I was just tracing the index value of each thread.

At first sight I don’t see anything wrong with the code, usage of syncthreads seems correct.

Have you run it in emulation mode? It will very likely report any problematic usage of syncthreads.

Yes, I ran it in emu mode. Otherwise I cannot see the printout.

Smith-Waterman… Yeah, I built a gene/protein studio that uses the GPU, via CUDA, for aligning molecular subsequences underneath.

I don’t understand your partitioning strategy, though it does seem to run along the minor diagonal.

Are you using __syncthreads() as a barrier for the previous diagonal to complete?

If your diagonal is distributed across several blocks, then there is no way of synchronizing them with __syncthreads(), because it is a barrier only for the threads within one block. Also try reducing the conditional code in your kernel; it just drains performance (unless, of course, all the branches evaluate to the same path within a half-warp).
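One common way around the missing cross-block barrier is to launch one kernel per anti-diagonal, since kernels issued to the same stream run in order. The following is only a sketch of that idea, not the poster's three-sequence kernel and not my library code: it scores two sequences with a linear gap penalty, and `swDiagonal`, `swScore` and the `MATCH`/`MISMATCH`/`GAP` values are assumptions made up for the example.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical scoring constants for this sketch.
#define MATCH     2
#define MISMATCH (-1)
#define GAP      (-1)

// Fills the cells with i + j == d of the (m+1) x (n+1) score matrix H.
// Row 0 and column 0 stay zero. Every cell on diagonal d depends only on
// diagonals d-1 and d-2, which earlier launches have already completed.
__global__ void swDiagonal(const char *a, const char *b,
                           int m, int n, int d, int *H)
{
    int iStart = max(1, d - n);   // first row on this diagonal
    int iEnd   = min(m, d - 1);   // last row on this diagonal
    int i = iStart + blockIdx.x * blockDim.x + threadIdx.x;
    if (i > iEnd) return;         // no barrier follows, so an early exit is safe

    int j = d - i;
    int w = n + 1;                // row stride of H
    int s = (a[i - 1] == b[j - 1]) ? MATCH : MISMATCH;

    int best = 0;
    best = max(best, H[(i - 1) * w + (j - 1)] + s);
    best = max(best, H[(i - 1) * w + j] + GAP);
    best = max(best, H[i * w + (j - 1)] + GAP);
    H[i * w + j] = best;
}

// Host driver: returns the best local score, or -1 on a CUDA error.
int swScore(const std::string &a, const std::string &b)
{
    int m = (int)a.size(), n = (int)b.size();
    if (m == 0 || n == 0) return 0;
    size_t cells = (size_t)(m + 1) * (n + 1);

    char *dA = nullptr, *dB = nullptr;
    int *dH = nullptr;
    // Allocation failures are not checked one by one here; they surface
    // as an error from the final cudaMemcpy below.
    cudaMalloc(&dA, m);
    cudaMalloc(&dB, n);
    cudaMalloc(&dH, cells * sizeof(int));
    cudaMemcpy(dA, a.data(), m, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(dH, 0, cells * sizeof(int));

    const int threads = 64;
    for (int d = 2; d <= m + n; ++d) {
        int count  = std::min(m, d - 1) - std::max(1, d - n) + 1;
        int blocks = (count + threads - 1) / threads;
        // Launches in the default stream run in order, so each launch
        // acts as a barrier across all blocks of the previous diagonal.
        swDiagonal<<<blocks, threads>>>(dA, dB, m, n, d, dH);
    }

    std::vector<int> H(cells);
    cudaError_t err = cudaMemcpy(H.data(), dH, cells * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dH);
    if (err != cudaSuccess) return -1;
    return *std::max_element(H.begin(), H.end());
}
```

One launch per diagonal costs some launch overhead, so for long sequences a tiled scheme (blocks working on tiles, tiles processed diagonal by diagonal) is usually the next step; the sketch only shows where the synchronization comes from.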

In any case, I will be releasing the entire Gene Studio App to the public in a few days.

That includes a Visual Gene Editor + CUDA library for alignment algorithms.



So I guess it didn’t print any syncthreads usage error and it just stopped working?

In that case my guess is your problem is after the syncthreads.

Yes, you are right, Smith Waterman.

If I comment out the __syncthreads(), the program stops after it prints threads (0,0) to (3,1) in block (0,0). Otherwise, it stops after printing all the threads from block (0,0). There are actually 4 blocks in total.

Then I ran it in debug mode, and I was lucky to get a screen print when it crashed. The error I got is “CUDA error: Kernel execution failed in file ‘xxx’ in line yyy: the launch timed out and was terminated”.
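That message is the runtime's text for `cudaErrorLaunchTimeout`, which is reported when a kernel runs past the display watchdog limit (a hang or an endless loop will do it). A minimal sketch of checking for it directly after a launch follows; `cudaOk`, `fillIndex` and `launchChecked` are hypothetical names for this example, and `cudaDeviceSynchronize` is assumed as the current runtime call for waiting on the device.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: reports a failed CUDA call and returns false.
static bool cudaOk(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Trivial stand-in kernel: each in-range thread stores its global index.
__global__ void fillIndex(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

// Launches the kernel and checks both the launch and the execution.
bool launchChecked(int n)
{
    int *d = nullptr;
    if (!cudaOk(cudaMalloc(&d, n * sizeof(int)), "cudaMalloc")) return false;

    fillIndex<<<(n + 63) / 64, 64>>>(d, n);

    // Launch-time problems (e.g. an invalid configuration) show up here.
    bool ok = cudaOk(cudaGetLastError(), "kernel launch");
    // Execution-time problems (timeouts, bad memory accesses) show up here.
    ok = ok && cudaOk(cudaDeviceSynchronize(), "kernel execution");

    int last = -1;
    ok = ok && cudaOk(cudaMemcpy(&last, d + (n - 1), sizeof(int),
                                 cudaMemcpyDeviceToHost), "cudaMemcpy");
    cudaFree(d);
    return ok && last == n - 1;
}
```

Checking right after the launch narrows a failure down to one kernel instead of whichever later call happens to return the error.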

I really have no idea what’s going on.

No more posts? I am still looking for your help :'(

Hmmm, it’s difficult to trace the problem. There is no way of testing your kernel as a black box without valid inputs. (Are g1, g2 and g3 three sequences to be aligned pairwise?)

Anyway try the following,

Do not run your kernel in emulation mode.

Remove the printf statement.

Write all data back to global memory in a single pass.

Check the results from a single block first (without __syncthreads(), with each thread writing to its own area so you know in advance what to expect).

If the kernel is still failing, then you are most likely writing to unallocated regions of global memory.

Next, reintroduce __syncthreads() for 1 block of N threads (1B, NT).

Check the results and post them here!
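A minimal sketch of the "write to global memory instead of printf" step, using the same 8*8 block shape as in the question: each thread stores its diagonalId (x+y) into its own slot of a trace buffer, behind a bounds check so nothing lands outside the allocation. `traceIndices` and `runTrace` are hypothetical names, and the buffer layout is an assumption for the example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each in-range thread records x + y at its own position in `trace`.
__global__ void traceIndices(int *trace, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Guard: threads outside the width x height area must not write.
    if (x < width && y < height) {
        trace[y * width + x] = x + y;
    }
}

// Host helper: returns true if every slot holds the expected value.
bool runTrace(int width, int height)
{
    size_t bytes = (size_t)width * height * sizeof(int);
    int *d_trace = nullptr;
    if (cudaMalloc(&d_trace, bytes) != cudaSuccess) return false;
    cudaMemset(d_trace, 0xFF, bytes);   // every int starts out as -1

    dim3 block(8, 8);                   // 8*8 threads, as in the question
    dim3 grid((width + 7) / 8, (height + 7) / 8);
    traceIndices<<<grid, block>>>(d_trace, width, height);

    std::vector<int> h_trace((size_t)width * height);
    // This copy waits for the kernel and reports any execution error.
    cudaError_t err = cudaMemcpy(h_trace.data(), d_trace, bytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_trace);
    if (err != cudaSuccess) return false;

    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (h_trace[(size_t)y * width + x] != x + y) return false;
    return true;
}
```

`runTrace(16, 16)` gives the 2*2 grid of blocks from the question; a size such as 13 by 10 checks that the guard keeps the partially filled blocks inside the buffer.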



Thanks for your reply, Neeraj.

I have tried running the program in both emu mode and plain debug mode. The result in emu mode is as described previously: after block (0,0) the program stops. In debug mode, the program just crashes with the timeout error. If I remove the __syncthreads() and run in emu mode, I get printouts for less than one block.


Post your complete kernel with valid inputs here.

Neeraj, I have sent you a message with my code and input. Thanks!

Hi Casybaby,

If you’re still having problems with the algorithm, I’ll be glad to help. I could use the expertise :)

And if you got it working already, can you post the performance of your version of the SW algorithm here?