Race condition causing cudaThreadSynchronize() to hang

I have a kernel whose threads span multiple blocks.
The kernel does a whole bunch of calculations, and there is a while loop in there testing whether a generated value is -1.

The problem is that when the kernel ends and execution reaches my cudaThreadSynchronize() call (or in fact any other CUDA function), the whole application freezes (I can't even Ctrl-C out).

My guess is that the kernel has race conditions in it, where some of the threads get locked in an infinite loop (or something) within the kernel. But this is just a guess.

I wish I could put some __syncthreads() statements in the kernel, but because I'm using threads across different blocks, __syncthreads() wouldn't work (I think!).

Is there some way of terminating threads stuck in an infinite loop inside a kernel? Could something else be happening that I haven't thought of? I'm only guessing that an infinite loop is happening via race conditions; I may be way off.

Summary: cudaThreadSynchronize() hangs after I execute a kernel. Can anyone guess why?

Would it help if I posted my kernel code?

That could definitely be the case.


That would help a lot.

By the way, I had no idea Sydney University is using GPGPU. I’m currently visiting UNSW, and there is no one here with experience in this. Could you tell me something about your project? Perhaps I can help?

I am the only one (that I know of). It's for my thesis.

The aim of the project is to optimise the simplex algorithm as much as possible using a GPGPU.

The algorithm is basically this: get the pivot column -> get the pivot row -> perform base transformations -> repeat until all the values in the bottom row are non-negative (signalled by the pivot column being set to -1).

I initially had the loop in host code and called the base transformation as a kernel, but I changed this because it was too slow (I had to keep copying the pivot-column variable back to host memory so that I could evaluate whether another iteration was needed).

So instead I made one kernel that did everything (finding the pivot column, finding the pivot row, and performing the base transformation).

But now I have the problem of race conditions, where one thread races ahead and changes the value of the pivot column/pivot, making another thread work on the incorrect set of data.

The functions findPivotColumn and findPivot are device functions which store their results in the pointers passed in through the arguments.

This is the code for the algorithm; it works as long as you are only using one block. I know, I know, __syncthreads() everywhere (but it races if they are not put exactly where they are at the moment).

__global__ void d_p(float *matrix, int *g_neq, int *g_ncons, int *g_pivotColumn, int *g_pivot, size_t *g_pitch)
{
        /* Thread indices */
        int x = threadIdx.x;
        int y = threadIdx.y;
        int z = threadIdx.z;

        /* Block dimensions (how many threads per block) */
        int bdx = blockDim.x;
        int bdy = blockDim.y;
        int bdz = blockDim.z;

        /* Block indices */
        int bx = blockIdx.x;
        int by = blockIdx.y;

        /* Grid dimension (how many blocks) */
        int gdx = gridDim.x;

        /* Declare variables */
        int tid;
        int eq;
        float pivotFactor;
        float rowFactor;
        float *pivotRow;
        float *eqRow;

        /* Sequential index for threads spanning multiple blocks */
        tid = by*gdx*(bdx*bdy*bdz) + bx*(bdx*bdy*bdz) + z*(bdx*bdy) + bdx*y + x;

        /* Find first pivot column (device function, stores the result
           in *g_pivotColumn; argument list assumed) */
        findPivotColumn(matrix, g_neq, g_ncons, g_pivotColumn, g_pitch);
        __syncthreads();

        /* While the bottom row of the function to be optimised holds negative values */
        while (*g_pivotColumn != -1) {

                /* Find pivot row (device function, stores the result
                   in *g_pivot; argument list assumed) */
                findPivot(matrix, g_neq, g_ncons, g_pivotColumn, g_pivot, g_pitch);
                __syncthreads();

                /* Pivot row (actual) */
                pivotRow = (float*)((char*)matrix + *g_pivot * *g_pitch);

                /* Value for pivot factor */
                pivotFactor = pivotRow[*g_pivotColumn];

                /* Loop through rows */
                for (eq = 0; eq <= *g_ncons; eq++) {

                        /* Row we are working with */
                        eqRow = (float*)((char*)matrix + eq * *g_pitch);

                        /* Don't process the pivot row */
                        if (eq != *g_pivot) {

                                rowFactor = eqRow[*g_pivotColumn];

                                /* NEED to syncthreads here or it doesn't work! */
                                __syncthreads();

                                /* Only use threads within the dimensions of the matrix (bound assumed) */
                                if (tid <= *g_neq)
                                        eqRow[tid] = eqRow[tid] - (rowFactor/pivotFactor)*pivotRow[tid];
                        }
                }
                __syncthreads();

                /* Find the next pivot column for the next iteration */
                findPivotColumn(matrix, g_neq, g_ncons, g_pivotColumn, g_pitch);
                __syncthreads();
        }
}
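Two pieces of the kernel's index arithmetic can be sanity-checked on the CPU (a sketch; the sizes and function names below are arbitrary, and the pitch value just stands in for whatever cudaMallocPitch returned): the tid flattening formula should assign every thread a unique sequential index, and the pitched row addressing should land on the right element.

```c
#include <string.h>

/* Enumerate every (block, thread) index combination and check that
   tid = by*gdx*(bdx*bdy*bdz) + bx*(bdx*bdy*bdz) + z*(bdx*bdy) + bdx*y + x
   hits each value in [0, total) exactly once. */
int tid_is_bijective(int gdx, int gdy, int bdx, int bdy, int bdz) {
    int total = gdx * gdy * bdx * bdy * bdz;
    int seen[1024];                       /* assumes total <= 1024 */
    memset(seen, 0, sizeof seen);
    for (int by = 0; by < gdy; by++)
        for (int bx = 0; bx < gdx; bx++)
            for (int z = 0; z < bdz; z++)
                for (int y = 0; y < bdy; y++)
                    for (int x = 0; x < bdx; x++) {
                        int tid = by*gdx*(bdx*bdy*bdz) + bx*(bdx*bdy*bdz)
                                + z*(bdx*bdy) + bdx*y + x;
                        if (tid < 0 || tid >= total || seen[tid])
                            return 0;     /* out of range or duplicate */
                        seen[tid] = 1;
                    }
    return 1;
}

/* Pitched row addressing as in the kernel: pitch is in BYTES, which
   is why the row pointer goes through a char* cast. */
float pitched_read(const void *base, size_t pitch, int row, int col) {
    const float *r = (const float *)((const char *)base + row * pitch);
    return r[col];
}
```

If the flattening formula were wrong (e.g. mixing up bdx and bdy), two threads would collide on one tid and the check would fail.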


Perhaps if I implemented my own syncthreads function it would work over multiple blocks (as long as the thread index (tid) was sequential).

Can anyone see anything wrong with this attempt at a syncthreads function?

int count;
int ready;

ready = 0;

while (!ready && tid != 0);

if (tid == 0) {
        /* ... */
        while (count != no_threads)
                ready = 1;
}

The only thing you know is that a warp contains the threads with consecutive numbers. However, there is no way of knowing which warp will be executed when, let alone which block.

It is best to avoid these busy-waiting constructs and leave synchronisation to the hardware (using __syncthreads() and the like).

For the rest, you should perhaps reconsider parts of your algorithm. The other code has similar constructions. It is, for example, very likely that *g_pivotColumn will never be set to -1 as far as the threads running first can see. The same goes for a __syncthreads() inside an if/while: if not all threads of the block reach it, the behaviour is undefined.

Forum thread of the month, edition #11,000. In other words, this has been discussed to death. You cannot sync among multiple blocks in a general purpose, scalable, efficient, or usable way. Proof by example: Imagine you have 17 blocks: blocks 2-17 depend on a value generated by block 1. Assume that only 16 blocks can run concurrently on the device. Sometimes when you run your code, blocks 2-17 will launch first and will deadlock while waiting on a value from block 1 which cannot possibly be set because block 1 isn’t running!
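That proof by example can be written down as a toy scheduler model (purely illustrative; `deadlocks` and its parameters are made up for this sketch). Residents busy-wait and never yield, so progress is possible only if the producer block is among the blocks that fit on the device:

```c
/* Toy model of the 17-blocks argument: only k blocks are resident at
   once, residents spin forever on a value that only `producer` writes,
   and a spinning block never gives up its slot. The launch deadlocks
   exactly when the producer is not among the first k resident blocks. */
int deadlocks(const int *residency_order, int n, int k, int producer) {
    int limit = (k < n) ? k : n;
    for (int i = 0; i < limit; i++)
        if (residency_order[i] == producer)
            return 0;   /* producer runs, writes its value, others unblock */
    return 1;           /* every slot is occupied by a spinning consumer */
}
```

With 17 blocks, a 16-block device, and block 1 as the producer, any residency order that schedules block 1 last deadlocks, while scheduling it first does not; since the order is not under your control, the launch only sometimes hangs.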

Despite the impossibility, the limitations, and the complete loss of their sanity, many have still tried. Search the forums and you may find some hints from those who have taken this road (here is one hint: the compiler is smart enough to notice many reads from the same global memory location and cache the value in a register; volatile can help).