CUDA parallelization fail..?

Hi!

I’m currently working on a program which exhibits some kind of weird race-condition behaviour.

First of all, what happens: the program runs fine, but randomly halts execution, waits for a little while and then crashes, since the device memory has been completely corrupted. Sometimes, however, it runs through and delivers the correct result.

So my guess was: race condition. I wrote a sequential version of the kernel in question, which ran fine. Then I modified the working (sequential) kernel step by step, so that it gradually became the same code as the parallel version.

Finally, I reached the state where I am now:

I have two versions of my kernel, one running sequentially and one running in parallel. They have the exact same input and the same execution configuration, and the rest of the program is also identical for both.

The sequential version is working, the parallel version is not.

The actual problem now is this: the kernel is essentially just two for loops, which should run in parallel without any problems. Every thread block works with one input space and one output space.

These spaces are separated from each other, and there is no communication at all.
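Just to illustrate what I mean by separated spaces, here is a toy sketch (in, out and SPACE_SIZE are made up for this illustration, not my real data layout): every block only ever reads and writes its own slice.

 #define SPACE_SIZE 256   // made-up size of one block's private region

 // Toy example of the intended partitioning: block b works exclusively on
 // in[b*SPACE_SIZE .. (b+1)*SPACE_SIZE-1] and the matching output slice,
 // so no two blocks ever touch the same memory.
 __global__ void per_block_copy(const int* in, int* out)
 {
   const int offset = blockIdx.x * SPACE_SIZE;
   for(int j=threadIdx.x; j<SPACE_SIZE; j+=blockDim.x)
     out[offset + j] = in[offset + j];
 }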

Here is the code of the sequential version:

 if(blockIdx.x==0 && threadIdx.x == 0){
   *device_formulas = scan_predicates[gridDim.x-1] + scan_result[gridDim.x-1];
   for(int i=0; i<gridDim.x; i++){
     if(scan_predicates[i] == 1){
       for(int j=0; j<CLAUSES_ABS(FORMULA(&formulas_out,i)); j++)
         FORMULA(&formulas_in,scan_result[i])[j] = FORMULA(&formulas_out,i)[j];
       for(int j=0; j<=VARIABLES(FORMULA(&formulas_out,i)); j++)
         ASSIGNMENT(&assign_in,scan_result[i])[j] = ASSIGNMENT(&assign_out,i)[j];
     }
   }
 }

And here is the parallel version:

 if(threadIdx.x==0){
   *device_formulas = scan_predicates[gridDim.x-1] + scan_result[gridDim.x-1];
   if(scan_predicates[blockIdx.x] == 1){
     for(int j=0; j<CLAUSES_ABS(FORMULA(&formulas_out,blockIdx.x)); j++)
       FORMULA(&formulas_in,scan_result[blockIdx.x])[j] = FORMULA(&formulas_out,blockIdx.x)[j];
     for(int j=0; j<=VARIABLES(FORMULA(&formulas_out,blockIdx.x)); j++)
       ASSIGNMENT(&assign_in,scan_result[blockIdx.x])[j] = ASSIGNMENT(&assign_out,blockIdx.x)[j];
   }
 }

Now I know that the parallel version is far from ideal (only one thread per block is used, etc.), but I wanted to stay as close to the working/sequential version as possible.

The only real difference between the kernels (that I can see) is the parallelism, which obviously should work?

If I start the program in debug mode, it crashes with “the launch timed out and was terminated.”, while in release mode it ends up in an infinite loop (which I can catch and exit).

The reason for the loop seems to be corrupted memory, or the lack of any work being done at all (which is confirmed by the ‘execution failed’ message), but I can’t figure out why this should happen.

Some additional information: the whole program consists of two calls to kernels I wrote myself, with one cudpp scan operation (which amounts to a third kernel call) in between. These calls are inside a loop which runs until the program exits. The scan results are used in both versions of the kernel and seem to be correct.
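For reference, this is roughly the kind of per-launch checking behind the ‘execution failed’ message mentioned above (check_launch is just a helper name made up for this post, and the kernel in the usage comment is a placeholder):

 #include <cstdio>
 #include <cuda_runtime.h>

 // Catches both launch/configuration errors and errors that only surface
 // once the kernel has actually run to completion (e.g. the launch timeout).
 static void check_launch(const char* name)
 {
   cudaError_t err = cudaGetLastError();
   if(err == cudaSuccess)
     err = cudaThreadSynchronize();
   if(err != cudaSuccess)
     printf("%s failed: %s\n", name, cudaGetErrorString(err));
 }

 // Used after every launch in the main loop (and after the cudpp scan):
 //   my_kernel<<<num_blocks, THREADS_PER_BLOCK>>>(...);
 //   check_launch("my_kernel");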

If someone has an idea what could cause this, any help will be greatly appreciated!

*edit:

I forgot to mention: in emudebug mode, both versions run fine.

bump

New and exciting developments:

After several attempts at solving my problem, I have now split the parallel kernel into three separate parallel kernels. Each of them processes only one step of the parallel kernel quoted above, and they are launched one right after the other.

Kernel 1:

if(threadIdx.x==0){
  if(blockIdx.x==0)
    *device_formulas = scan_predicates[gridDim.x-1] + scan_result[gridDim.x-1];
}
__syncthreads();

Kernel 2:

if(threadIdx.x==0){
  if(scan_predicates[blockIdx.x] == 1){
    for(int j=0; j<variables+1; j++)
      ASSIGNMENT(&assign_in,scan_result[blockIdx.x])[j] = ASSIGNMENT(&assign_out,blockIdx.x)[j];
  }
}
__syncthreads();

Kernel 3:

if(threadIdx.x==0){
  if(scan_predicates[blockIdx.x] == 1){
    for(int j=0; j<clauses; j++)
      FORMULA(&formulas_in,scan_result[blockIdx.x])[j] = FORMULA(&formulas_out,blockIdx.x)[j];
  }
}
__syncthreads();

And now funny things are starting to happen:

The error still lurks somewhere, but appears definitely less often. Before, the program failed at least every second try. Now I have to wait up to five runs until the execution fails; the other four times the correct result is returned.

Since all three steps are independent of one another, I can shuffle them around. For example, calling kernel1-kernel2-kernel3 or calling kernel3-kernel1-kernel2 should display the same behavior. It doesn’t.

What is “consistent” though: kernel 1 never fails. It’s always the kernels with the for loops that run into the timeout. And depending on the execution order, it’s always the second kernel which fails. Well … ‘always’ would be nice. It’s rather ‘mostly’.

I also tried modifying the .cu file where the kernel sources live. After placing the kernel code at the top of the file (rather than at the end), the program also seemed to fail less often. Funny stuff.

The .cubin says that around 19 registers are used, and all three kernels are called with the exact same parameters and execution configuration.

Any help is still appreciated …

Round and round the logic goes, where it stops, no one knows …

Though this thread doesn’t exactly represent a thriving discussion, I’m still going to post new methods for CUDA debugging and their results. Maybe someone will be able to get more use out of these ideas than I’m currently doing.

Anyways. Since the sequential kernel was working and the parallelized version was not (see post #1), I tried to invert the parallelization:

The first attempt used N blocks but only a single thread per block. So I wrote a second parallel kernel, which now uses a single block with M threads:

if(blockIdx.x == 0){
  for(int block=0; block<gridDim.x; block++){
    if(scan_predicates[block] == 1){
      int* to   = (int*)(ASSIGNMENT(&assign_in,scan_result[block]));
      int* from = (int*)(ASSIGNMENT(&assign_out,block));

      for(int i=threadIdx.x; i<variables+1; i+=THREADS_PER_BLOCK)
        to[i] = from[i];

      to   = (int*)(FORMULA(&formulas_in,scan_result[block]));
      from = (int*)(FORMULA(&formulas_out,block));

      for(int i=threadIdx.x; i<clauses*3; i+=THREADS_PER_BLOCK)
        to[i] = from[i];
    }
  }

  __syncthreads();
  if(threadIdx.x == 0)
    *device_formulas = scan_predicates[gridDim.x-1] + scan_result[gridDim.x-1];
  __syncthreads();
}

You may also notice that this kernel implements coalesced memory access. The other posted kernels did not, but I changed them accordingly and it didn’t help much, except of course that the program crashed faster. Whee.
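For clarity, the coalesced pattern I mean is just the strided copy loop from the kernel above, where consecutive threads read and write consecutive addresses; as a standalone sketch (the function name is made up, THREADS_PER_BLOCK is the same constant as above):

 // Coalesced copy: in each iteration, threads 0..THREADS_PER_BLOCK-1 access
 // THREADS_PER_BLOCK consecutive ints, so the accesses of a half-warp fall
 // into as few memory transactions as possible.
 __device__ void copy_coalesced(int* to, const int* from, int n)
 {
   for(int i=threadIdx.x; i<n; i+=THREADS_PER_BLOCK)
     to[i] = from[i];
 }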

Now the actual result: while the first parallel kernel (see post #1) crashed regularly, this one amazingly does not. At least so far (meaning: after 25 runs and one run on a significantly harder problem, the program always delivered the correct result).

So the next question would be: why does parallel kernel #2 work, while parallel kernel #1 only works sometimes?