Hi!
I’m currently working on a program which exhibits what looks like a race condition.
First, the symptoms: the program usually runs fine, but randomly it halts, waits for a little while, and then crashes because the device memory has been completely corrupted. Sometimes, however, it runs to completion and delivers the correct result.
So my guess was: race condition. I wrote a sequential version of the kernel in question, which ran fine. Then I modified the working sequential kernel step by step, so that it became more and more similar to the parallel version.
Finally, I reached the state where I am now:
I have two versions of my kernel, one running sequentially and one running in parallel. They get exactly the same input and the same execution configuration, and the rest of the program is also identical for both.
The sequential version is working, the parallel version is not.
The actual problem is this: the kernel is essentially just two for loops, which should run in parallel without any problems. Every thread block works on its own input space and its own output space.
These spaces are separated from each other, and there is no communication between blocks at all.
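To make precise what I mean by “separated”: every block reads only from its own slice of the output buffers and writes only to its own slice of the input buffers (the destination slot index comes from the scan result, so each surviving block still gets a unique slot). A tiny host-side model of that layout — purely illustrative, SLOT stands in for the real per-formula stride hidden behind the FORMULA/ASSIGNMENT macros in my code:

```cpp
#include <cstddef>

// Illustrative model only: each block owns one fixed-size slot per buffer.
// SLOT is a stand-in for the real per-formula stride used by the
// FORMULA/ASSIGNMENT macros.
const std::size_t SLOT = 64;

// Start/end offset of block b's region within a buffer.
std::size_t region_begin(int b) { return static_cast<std::size_t>(b) * SLOT; }
std::size_t region_end(int b)   { return region_begin(b) + SLOT; }

// Regions of two different blocks never overlap.
bool disjoint(int a, int b) {
    return region_end(a) <= region_begin(b) || region_end(b) <= region_begin(a);
}
```

So as long as no two blocks get the same slot index, no block can step on another block’s data.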
Here is the code of the sequential version:
if (blockIdx.x == 0 && threadIdx.x == 0) {
    *device_formulas = scan_predicates[gridDim.x-1] + scan_result[gridDim.x-1];
    for (int i = 0; i < gridDim.x; i++) {
        if (scan_predicates[i] == 1) {
            for (int j = 0; j < CLAUSES_ABS(FORMULA(&formulas_out,i)); j++)
                FORMULA(&formulas_in,scan_result[i])[j] = FORMULA(&formulas_out,i)[j];
            for (int j = 0; j <= VARIABLES(FORMULA(&formulas_out,i)); j++)
                ASSIGNMENT(&assign_in,scan_result[i])[j] = ASSIGNMENT(&assign_out,i)[j];
        }
    }
}
And here is the parallel version:
if (threadIdx.x == 0) {
    *device_formulas = scan_predicates[gridDim.x-1] + scan_result[gridDim.x-1];
    if (scan_predicates[blockIdx.x] == 1) {
        for (int j = 0; j < CLAUSES_ABS(FORMULA(&formulas_out,blockIdx.x)); j++)
            FORMULA(&formulas_in,scan_result[blockIdx.x])[j] = FORMULA(&formulas_out,blockIdx.x)[j];
        for (int j = 0; j <= VARIABLES(FORMULA(&formulas_out,blockIdx.x)); j++)
            ASSIGNMENT(&assign_in,scan_result[blockIdx.x])[j] = ASSIGNMENT(&assign_out,blockIdx.x)[j];
    }
}
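For what it’s worth, the only store in the parallel version that is not confined to a block-private region is the write to *device_formulas: thread 0 of every block stores the same value there. Purely to make that sharing explicit (I have not verified that it changes anything), the variant where only block 0 performs that store would be:

```cuda
// Same kernel, but the one store touched by all blocks (the write to
// *device_formulas) is restricted to block 0. All blocks store the same
// value there anyway; everything else is unchanged.
if (threadIdx.x == 0) {
    if (blockIdx.x == 0)
        *device_formulas = scan_predicates[gridDim.x-1] + scan_result[gridDim.x-1];
    if (scan_predicates[blockIdx.x] == 1) {
        for (int j = 0; j < CLAUSES_ABS(FORMULA(&formulas_out,blockIdx.x)); j++)
            FORMULA(&formulas_in,scan_result[blockIdx.x])[j] = FORMULA(&formulas_out,blockIdx.x)[j];
        for (int j = 0; j <= VARIABLES(FORMULA(&formulas_out,blockIdx.x)); j++)
            ASSIGNMENT(&assign_in,scan_result[blockIdx.x])[j] = ASSIGNMENT(&assign_out,blockIdx.x)[j];
    }
}
```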
Now I know that the parallel version is far from ideal (only one thread per block is used, etc.), but I wanted to stay as close to the working sequential version as possible.
The only real difference between the kernels that I can see is the parallelism itself, which really should not be a problem here, should it?
If I start the program in debug mode, it crashes with “the launch timed out and was terminated.”, while in release mode it ends up in an infinite loop (which I can catch and exit).
The reason for the loop seems to be corrupted memory, or that no work gets done at all (which the “execution failed” message confirms), but I can’t figure out why this should happen.
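For reference, the error messages above come from the standard checking pattern after each launch — a sketch of it below, with the macro name purely illustrative: cudaGetLastError() reports launch failures (like the timeout), and the synchronize surfaces errors that occur during the asynchronous kernel execution.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative helper macro, not taken from my actual code.
#define CUDA_CHECK(call) do {                                       \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage after each kernel launch:
//   my_kernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());       // launch/configuration errors
//   CUDA_CHECK(cudaThreadSynchronize());  // errors during execution
```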
Some additional information: the whole program consists of two calls to kernels I wrote myself, with one cudpp scan operation (effectively a third kernel call) in between. These calls sit inside a loop that runs until the program exits. The scan results are used in both versions of the kernel and appear to be correct.
If someone has an idea what could cause this, any help will be greatly appreciated!
*edit:
I forgot to mention: in emudebug mode, both versions run fine.