CUDA result changes from time to time

Hi all.
I have a question about CUDA execution results varying over time.
My CUDA application has no time-dependent component or algorithm such as a rand function.
But its execution result differs from run to run.
What could the problem be? Is it related to CUDA?
Thanks in advance.

This could happen for any number of reasons. A likely cause is a bug in your code, such as

(1) race condition
(2) array access out of bounds
(3) reading uninitialized data
(4) invoking undefined C/C++/CUDA behavior
(5) undetected failing CUDA API call or kernel launch
(6) out of bounds access or uninitialized data in the host code

You would want to make sure to examine the status of every CUDA (or CUDA library, e.g. CUBLAS) API call and every CUDA kernel launch. If that comes up clean, run the app with cuda-memcheck, using both the out-of-bounds and the race-condition checkers. For the host code, run under control of valgrind or an equivalent tool. If possible, have your code reviewed by others.
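For example, a minimal error-checking pattern could look like this (the macro name, buffers, and kernel are just placeholders, not taken from your code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK_CUDA(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",           \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// wrap every API call:
CHECK_CUDA(cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice));

// and check every kernel launch:
myKernel<<<grid, block>>>(d_buf);
CHECK_CUDA(cudaGetLastError());        // launch/configuration errors
CHECK_CUDA(cudaDeviceSynchronize());   // errors during kernel execution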

If your code uses floating-point atomics, results may differ from run to run because floating-point operations are generally not associative, and the order in which data enters a computation (e.g. a sum) is non-deterministic when atomics are used.
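A trivial example of that effect (just an illustration, not your code):

__global__ void sum_kernel(const float *x, float *result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(result, x[i]);   // the additions happen in a different order each run,
                                   // so rounding error accumulates differently
}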

Thanks njuffa.
I will try checking my code with cuda-memcheck.

I checked my application with cuda-memcheck.
But it reports no errors.
I also don't use any floating-point atomics.
What else could the problem be?

It is impossible to pinpoint problems in code one cannot see.

Have you checked every item on the checklist I gave above? Does your code check the status of every CUDA API call and every kernel launch? As for undefined behavior, a relatively common mistake is to invoke __syncthreads() in divergent control flow, which causes undefined behavior per the CUDA specification. Have you checked for errors in the host code using a static checker or runtime checkers like valgrind?
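To illustrate the __syncthreads() issue (a generic sketch, not your code):

// undefined behavior: only part of the block reaches the barrier (assuming blockDim.x > 32)
if (threadIdx.x < 32)
{
    // ... work ...
    __syncthreads();
}

// well-defined: every thread of the block executes the barrier
if (threadIdx.x < 32)
{
    // ... work ...
}
__syncthreads();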

You might also want to try to narrow down (e.g. by data inspection) where differing data first enters the GPU computation and then propagates to the output of the kernel.

Host code is not the culprit.
I have a CPU counterpart that does the same work as the GPU code, and its result never changes.
OK, I will try to produce that copy version and debug the two applications in parallel, step by step, to find the first part where they differ.

@njuffa
I found the part that makes the application's result differ from run to run.
It is the critical section shown below.

__syncthreads();
if (threadIdx.x == 0)
	acquire_semaphore(&sem);              // one thread per block takes the global lock
__syncthreads();
for (int k = 0; k < blockDim.x; k++)      // serialize the threads of this block
{
	if (threadIdx.x == k)
	{
		gpStructArray[gnFaceNum] = xStruct;   // append this thread's result
		gnFaceNum++;
	}
	__syncthreads();
}
__syncthreads();
if (threadIdx.x == 0)
	release_semaphore(&sem);              // hand the lock back to the other blocks
__syncthreads();

In the above code, gpStructArray is a global struct array and gnFaceNum is a global int counter.
xStruct is the result struct produced by each thread.
I tried implementing the critical section in a different way; then the result is stable, with no change between runs.
But I can't use that version because of its poor time performance.
Why does the above code make the application's result differ from run to run, and how can I implement the critical section so that it is both deterministic and fast?
Thanks.

Do you really need the semaphore?
Your code looks like you just want to write the result xStruct to global memory. If that is what you want, then use something like

gpStructArray[blockIdx.x*blockDim.x+threadIdx.x] = xStruct;
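
In context, that is roughly this pattern (MyStruct and the kernel name are placeholders, since you did not post the full kernel):

__global__ void collect_results(MyStruct *gpStructArray /*, ... */)
{
    MyStruct xStruct;
    // ... each thread computes its own xStruct ...

    // every thread owns exactly one slot, so no lock is needed
    // and the layout of the array is identical on every run
    gpStructArray[blockIdx.x * blockDim.x + threadIdx.x] = xStruct;
}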

If you really need the semaphore, and assuming it works properly: your results may end up in a different order, since it is undetermined which block gets the semaphore first. Did you check this?

Thanks hadschi118.
I tried doing it the way you suggested.
But the number of threads is very large and the structure is also big.
So a lot of global memory is needed, and the post-processing of the resulting global struct array is also costly.

I didn't check the order in which the semaphore is acquired.
Why does it make the application's result change from run to run?

From the code section that you showed, it is not clear what you want to store. Since there is no condition in that part, it looks like you store everything (in a serialized way).

If I understand your semaphore correctly, you do the following:
The first threads of each block compete to get the semaphore resource. It is more or less random which of the blocks gets the semaphore first. That means the order in which the blocks write to the global array is random, too.
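I am assuming acquire_semaphore/release_semaphore are a simple spin lock roughly like this (just a guess, since you did not post them):

__device__ void acquire_semaphore(volatile int *sem)
{
    while (atomicCAS((int *)sem, 0, 1) != 0)
        ;   // spin until the lock becomes free
}

__device__ void release_semaphore(volatile int *sem)
{
    __threadfence();            // make the protected writes visible to other blocks
    atomicExch((int *)sem, 0);  // open the lock again
}

Whichever block's thread 0 wins that atomicCAS first is arbitrary and can change from run to run, so anything that depends on the order of the protected writes (such as the position a block's results get in gpStructArray via gnFaceNum) is non-deterministic.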

Without more details it is hard to give you better advice.

I thought my algorithm didn't depend on the order in which blocks write to the global array.
But that might be the culprit that makes the result differ.
Thanks.