PRNG produces nondeterministic results

Hello,

While running a simulation on a GTX 480, I got results that vary from run to run. As I couldn’t find the cause, I reduced the problem to the simplest kernel that exhibits this strange behaviour:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Parenthesized so the macro expands safely inside larger expressions.
#define N (1024*1024)

// Each thread advances its own xorshift state by one step and
// accumulates the new value into its slot of sums.
__global__ void indeterministic(unsigned long* sums, unsigned int* seeds) {
	int idx = blockIdx.x*blockDim.x+threadIdx.x;
	unsigned int ran = seeds[idx];
	ran ^= ran << 17;
	ran ^= ran >> 14;
	seeds[idx] = ran;
	sums[idx] += ran;
}

int main(int argc, char** argv) {
	unsigned long* sums;
	unsigned int* seeds;
	cudaMalloc(&sums, N*sizeof(unsigned long));
	cudaMalloc(&seeds, N*sizeof(unsigned int));

	// Non-zero seeds: a zero state would stay zero forever.
	unsigned int* h_seeds = (unsigned int*) malloc(N*sizeof(unsigned int));
	for(int i = 0; i < N; i++)
		h_seeds[i] = i+1;

	// Run the identical computation twice; both printed sums should match.
	for(int j = 0; j < 2; j++) {
		cudaMemset(sums, 0, N*sizeof(unsigned long));
		cudaMemcpy(seeds, h_seeds, N*sizeof(unsigned int), cudaMemcpyHostToDevice);
		for(int i = 0; i < 100000; i++) {
			indeterministic<<<N/256, 256>>>(sums, seeds);
		}

		unsigned long* h_sums = (unsigned long*) malloc(N*sizeof(unsigned long));
		cudaMemcpy(h_sums, sums, N*sizeof(unsigned long), cudaMemcpyDeviceToHost);
		unsigned long sum = 0;
		for(int i = 0; i < N; i++)
			sum += h_sums[i];
		printf("%lu\n", sum);
		free(h_sums);
	}

	cudaFree(sums);
	cudaFree(seeds);
	free(h_seeds);
	return(0);
}

Sometimes, when I run this code, outlier values appear in the printed results.

I’d appreciate any help.

What output do you get?
I ran it 20 times and got 13964483715582132224 printed (twice) each time.

Some small bugs (but none explain your problem).

  1. The xorshift PRNG you’re using requires its state to be non-zero. So the values thread 0 produces are in fact all zero.
  2. You’re missing a “cudaFree(seeds);” at the end.
  3. Your #define N is dangerous: if it were changed to a value that is not a multiple of 256, the integer division in the block count would silently drop the remainder, and you’d be computing a different sum than you asked for.
  4. “unsigned” as an unqualified type is still valid C, but it is prone to misunderstanding; writing “unsigned int” is clearer.
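Points 1 and 3 can be checked on the host without a GPU. The following is only a minimal sketch: xorshift_step repeats the two shifts from the kernel, and blocks_for is a hypothetical helper name (not part of the original code) that rounds the block count up instead of truncating it.

```cpp
#include <cassert>
#include <cstdint>

// Same two-shift update as in the kernel, run on the host.
// Zero is a fixed point: shifting and XORing 0 always yields 0 again.
inline std::uint32_t xorshift_step(std::uint32_t ran) {
    ran ^= ran << 17;
    ran ^= ran >> 14;
    return ran;
}

// Hypothetical helper: number of blocks needed to cover n elements,
// rounding up instead of truncating as a plain n / block_size would.
inline unsigned int blocks_for(unsigned int n, unsigned int block_size) {
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;
}
```

With a rounded-up block count, the last block may contain threads past the end of the arrays, so the kernel would also need a guard such as “if (idx < n)” before touching memory.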

On my GTX480 this takes about 30 seconds to run.

I don’t actually see any bug from scanning the code, but a caveat: it’s 2:30 in the morning for me.

Thanks for the reply. Those bugs probably just slipped in when I threw this code together to demonstrate the problem. I fixed all of them anyway and executed it again.

Result:

13964620536084108458

13964570768871854712

Sometimes I get the right results as well. Increasing the length of the run results in more and bigger discrepancies. Could this be a hardware problem?
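One way to narrow this down might be to compare every element copied back from the device against a host-side reference, rather than only comparing the total. The sketch below assumes the seeds are idx+1 as in the posted code and that the device-side unsigned long is 64 bits wide (which the size of the printed sums suggests); reference_sum and find_mismatches are hypothetical names introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Expected value of sums[idx] after `iterations` kernel launches for one seed.
// Assumption: the accumulator is 64 bits wide, as on 64-bit Linux.
inline std::uint64_t reference_sum(std::uint32_t seed, unsigned int iterations) {
    std::uint32_t ran = seed;
    std::uint64_t sum = 0;
    for (unsigned int i = 0; i < iterations; ++i) {
        ran ^= ran << 17;
        ran ^= ran >> 14;
        sum += ran;
    }
    return sum;
}

// Returns the indices whose value differs from the host reference.
// Assumption: element i was seeded with i + 1, as in the posted main().
inline std::vector<std::size_t> find_mismatches(
        const std::vector<std::uint64_t>& sums, unsigned int iterations) {
    std::vector<std::size_t> bad;
    for (std::size_t i = 0; i < sums.size(); ++i) {
        if (sums[i] != reference_sum(static_cast<std::uint32_t>(i + 1), iterations))
            bad.push_back(i);
    }
    return bad;
}
```

Recomputing all 1024*1024 elements for 100000 iterations on the CPU is slow, so a shorter run or a subset of elements may be more practical. If only a few indices disagree and they change between runs, that would be consistent with a hardware fault rather than a bug in the kernel.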

Maybe there is bad memory on your board? Try running cuda-memtest and see if it passes.

OK, tried that, but it passes all standard tests (0-8, 10). I might run test 9 (bit fade) and a longer stress test over the weekend.

[Edit:]

It seems the stress test has confirmed that bad memory is the culprit. Now we’ll have to get the GPU replaced.