Switch oddities: compiler bug?

I was fiddling with one of my kernels, and I found some odd behaviour which seems to be causing switch statements to evaluate incorrectly.

This works:

unsigned int a = 1;

for (...) {
    <do stuff based on a (value doesn't change)>

    if (a < 3) {
        a++;
        a %= 3;
    }
    else if (a == 3) {
        a = 1337;
    }
}

This also works:

unsigned int a = 1;

for (...) {
    <do stuff based on a (value doesn't change)>

    switch (a) {
        case 0: a = 1; break;
        case 1: a = 2; break;
        case 2: a = 0; break;
    }
}

This doesn’t:

unsigned int a = 1;

for (...) {
    <do stuff based on a (value doesn't change)>

    switch (a) {
        case 0: a = 1; break;
        case 1: a = 2; break;
        case 2: a = 0; break;
        case 3: a = 1337; break;
    }
}

The first and third code blocks are logically identical. The value of a should never be 3, yet adding a case for 3 to the switch statement changes the behaviour, while adding the equivalent check to the if-else chain doesn’t.
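To back up the claim that the two forms are logically identical, here is a minimal host-side sketch in plain C that compares both update rules over a small range of inputs. The names `step_if`, `step_switch` and `rules_agree` are made up for illustration; they are not taken from the kernel or the attached test file.

```c
#include <assert.h>

/* Update rule from the first snippet (if / else-if form). */
unsigned int step_if(unsigned int a)
{
    if (a < 3) {
        a++;
        a %= 3;
    } else if (a == 3) {
        a = 1337;
    }
    return a;
}

/* Update rule from the third snippet (switch with the extra case 3). */
unsigned int step_switch(unsigned int a)
{
    switch (a) {
    case 0: a = 1; break;
    case 1: a = 2; break;
    case 2: a = 0; break;
    case 3: a = 1337; break;
    }
    return a;
}

/* Returns 1 if both rules agree for every a in [0, limit), else 0. */
int rules_agree(unsigned int limit)
{
    unsigned int a;
    for (a = 0; a < limit; a++) {
        if (step_if(a) != step_switch(a)) {
            return 0;
        }
    }
    return 1;
}
```

On the host the two rules agree for every input, and starting from a = 1 the value only cycles through 1, 2, 0, so the case for 3 is never reached. Any difference on the device therefore has to come from somewhere other than the source-level logic.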

I’ve investigated slightly further. If I store the value of ‘a’ into a global variable before the switch statement, everything works. This implies to me that some sort of nasty optimization is going on somewhere, messing about with ‘a’ before it gets to the switch statement.

This works:

unsigned int a = 1;

for (...) {
    <do stuff based on a (value doesn't change)>

    some_global_memory[0] = a; // This line is vital!

    switch (a) {
        case 0: a = 1; break;
        case 1: a = 2; break;
        case 2: a = 0; break;
        case 3: a = 1337; break;
    }
}
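Since the store is needed anyway, one option is to store the value for every iteration rather than only into element 0, which gives a trace that can be checked against what the logic predicts. Below is a rough host-side sketch of the expected trace in plain C. `TRACE_LEN`, `trace_states` and `trace_is_valid` are hypothetical names, and the iteration count is an assumption because the real loop bound isn't shown above.

```c
#include <assert.h>

/* Hypothetical iteration count; the real loop bound is not shown in the post. */
#define TRACE_LEN 8

/* Runs the switch update TRACE_LEN times and records the value of a seen
   just before each switch, standing in for a per-iteration global store. */
void trace_states(unsigned int trace[TRACE_LEN])
{
    unsigned int a = 1;
    int i;
    for (i = 0; i < TRACE_LEN; i++) {
        trace[i] = a; /* plays the role of some_global_memory[i] = a */
        switch (a) {
        case 0: a = 1; break;
        case 1: a = 2; break;
        case 2: a = 0; break;
        case 3: a = 1337; break;
        }
    }
}

/* Returns 1 if every recorded state is one of 0, 1, 2; otherwise 0. */
int trace_is_valid(const unsigned int trace[TRACE_LEN])
{
    int i;
    for (i = 0; i < TRACE_LEN; i++) {
        if (trace[i] > 2) {
            return 0;
        }
    }
    return 1;
}
```

The host trace is 1, 2, 0, 1, 2, 0, and so on. If a trace copied back from the device ever shows 3 or 1337, that pins down the iteration where things go wrong, although the extra stores may themselves hide the problem, as the single store does.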

I’ve tried to produce a small test file showing this, but have thus far failed to reproduce it. Might try again tomorrow.

BTW:

XP Pro

CUDA 2.0

177.41

GeForce GTX 260

For good measure, make the variable volatile. This should have the same effect as storing it into a global.

EDIT: The volatile keyword has no effect. It seems to require a global store.

Okies - I have a simplish test program that reproduces the error. All the functions listed should give the same results, but test2_GPU only gives the correct results when run in emulation mode.

Here is my output when run on the GPU:

Size 3: GPU

0.000000

10.000000

10.000000

10.000000

20.000000

-----------

Size 4: GPU

0.000000

10.000000

20.000000

30.000000

30.000000

-----------

Size 3: CPU

0.000000

10.000000

10.000000

10.000000

20.000000

-----------

Size 4: CPU

0.000000

10.000000

10.000000

10.000000

20.000000

I think I’d still be quite surprised if this is a compiler error, but I just can’t see any logic errors, and the emulation mode works fine. The GPU and CPU code is identical and it only runs 1 thread in this test case.
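Rather than comparing the printed columns by eye, a small helper can report where two result arrays first diverge. This is only a sketch in plain C: `first_mismatch` and its tolerance parameter are made-up names and are not part of the attached test program.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the first element where the two result arrays differ
   by more than tol, or -1 if they match everywhere within tol. */
int first_mismatch(const double *gpu, const double *cpu, size_t n, double tol)
{
    size_t i;
    for (i = 0; i < n; i++) {
        double d = gpu[i] - cpu[i];
        if (d < 0.0) {
            d = -d; /* absolute difference without needing libm */
        }
        if (d > tol) {
            return (int)i;
        }
    }
    return -1;
}
```

Fed the numbers printed above, the Size 3 runs match everywhere, while the Size 4 GPU run (0, 10, 20, 30, 30) first departs from the CPU run (0, 10, 10, 10, 20) at index 2.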

EDIT: Attachment didn’t work

EDIT 2: See two posts down for attachment.

Well, at least your CPU code fails to initialize x[0] AFAICT

Oops :">. Must have uploaded an old version. Sorry!

Fixed it - gives the same results.
test.txt (5.1 KB)

Anybody else had a chance to look at this? I’m still at a loss to explain why it isn’t working.

What is the effect that you are seeing?

Is it an effect on performance OR program correctness?

I think it’s a bad idea to use doubles on the GPU. I changed everything to floats in your example and voila:

Size 3: GPU

0.000000

10.000000

10.000000

10.000000

20.000000

-----------

Size 4: GPU

0.000000

10.000000

10.000000

10.000000

20.000000

-----------

Size 3: CPU

0.000000

10.000000

10.000000

10.000000

20.000000

-----------

Size 4: CPU

0.000000

10.000000

10.000000

10.000000

20.000000

Program correctness. All of the runs should be calculating the same thing, however the second run (Size 4: GPU) produces different results.

My program needs to run in double precision and double precision is supported on my card, though it’s interesting that it works fine with single (or maybe just on your card/setup).

hmm the -arch sm_13 does make a difference with nvcc 2, but well i guess it’s just really beta and buggy anyways… what does your nvcc -V say?

mine seems a bit weird (I have 1.1 and 2.0 installed):

here’s the 2.0:

stephaga@biwidl02:~/cuda2/cuda/bin $ ./nvcc -V && which ./nvcc

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2007 NVIDIA Corporation

Built on Tue_Jun_10_05:42:45_PDT_2008

Cuda compilation tools, release 1.1, V0.2.1221

./nvcc

--> Cuda compilation tools, release 1.1, V0.2.1221 (shouldn’t this be release 2.0?)

here’s the 1.1

stephaga@biwidl02:~/cuda2/cuda/bin $ nvcc -V && which nvcc

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2006 NVIDIA Corporation

Built on Thu_Nov_29_19:14:37_PST_2007

Cuda compilation tools, release 1.1, V0.2.1221

/usr/sepp/bin/nvcc

stephaga@biwidl02:~/cuda2/cuda/bin $

This seems OK. Also, if I use this compiler, the -arch sm_13 option fails.

stephaga@biwidl02:~/cuda2 $ ~/cuda2/cuda/bin/nvcc ../cudabug2double.cu -I ~/cuda2/sdk/common/inc/ -o cudabug2double

stephaga@biwidl02:~/cuda2 $ ./cudabug2double

Size 3: GPU

524288.000000

524288.127197

0.000000

0.000000

0.000000

-----------

Size 4: GPU

524288.000000

524288.127197

0.000000

0.000000

0.000000

-----------

Size 3: CPU

0.000000

10.000000

10.000000

10.000000

20.000000

-----------

Size 4: CPU

0.000000

10.000000

10.000000

10.000000

20.000000

stephaga@biwidl02:~/cuda2 $ ~/cuda2/cuda/bin/nvcc -arch sm_13 ../cudabug2double.cu -I ~/cuda2/sdk/common/inc/ -o cudabug2double

stephaga@biwidl02:~/cuda2 $ ./cudabug2double

Size 3: GPU

0.000000

10.000000

10.000000

10.000000

20.000000

-----------

Size 4: GPU

0.000000

10.000000

20.000000

30.000000

39.992187

-----------

Size 3: CPU

0.000000

10.000000

10.000000

10.000000

20.000000

-----------

Size 4: CPU

0.000000

10.000000

10.000000

10.000000

20.000000

You initialize only the first element:

double xRegCache[3];	

xRegCache[0] = x[0];

It is difficult to see with all the for-s and switches whether or not you are using uninitialized memory. On the CPU the values often happen to be zero (though that is not guaranteed); on the GPU they are undefined. Is this the problem?

Edit: typo
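One way to check for uninitialized reads without tracing every branch is to poison the cache with NaN before the loop, so any element that is read before being written shows up as NaN in the output. The sketch below is plain C on the host. `init_cache`, `is_poisoned` and `CACHE_LEN` are hypothetical names; only xRegCache[3] and the x[0] assignment come from the posted code, and the approach is untested on the device.

```c
#include <assert.h>
#include <math.h> /* NAN macro only; no libm functions are called */

/* Matches the size of xRegCache in the posted code. */
#define CACHE_LEN 3

/* Fills the cache with NaN, then sets element 0 from x[0] as the original does. */
void init_cache(double cache[CACHE_LEN], const double *x)
{
    int i;
    for (i = 0; i < CACHE_LEN; i++) {
        cache[i] = NAN;
    }
    cache[0] = x[0];
}

/* Returns 1 if v is NaN (NaN is the only value that is not equal to itself). */
int is_poisoned(double v)
{
    return v != v;
}
```

Because NaN propagates through arithmetic, a result that comes back as NaN would point to a read of an element the loop had not yet written. A result that stays finite but wrong would suggest the problem lies elsewhere.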

The rest of the elements are initialised within the loop.

Interesting. If I manually initialize the variables I do indeed get different results for the GPU with 4, but I still get the same results for all the other tests. This means that the GPU version of the algorithm is accessing uninitialized variables, while an identical CPU version isn’t. Given that the GPU is just running one thread, this seems wrong.

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2007 NVIDIA Corporation

Built on Thu_Jun_12_01:14:00_PDT_2008

Cuda compilation tools, release 1.1, V0.2.1221

The timestamp is slightly different, but the release number is the same.

Just got back around to this part of the program. This error seems to be still here with CUDA 2.0.