Large switch cases messes up DP functions

NOTE: If you get bored, skip to paragraph 5.

I’m sorry for how long this is going to be, especially since it will require a bit of background before I can get to the problem. Additionally, I simplified the code as much as I could, but it is still around 150 lines. For reference, I’m running on Ubuntu 10.04 64-bit with CUDA 3.1 (problem also existed on 3.0) and 2x GTX 295 (although it doesn’t matter if I’m using 1 or 4 devices for this problem).

The code I’m working on involves evaluating an expression for every pixel in an image. The host code collects the results, does some magic, and and modifies the expression to be run (it’s not only beyond the scope of this problem, it’s not part of my work). The expression is a tree that can contain any combination of 74 different functions ranging from math and trig functions to getting the value of neighboring pixels. Evaluating the tree is simply a matter of traversing the tree and executing the function. Originally, this was recursive and done with function pointers, so I rewrote it to be iterative using three stacks (node pointers, values, and number of times visited). Unfortunately, without function pointers, I ended up with several very large switch statements. The first is based on the number of children the node has (0-3; the only node with 3 children is a conditional), and a helper function is called with another switch statement to finally execute the function.

While I improved on this a bit (nodes are decomposed into arrays for the GPU and stored in shared memory, etc.) and have achieved pretty significant speedup (even a bad solution is much faster when run parallel for hundreds of thousands, or even millions, of pixels), if anyone has a better solution, I’m all ears, especially since that would also eliminate this bug.

As for the actual problem, I had been using single precision for everything but decided to compare the results using double precision. When I did, I discovered that the DP log functions were returning incorrect values (maybe others, but I only know for sure that log and log10 don’t work). After much code reduction, I could not narrow it down to a single problem. Here is the code as compact as I could get it.

[codebox]#include <cutil_inline.h>

#include <cuda.h>

#define MAX_N 512

#define W 8

device float func0(int f, int n)

{

//remove any case statement and log is fine

switch(f) {

    case 0:

        return 0;

    case 1:

        return 1;

    case 2:

        return 2;

    case 3:

        return 3;

    case 4:

        return 4;

    case 5:

        return 5;

    case 6:

        return 6;

    case 7:

        return 7;

    case 8:

        return 8;

    case 9:

        return 9;

    case 10:

        return 10;

    case 11:

        return 11;

    case 12:

        return 12;

    case 13:

        return 13;

    case 14:

        return 14;

    case 15:

        return 15;

    case 16:

        return 16;

    case 17:

        return 17;

    case 18:

        return 18;

    case 19:

        return 19;

case 20:

	return 20;

case 21:

	return 21;

case 22:

	return 22;

//if 69 or below, log is fine

case 70:

	return 70;

default:

	return -1;

}

}

device float func1(int f, float n)

{

//remove any case statement or change any return to a function and log is fine

switch(f) {

case 0:

	return sqrtf(fabsf(n));

case 1:

	return sinf(n);

case 2:

	return cosf(n);

case 3:

	return tanf(n);

case 4:

	return coshf(n);

case 5:

	return tanhf(n);

default:

	return -1;

}

}

global void eval(short *function, double *g_result)

{

int n = threadIdx.x + blockDim.x*blockIdx.x + W*(threadIdx.y + blockDim.y*blockIdx.y);

//remove this and log is fine

if(n % W > W) {

	return;

}

short indices[10];

short np = 0;

short index = 0;

float values[10];

short init = 1;

short valp = 0;

float tmp;

while(np != 0 || init) {

	init = 0;

	//remove any line in this section and log is fine

	switch(function[index+1]) {

		case 0:

			tmp = func0(function[index], n);

			values[valp++] = tmp;

			break;

		case 1:

			tmp = func1(function[index], values[--valp]);

			values[valp++] = tmp;

			break;

		case 2:

			index = indices[--np];

			break;

	}

}

g_result[n] = log(1.0);

}

int main()

{

int N1 = 64;

double *h_pred, *g_pred;

short *h_function, *g_function;

int node_size = 2*sizeof(short);

cudaHostAlloc((void**) &h_pred, N1*sizeof(double), cudaHostAllocPortable);

cudaHostAlloc((void**) &h_function, node_size, cudaHostAllocPortable);

cudaMalloc((void**) &g_pred, N1*sizeof(double));

cudaMalloc((void**) &g_function, node_size);

h_function[0] = 1;

h_function[2] = 0;

cudaMemcpy(g_function, h_function, node_size, cudaMemcpyHostToDevice);

eval<<<8,8>>>(g_function, g_pred);

cudaMemcpy(h_pred, g_pred, N1*sizeof(double), cudaMemcpyDeviceToHost);

int i;

for (i = 0; i < 5; i++) {

	printf("i %d h_pred: %.18f\n",i,h_pred[i]);

}

return 0;

}[/codebox]

(compiled with nvcc -arch sm_13 -lcuda code.cu -I../../common/inc -o out.o)

As it is written, this code (for me, at least) outputs log(1.0) to be -0.693147180534525953 instead of 0. If I make any of the changes noted in the comments (and a few others such as using a constant instead of function[index]), it outputs 0.0000000. I have enough assembly experience to think of a few things it might be doing, but the assembly from this is around 1400 lines, and if I went through it line by line and found the problem, it probably wouldn’t help me figure out how to solve it in the code. I;d appreciate any help anyone might be able to give and would be happy to answer any questions about it.

if (f < 70) return f;
if (f < 70) return f;

I am unable to reproduce this problem. I tried CUDA 3.1 on Linux64 (RHEL 5.3) as well as Win64. The output of the program is:

i 0 h_pred: 0.000000000000000000
i 1 h_pred: 0.000000000000000000
i 2 h_pred: 0.000000000000000000
i 3 h_pred: 0.000000000000000000
i 4 h_pred: 0.000000000000000000

I am unable to reproduce this problem. I tried CUDA 3.1 on Linux64 (RHEL 5.3) as well as Win64. The output of the program is:

i 0 h_pred: 0.000000000000000000
i 1 h_pred: 0.000000000000000000
i 2 h_pred: 0.000000000000000000
i 3 h_pred: 0.000000000000000000
i 4 h_pred: 0.000000000000000000

Sorry. I guess I didn’t make it clear that I removed all of the useful code and reduced it to the bare minimum that would still produce the error. The actual code contains labels that are defined in a header and returns a function value, for example, the sqrtf in func1 is actually from case F_SQRT.

i 0 h_pred: -0.693147180534525953

i 1 h_pred: -0.693147180534525953

i 2 h_pred: -0.693147180534525953

i 3 h_pred: -0.693147180534525953

i 4 h_pred: -0.693147180534525953

What hardware are you using? I think it may be a bug with the GTX 295. I discovered that turning on use-fast-math causes it to not happen, so I think it may be some problem with generating the assembly.

Sorry. I guess I didn’t make it clear that I removed all of the useful code and reduced it to the bare minimum that would still produce the error. The actual code contains labels that are defined in a header and returns a function value, for example, the sqrtf in func1 is actually from case F_SQRT.

i 0 h_pred: -0.693147180534525953

i 1 h_pred: -0.693147180534525953

i 2 h_pred: -0.693147180534525953

i 3 h_pred: -0.693147180534525953

i 4 h_pred: -0.693147180534525953

What hardware are you using? I think it may be a bug with the GTX 295. I discovered that turning on use-fast-math causes it to not happen, so I think it may be some problem with generating the assembly.

I mistakenly thought I had a GT200-class GPU in my Linux system, when in fact I was running with a C2050. I now have a GTX285 in my 64-bit Linux box, and am able to reproduce the behavior in the original post with the 3.1 toolchain. I am also able to reproduce that the outputs change to zeros when the case label is changed from “case 70” to “case 69”. I agree this could be a compiler bug. I will take a closer look at this and file a compiler bug if necessary.

Thanks for bringing this issue to our attention.

I mistakenly thought I had a GT200-class GPU in my Linux system, when in fact I was running with a C2050. I now have a GTX285 in my 64-bit Linux box, and am able to reproduce the behavior in the original post with the 3.1 toolchain. I am also able to reproduce that the outputs change to zeros when the case label is changed from “case 70” to “case 69”. I agree this could be a compiler bug. I will take a closer look at this and file a compiler bug if necessary.

Thanks for bringing this issue to our attention.

Whoops. Need to put on my low-level CS hat (I just spent a semester writing Verilog HDL for a class in processor design, so you’d think I could remember these things). I checked the ptx code generated after changing “case 70” to “case 69”, and the difference was very minimal as you would expect, thus the assembler (or as I said like an idiot, “generating the assembly”) is fine. But since the results are so different, the problem does seem to be with the compiler.

I’m glad you were able to reproduce it.

Whoops. Need to put on my low-level CS hat (I just spent a semester writing Verilog HDL for a class in processor design, so you’d think I could remember these things). I checked the ptx code generated after changing “case 70” to “case 69”, and the difference was very minimal as you would expect, thus the assembler (or as I said like an idiot, “generating the assembly”) is fine. But since the results are so different, the problem does seem to be with the compiler.

I’m glad you were able to reproduce it.

I found and fixed the uninitialized h_function[1] in the source code. This doesn’t make a material difference in terms of repro. After comparing PTX and low-level assembly for the passing and failing case it seems to me that there is a subtle problem with the latter. I will go ahead and file a compiler bug. Thanks for your help.

I found and fixed the uninitialized h_function[1] in the source code. This doesn’t make a material difference in terms of repro. After comparing PTX and low-level assembly for the passing and failing case it seems to me that there is a subtle problem with the latter. I will go ahead and file a compiler bug. Thanks for your help.

No problem. Is there any way for me to follow that report or be notified if there news about it?

No problem. Is there any way for me to follow that report or be notified if there news about it?

Edit: Nevermind. I don’t know if it’s been fixed or not, but it wasn’t my problem.

Edit: Nevermind. I don’t know if it’s been fixed or not, but it wasn’t my problem.