NOTE: If you get bored, skip to paragraph 5.
I’m sorry for how long this is going to be, especially since it will require a bit of background before I can get to the problem. Additionally, I simplified the code as much as I could, but it is still around 150 lines. For reference, I’m running on Ubuntu 10.04 64-bit with CUDA 3.1 (problem also existed on 3.0) and 2x GTX 295 (although it doesn’t matter if I’m using 1 or 4 devices for this problem).
The code I’m working on evaluates an expression for every pixel in an image. The host code collects the results, does some magic, and modifies the expression to be run (the magic is not only beyond the scope of this problem, it’s not part of my work). The expression is a tree that can contain any combination of 74 different functions, ranging from math and trig functions to fetching the values of neighboring pixels. Evaluating the tree is simply a matter of traversing it and executing each function. Originally this was recursive and used function pointers, so I rewrote it to be iterative using three stacks (node pointers, values, and number of times visited). Unfortunately, without function pointers I ended up with several very large switch statements: the first switches on the number of children the node has (0–3; the only node with 3 children is a conditional), and a helper function with another switch statement finally executes the function.
While I have improved on this a bit (nodes are decomposed into arrays for the GPU, stored in shared memory, etc.) and achieved pretty significant speedup (even a bad solution is much faster when run in parallel over hundreds of thousands, or even millions, of pixels), if anyone has a better solution I’m all ears, especially since that would also eliminate this bug.
As for the actual problem: I had been using single precision for everything, but decided to compare the results using double precision. When I did, I discovered that the DP log functions were returning incorrect values (maybe others too, but I only know for sure that log and log10 don’t work). Despite much code reduction, I could not narrow it down to a single cause. Here is the code, as compact as I could get it.
[codebox]#include <cutil_inline.h>
#include <cuda.h>
#include <stdio.h>
#define MAX_N 512
#define W 8
__device__ float func0(int f, int n)
{
    //remove any case statement and log is fine
    switch (f) {
    case 0:
        return 0;
    case 1:
        return 1;
    case 2:
        return 2;
    case 3:
        return 3;
    case 4:
        return 4;
    case 5:
        return 5;
    case 6:
        return 6;
    case 7:
        return 7;
    case 8:
        return 8;
    case 9:
        return 9;
    case 10:
        return 10;
    case 11:
        return 11;
    case 12:
        return 12;
    case 13:
        return 13;
    case 14:
        return 14;
    case 15:
        return 15;
    case 16:
        return 16;
    case 17:
        return 17;
    case 18:
        return 18;
    case 19:
        return 19;
    case 20:
        return 20;
    case 21:
        return 21;
    case 22:
        return 22;
    //if 69 or below, log is fine
    case 70:
        return 70;
    default:
        return -1;
    }
}
__device__ float func1(int f, float n)
{
    //remove any case statement or change any return to a function and log is fine
    switch (f) {
    case 0:
        return sqrtf(fabsf(n));
    case 1:
        return sinf(n);
    case 2:
        return cosf(n);
    case 3:
        return tanf(n);
    case 4:
        return coshf(n);
    case 5:
        return tanhf(n);
    default:
        return -1;
    }
}
__global__ void eval(short *function, double *g_result)
{
    int n = threadIdx.x + blockDim.x*blockIdx.x + W*(threadIdx.y + blockDim.y*blockIdx.y);
    //remove this and log is fine
    if (n % W > W) {
        return;
    }
    short indices[10];
    short np = 0;
    short index = 0;
    float values[10];
    short init = 1;
    short valp = 0;
    float tmp;
    while (np != 0 || init) {
        init = 0;
        //remove any line in this section and log is fine
        switch (function[index+1]) {
        case 0:
            tmp = func0(function[index], n);
            values[valp++] = tmp;
            break;
        case 1:
            tmp = func1(function[index], values[--valp]);
            values[valp++] = tmp;
            break;
        case 2:
            index = indices[--np];
            break;
        }
    }
    g_result[n] = log(1.0);
}
int main()
{
    int N1 = 64;
    double *h_pred, *g_pred;
    short *h_function, *g_function;
    int node_size = 2*sizeof(short);
    cudaHostAlloc((void**) &h_pred, N1*sizeof(double), cudaHostAllocPortable);
    cudaHostAlloc((void**) &h_function, node_size, cudaHostAllocPortable);
    cudaMalloc((void**) &g_pred, N1*sizeof(double));
    cudaMalloc((void**) &g_function, node_size);
    h_function[0] = 1;
    h_function[1] = 0;  //was h_function[2], which is out of bounds for a 2-short buffer
    cudaMemcpy(g_function, h_function, node_size, cudaMemcpyHostToDevice);
    eval<<<8,8>>>(g_function, g_pred);
    cudaMemcpy(h_pred, g_pred, N1*sizeof(double), cudaMemcpyDeviceToHost);
    int i;
    for (i = 0; i < 5; i++) {
        printf("i %d h_pred: %.18f\n", i, h_pred[i]);
    }
    return 0;
}[/codebox]
(compiled with nvcc -arch sm_13 -lcuda code.cu -I../../common/inc -o out.o)
As written, this code (for me, at least) outputs log(1.0) as -0.693147180534525953 instead of 0 (suspiciously close to -ln 2, i.e. log(0.5)). If I make any of the changes noted in the comments (and a few others, such as using a constant instead of function[index]), it outputs 0.0000000. I have enough assembly experience to think of a few things the compiler might be doing, but the assembly from this is around 1400 lines, and even if I went through it line by line and found the problem, it probably wouldn’t help me figure out how to fix it in the source. I’d appreciate any help anyone might be able to give and would be happy to answer any questions about it.