I tried my hand at reduction today and burnt all my fingers… :-(
This is just a plain addition kernel. It works fine if the “floating” point numbers have zero fractional part (like integers).
However, the momment I use fractional parts (randomly generated floating point numbers), I get in-correct results.
I am aware of the 80-bit thing in x87 and I disable it explicity in gcc options… Here is my makefile:
Even double precision falters at times… but not as badly as singles. And, this is on a TESLA C1060.
Here is my reduction code. Each block reduces to one value… I need to call this in a loop until the number of blocks is 1. But the code fails for even 1 block and 9 elements (4 also fails). It works well if I use thread 0 to sum all elements in a sequential way. Then there is no parallelism… I mainly do this to make sure that there are NO precision related issues i.e. apple to apple comparisons only.
[codebox]
27 global void REDUCE_ADD(REAL *data, int n, REAL *result)
28 {
29 extern device volatile REAL cache;
30 shared volatile int base;
31 int totalData;
32 int fold;
33 REAL temp;
34
35 if (threadIdx.x == 0)
36 {
37 base = (blockIdx.x*(blockDim.x << 1));
38 }
39 __syncthreads();
40
41 if (base >= n)
42 {
43 return;
44 }
45
46 if ((base + threadIdx.x) < n)
47 {
48 cache[threadIdx.x] = data[base + threadIdx.x];
49 //printf("GPU LOAD Thread %d: %f ", threadIdx.x, cache[threadIdx.x]);
50 }
51
52 if ((base + blockDim.x + threadIdx.x) < n)
53 {
54 cache[threadIdx.x] = cache[threadIdx.x] + data[base + blockDim.x + threadIdx.x];
55 }
56
57 __syncthreads();
58
59 totalData = (n - base) > (blockDim.x) ? (blockDim.x) : (n - base);
60 fold = (totalData >> 1);
61 while(fold)
62 {
63 if (threadIdx.x < fold)
64 {
65 //printf(“Adding %f, %f\n”, cache[threadIdx.x], cache[threadIdx.x + fold]);
66 cache[threadIdx.x] = cache[threadIdx.x] + cache[threadIdx.x + fold];
67 if(threadIdx.x == 0)
68 {
69 if ((totalData & 1) == 1)
71 cache[threadIdx.x] = cache[threadIdx.x] + cache[totalData -1];
72 }
73 }
74 }
75
76 totalData = fold;
77 fold = fold >> 1;
78 //if (threadIdx.x ==0)
79 //printf(“\n\n”);
80 __syncthreads();
81
82 }
83 __syncthreads();
84
85 if (threadIdx.x == 0)
86 {
87 result[blockIdx.x] = cache[0];
88 }
89 return;
90 }
[/codebox]
I compare the CPU result and GPU result using an IF statement (NO human comparison) before throwing error…
And, it fails randomly… frequently… though sometimes it works fine… (for integer like floats, it works fine all the time)
Can some1 help me here please?
Specs:
Ubuntu Linux 9.04, CUDA 2.2, TESLA C1060
Best Regards,
Sarnath