float reduction, cpu and cuda answers differ

Hi all,

This is a bit of a puzzler for me. I’ve implemented basically a carbon copy of the reduce2 kernel from the SDK but i’m working with a reduction of float3 objects instead of int’s. Performance issues aside i’ve noticed some slightly strange behaviour from my code, I have been testing the code by setting the initial values of the in array to be their index (cast as a float). So the reduction of this array, of length n, should give be

Sn = (n/2)(2 + (n-1))

see for instance: http://en.wikipedia.org/wiki/Arithmetic_progression

So if we have n = 2^20 we get

Sn = 549756338176.000000 (from calculation)

Sn_cuda = 549755289600.000000 (from cuda kernel)

Sn_cpu = 549821087744.000000 (from the cpu)

it seems to provide the correct answer for the first few significant figures but then both the cuda and the cpu answers are quite different from the calculated one. I’ve never really worked with either reductions before and also have no real feeling for the limitations of floats since i’ve always worked with doubles. Is this error creeping in from the accumulation of FPU error or something insidious to do with casting?

Should i be looking at the assembly and PTX code here? I don’t really know anything about troubleshooting in this way. I’m running on 64bit linux on an ultra.

Thanks very much for looking.

my kernel looks like this, i guess this is not efficient but the real answer is more important at the moment.

__global__ void

reduce(float3 *g_idata, float3 *g_odata)

{

  /* the shared memory is saved on a per block basis! so you only need sizeof_data * blocksize 

    * memory size when you call the kernel */

    extern __shared__ float3 sdata[]; 

   // load shared mem

    unsigned int tid = threadIdx.x;

    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[tid].x = g_idata[i].x; // we load our block's shared memory from the global

    sdata[tid].y = g_idata[i].y; // we have to load x,y,z 

    sdata[tid].z = g_idata[i].z; 

  

    __syncthreads();

   // do reduction in shared mem. stride over the block and

    // reduce the parts until we get down to a single value (sdata[0])

    for(unsigned int s=blockDim.x/2; s>0; s>>=1) {

        if (tid < s) {

            sdata[tid].x += sdata[tid + s].x;

      sdata[tid].y += sdata[tid + s].y;

      sdata[tid].z += sdata[tid + s].z;

        }

        __syncthreads();

    }

   // write result for this block to global mem

    if (tid == 0) {

    	g_odata[blockIdx.x].x = sdata[0].x;

    	g_odata[blockIdx.x].y = sdata[0].y;

    	g_odata[blockIdx.x].z = sdata[0].z;

    }

}

and my calling function looks like this

//! this is to test my home modified float3 reduction 

/* include the things we need*/

#include <stdlib.h>

#include <string.h>

#include <math.h>

#include <stdio.h>

#include <gsl/gsl_rng.h>

#include <sys/time.h>

/* cuda things */

#include <cuda.h>

#include <ctime>

#include "reduction_kernel_2.cu" /* get the kernel */

#define BLOCKMAX 512 // this is the largest block we can launch

void fire_up_cuda(void);

void cpu_reduction(float3* in_list, float3* out_list, float3* the_answer, int list_size);

void cuda_reduction(float3* in_list, float3* out_list, float3* the_answer, int list_size);

/* we init each of the cells in the array to it's index value,

 * as such the sum is given by Sn = n/2 (2 + (n-1)) which is a nice easy 

 * way to check it works */

int main (void){

	int list_size = 1048576; /* how many float3's we're going to reduce */

	

	fire_up_cuda(); /* start up the card */

	

	float3* h_in_list = 0x0;	/* allocate memory on the host */

	float3* h_out_list = 0x0;

	cudaMallocHost((void**)&h_in_list, sizeof(float3)*list_size);

	cudaMallocHost((void**)&h_out_list, sizeof(float3)*list_size);

	

	for (int i = 0; i < list_size; i++){ /* set everything on the h_in_list to i to test */ 

  h_in_list[i].x = (float)i;

  h_in_list[i].y = (float)i;

  h_in_list[i].z = (float)i;

  /*h_in_list[i].x = 1.0e-3;

  h_in_list[i].y = 1.0e-3;

  h_in_list[i].z = 1.0e-3;*/

  h_out_list[i].x = 0; /* set everythign on the h_out_list to 0 just for safety */

  h_out_list[i].y = 0;

  h_out_list[i].z = 0;

	}

	

	float3 the_answer = make_float3(0.0, 0.0, 0.0);

	float3 cpu_answer = make_float3(0.0, 0.0, 0.0);

	cpu_reduction(h_in_list, 0x0, &cpu_answer, list_size); // do the reduction on the cpu	

	cuda_reduction(h_in_list, h_out_list, &the_answer, list_size); // do the reduction on the gpu

	

	printf("finished reduction\n");

	printf("cuda answer: %f\t%f\t%f\n", the_answer.x, the_answer.y, the_answer.z);

	printf("cpu answer: %f\t%f\t%f\n", cpu_answer.x, cpu_answer.y, cpu_answer.z);

	float calculated_answer = (((float)list_size)/2.0)*(2.0+(((float)list_size)-1.0));

	printf("the sequence should sum to: %f\n", calculated_answer);

	cudaFreeHost(h_in_list);

	cudaFreeHost(h_out_list);

	return(0);

}

//! performs a cpu reduction on the in_list, and the answer is plonked into the_answer

/** this function does a reduction on the in_list and stores the result in the_answer

 * @param in_list is an allocated array which is summed over

 * @param out_list is not used in this fn

 * @param the_answer is set to the sum of in_list on return

 * @param is the list_size we need blocksize to be a divisor of list_size

 */

void cpu_reduction(float3* in_list, float3* out_list, float3* the_answer, int list_size){

	float3 counter = make_float3(0.0, 0.0, 0.0);

	

	for (int i = 0; i < list_size; i++){

  counter.x += in_list[i].x;

  counter.y += in_list[i].y;

  counter.z += in_list[i].z;

	}

	

	the_answer->x = counter.x;

	the_answer->y = counter.y;

	the_answer->z = counter.z;

	

}

//! performs a cuda reduction on the in_list into the out_list and returns the answer as a float3 

/** this function performs the reduction on the given in_list, using the out_list as a holding space in

 * the memory on the card. 

 * @param in_list this should be a cudaMallocHost'd array (for DMA transfers) and should be 

 *  full of the float3's that need to be summed

 * @param out_list this should be a cudaMallocHost'd array (for DMA transfers) and should probably

 *  	be allocated to zero's to stop any untoward nastiness

 * @param list_size at the moment this function will really only work if the internal variable blocksize is

 * a divisor of list_size, otherwise we're going to get into a mess. The way to fix this is probably

 * to pad the list with zero's if it's not a divisor (although this might be a bit crappy)

 * @param the_answer on return this is set to the result of the reduction of the in_list

 */ 

void cuda_reduction(float3* in_list, float3* out_list, float3* the_answer, int list_size){

	// allocate memory on the card

	float3* d_in_list = 0x0;

	float3* d_out_list = 0x0;

	

	int mem_size = sizeof(float3)*list_size;

	cudaMalloc((void**)&d_in_list, mem_size);

	cudaMalloc((void**)&d_out_list, mem_size);

	/* copy data onto the card */

	cudaMemcpy(d_in_list, in_list, mem_size, cudaMemcpyHostToDevice);  

	cudaMemcpy(d_out_list, out_list, mem_size, cudaMemcpyHostToDevice);

	

	/* now we have to work out what block size etc to use and call the reduction 

  * we need to reduce multiple times, each reduction takes the initial 

  * list_size and reduces it by list_size / blocksize */ 

	int blocksize = BLOCKMAX; // how many threads per block (ATM this has to be a divisor of list_size)

	int gridsize = list_size / blocksize; // how many blocks in total to do the job 

	/* do the first reduction */

	dim3 dimBlock(blocksize, 1, 1); // the size of a block

	dim3 dimGrid(gridsize, 1, 1);  // the number of blocks

	int smemsize = blocksize*sizeof(float3); // how much smem we need per kernel invocation

	reduce<<<dimGrid, dimBlock, smemsize>>>(d_in_list, d_out_list); 

	

	

	int temp_size = list_size / 2; // since we've done one reduction already

	

	while(temp_size > 1){ // loop and reduce, the rest of the reductions are in place

  	blocksize = (temp_size < BLOCKMAX) ? temp_size : BLOCKMAX; // pick how many blocks to do

  	gridsize = temp_size / blocksize; 

  	

  	// just to show what's going on

  	printf("temp_size = %d, blocksize = %d, gridsize = %d\n", temp_size, blocksize, gridsize);

  	

  	dim3 dimBlock(blocksize, 1, 1); // the size of a block

  	dim3 dimGrid(gridsize, 1, 1);  // the number of blocks

  	int smemsize = blocksize*sizeof(float3); // how much smem we need per kernel invocation

  	reduce<<<dimGrid, dimBlock, smemsize>>>(d_out_list, d_out_list); // call the kernel

  	

  	temp_size = temp_size / blocksize; // reduce the temp_size by the no of threads

	}

	

	

	// now we're done, copy back the temp_size values (should be 1)

	cudaMemcpy(out_list, d_out_list, sizeof(float3)*temp_size, cudaMemcpyDeviceToHost);

	the_answer->x = out_list[0].x;

	the_answer->y = out_list[0].y;

	the_answer->z = out_list[0].z;

	

	cudaFree(d_in_list);

	cudaFree(d_out_list);	

}

//! scans for cuda devices and turns one on. 

/** this queries for devices supporting cuda and then turns on the first one  

 * that's available. I'm not sure how this would work with multiple cards */

void fire_up_cuda(void){

	// fire up the cuda device

	int deviceCount;

	cudaGetDeviceCount(&deviceCount); // find out how many things there are

	if (deviceCount == 0){

  fprintf(stderr, "there is no device, bad news...\n");

  exit(1);	

	} else {

  fprintf(stderr, "deviceCount: %d\n", deviceCount);

	}

	int dev;

	for(dev = 0; dev < deviceCount; ++dev) {

  cudaDeviceProp deviceProp;

  cudaGetDeviceProperties(&deviceProp, dev);

  if(deviceProp.major >= 1) {

  	fprintf(stderr, "device supports cuda, neat\n");

  	break;

  }	

	}

	if (dev == deviceCount) {

  	fprintf(stderr, "there is no cuda device, sorry...\n");

  	exit(1);

	} else {

  fprintf(stderr, "cuda is good to go...\n");

  cudaSetDevice(dev); // make this device go

	}

}

also, i’m compiling with -g -O0 so there shouldn’t be any horrid optimization problems in the CPU

The reason for that is floating point arithmetic is not commutative.

(A + B ) + C is not neccessarily equal to A + (B + C). Especially when the number
of added values grows, you’ll notice loss of significance and cancellations.
So the sequential sum a_0 + a_1 + … + a_n will probably differ from
any permutation a_(i0) + a_(i1) + … + a_(in) – usually, this should not be a great
problem. I mean, who says that the sequential result is nearer to the truth?

See also:
http://en.wikipedia.org/wiki/Kahan_summation_algorithm

thanks guys, that’s very interesting :)
I should be able to fix things now