float reduction, cpu and cuda answers differ

relain · March 31, 2008, 11:32pm

Hi all,

This is a bit of a puzzler for me. I’ve implemented basically a carbon copy of the reduce2 kernel from the SDK but i’m working with a reduction of float3 objects instead of int’s. Performance issues aside i’ve noticed some slightly strange behaviour from my code, I have been testing the code by setting the initial values of the in array to be their index (cast as a float). So the reduction of this array, of length n, should give be

Sn = (n/2)(2 + (n-1))

see for instance: http://en.wikipedia.org/wiki/Arithmetic_progression

So if we have n = 2^20 we get

Sn = 549756338176.000000 (from calculation)

Sn_cuda = 549755289600.000000 (from cuda kernel)

Sn_cpu = 549821087744.000000 (from the cpu)

it seems to provide the correct answer for the first few significant figures but then both the cuda and the cpu answers are quite different from the calculated one. I’ve never really worked with either reductions before and also have no real feeling for the limitations of floats since i’ve always worked with doubles. Is this error creeping in from the accumulation of FPU error or something insidious to do with casting?

Should i be looking at the assembly and PTX code here? I don’t really know anything about troubleshooting in this way. I’m running on 64bit linux on an ultra.

Thanks very much for looking.

my kernel looks like this, i guess this is not efficient but the real answer is more important at the moment.

__global__ void

reduce(float3 *g_idata, float3 *g_odata)

{

  /* the shared memory is saved on a per block basis! so you only need sizeof_data * blocksize 

    * memory size when you call the kernel */

    extern __shared__ float3 sdata[]; 

   // load shared mem

    unsigned int tid = threadIdx.x;

    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[tid].x = g_idata[i].x; // we load our block's shared memory from the global

    sdata[tid].y = g_idata[i].y; // we have to load x,y,z 

    sdata[tid].z = g_idata[i].z; 

  

    __syncthreads();

   // do reduction in shared mem. stride over the block and

    // reduce the parts until we get down to a single value (sdata[0])

    for(unsigned int s=blockDim.x/2; s>0; s>>=1) {

        if (tid < s) {

            sdata[tid].x += sdata[tid + s].x;

      sdata[tid].y += sdata[tid + s].y;

      sdata[tid].z += sdata[tid + s].z;

        }

        __syncthreads();

    }

   // write result for this block to global mem

    if (tid == 0) {

    	g_odata[blockIdx.x].x = sdata[0].x;

    	g_odata[blockIdx.x].y = sdata[0].y;

    	g_odata[blockIdx.x].z = sdata[0].z;

    }

}

and my calling function looks like this

//! this is to test my home modified float3 reduction 

/* include the things we need*/

#include <stdlib.h>

#include <string.h>

#include <math.h>

#include <stdio.h>

#include <gsl/gsl_rng.h>

#include <sys/time.h>

/* cuda things */

#include <cuda.h>

#include <ctime>

#include "reduction_kernel_2.cu" /* get the kernel */

#define BLOCKMAX 512 // this is the largest block we can launch

void fire_up_cuda(void);

void cpu_reduction(float3* in_list, float3* out_list, float3* the_answer, int list_size);

void cuda_reduction(float3* in_list, float3* out_list, float3* the_answer, int list_size);

/* we init each of the cells in the array to it's index value,

 * as such the sum is given by Sn = n/2 (2 + (n-1)) which is a nice easy 

 * way to check it works */

int main (void){

	int list_size = 1048576; /* how many float3's we're going to reduce */

	

	fire_up_cuda(); /* start up the card */

	

	float3* h_in_list = 0x0;	/* allocate memory on the host */

	float3* h_out_list = 0x0;

	cudaMallocHost((void**)&h_in_list, sizeof(float3)*list_size);

	cudaMallocHost((void**)&h_out_list, sizeof(float3)*list_size);

	

	for (int i = 0; i < list_size; i++){ /* set everything on the h_in_list to i to test */ 

  h_in_list[i].x = (float)i;

  h_in_list[i].y = (float)i;

  h_in_list[i].z = (float)i;

  /*h_in_list[i].x = 1.0e-3;

  h_in_list[i].y = 1.0e-3;

  h_in_list[i].z = 1.0e-3;*/

  h_out_list[i].x = 0; /* set everythign on the h_out_list to 0 just for safety */

  h_out_list[i].y = 0;

  h_out_list[i].z = 0;

	}

	

	float3 the_answer = make_float3(0.0, 0.0, 0.0);

	float3 cpu_answer = make_float3(0.0, 0.0, 0.0);

	cpu_reduction(h_in_list, 0x0, &cpu_answer, list_size); // do the reduction on the cpu	

	cuda_reduction(h_in_list, h_out_list, &the_answer, list_size); // do the reduction on the gpu

	

	printf("finished reduction\n");

	printf("cuda answer: %f\t%f\t%f\n", the_answer.x, the_answer.y, the_answer.z);

	printf("cpu answer: %f\t%f\t%f\n", cpu_answer.x, cpu_answer.y, cpu_answer.z);

	float calculated_answer = (((float)list_size)/2.0)*(2.0+(((float)list_size)-1.0));

	printf("the sequence should sum to: %f\n", calculated_answer);

	cudaFreeHost(h_in_list);

	cudaFreeHost(h_out_list);

	return(0);

}

//! performs a cpu reduction on the in_list, and the answer is plonked into the_answer

/** this function does a reduction on the in_list and stores the result in the_answer

 * @param in_list is an allocated array which is summed over

 * @param out_list is not used in this fn

 * @param the_answer is set to the sum of in_list on return

 * @param is the list_size we need blocksize to be a divisor of list_size

 */

void cpu_reduction(float3* in_list, float3* out_list, float3* the_answer, int list_size){

	float3 counter = make_float3(0.0, 0.0, 0.0);

	

	for (int i = 0; i < list_size; i++){

  counter.x += in_list[i].x;

  counter.y += in_list[i].y;

  counter.z += in_list[i].z;

	}

	

	the_answer->x = counter.x;

	the_answer->y = counter.y;

	the_answer->z = counter.z;

	

}

//! performs a cuda reduction on the in_list into the out_list and returns the answer as a float3 

/** this function performs the reduction on the given in_list, using the out_list as a holding space in

 * the memory on the card. 

 * @param in_list this should be a cudaMallocHost'd array (for DMA transfers) and should be 

 *  full of the float3's that need to be summed

 * @param out_list this should be a cudaMallocHost'd array (for DMA transfers) and should probably

 *  	be allocated to zero's to stop any untoward nastiness

 * @param list_size at the moment this function will really only work if the internal variable blocksize is

 * a divisor of list_size, otherwise we're going to get into a mess. The way to fix this is probably

 * to pad the list with zero's if it's not a divisor (although this might be a bit crappy)

 * @param the_answer on return this is set to the result of the reduction of the in_list

 */ 

void cuda_reduction(float3* in_list, float3* out_list, float3* the_answer, int list_size){

	// allocate memory on the card

	float3* d_in_list = 0x0;

	float3* d_out_list = 0x0;

	

	int mem_size = sizeof(float3)*list_size;

	cudaMalloc((void**)&d_in_list, mem_size);

	cudaMalloc((void**)&d_out_list, mem_size);

	/* copy data onto the card */

	cudaMemcpy(d_in_list, in_list, mem_size, cudaMemcpyHostToDevice);  

	cudaMemcpy(d_out_list, out_list, mem_size, cudaMemcpyHostToDevice);

	

	/* now we have to work out what block size etc to use and call the reduction 

  * we need to reduce multiple times, each reduction takes the initial 

  * list_size and reduces it by list_size / blocksize */ 

	int blocksize = BLOCKMAX; // how many threads per block (ATM this has to be a divisor of list_size)

	int gridsize = list_size / blocksize; // how many blocks in total to do the job 

	/* do the first reduction */

	dim3 dimBlock(blocksize, 1, 1); // the size of a block

	dim3 dimGrid(gridsize, 1, 1);  // the number of blocks

	int smemsize = blocksize*sizeof(float3); // how much smem we need per kernel invocation

	reduce<<<dimGrid, dimBlock, smemsize>>>(d_in_list, d_out_list); 

	

	

	int temp_size = list_size / 2; // since we've done one reduction already

	

	while(temp_size > 1){ // loop and reduce, the rest of the reductions are in place

  	blocksize = (temp_size < BLOCKMAX) ? temp_size : BLOCKMAX; // pick how many blocks to do

  	gridsize = temp_size / blocksize; 

  	

  	// just to show what's going on

  	printf("temp_size = %d, blocksize = %d, gridsize = %d\n", temp_size, blocksize, gridsize);

  	

  	dim3 dimBlock(blocksize, 1, 1); // the size of a block

  	dim3 dimGrid(gridsize, 1, 1);  // the number of blocks

  	int smemsize = blocksize*sizeof(float3); // how much smem we need per kernel invocation

  	reduce<<<dimGrid, dimBlock, smemsize>>>(d_out_list, d_out_list); // call the kernel

  	

  	temp_size = temp_size / blocksize; // reduce the temp_size by the no of threads

	}

	

	

	// now we're done, copy back the temp_size values (should be 1)

	cudaMemcpy(out_list, d_out_list, sizeof(float3)*temp_size, cudaMemcpyDeviceToHost);

	the_answer->x = out_list[0].x;

	the_answer->y = out_list[0].y;

	the_answer->z = out_list[0].z;

	

	cudaFree(d_in_list);

	cudaFree(d_out_list);	

}

//! scans for cuda devices and turns one on. 

/** this queries for devices supporting cuda and then turns on the first one  

 * that's available. I'm not sure how this would work with multiple cards */

void fire_up_cuda(void){

	// fire up the cuda device

	int deviceCount;

	cudaGetDeviceCount(&deviceCount); // find out how many things there are

	if (deviceCount == 0){

  fprintf(stderr, "there is no device, bad news...\n");

  exit(1);	

	} else {

  fprintf(stderr, "deviceCount: %d\n", deviceCount);

	}

	int dev;

	for(dev = 0; dev < deviceCount; ++dev) {

  cudaDeviceProp deviceProp;

  cudaGetDeviceProperties(&deviceProp, dev);

  if(deviceProp.major >= 1) {

  	fprintf(stderr, "device supports cuda, neat\n");

  	break;

  }	

	}

	if (dev == deviceCount) {

  	fprintf(stderr, "there is no cuda device, sorry...\n");

  	exit(1);

	} else {

  fprintf(stderr, "cuda is good to go...\n");

  cudaSetDevice(dev); // make this device go

	}

}

relain · March 31, 2008, 11:38pm

also, i’m compiling with -g -O0 so there shouldn’t be any horrid optimization problems in the CPU

_chefkoch · April 1, 2008, 5:36am

The reason for that is floating point arithmetic is not commutative.

(A + B ) + C is not neccessarily equal to A + (B + C). Especially when the number
of added values grows, you’ll notice loss of significance and cancellations.
So the sequential sum a_0 + a_1 + … + a_n will probably differ from
any permutation a_(i0) + a_(i1) + … + a_(in) – usually, this should not be a great
problem. I mean, who says that the sequential result is nearer to the truth?

Simon_Green · April 1, 2008, 8:48am

See also:
[url=“Kahan summation algorithm - Wikipedia”]http://en.wikipedia.org/wiki/Kahan_summation_algorithm[/url]

relain · April 1, 2008, 5:27pm

thanks guys, that’s very interesting :)
I should be able to fix things now

Topic		Replies	Views
Basic reduction with CUDA CUDA Programming and Performance	1	567	March 22, 2018
Reduction Reduction Reduction................. Precision Confusion Race Condition...... HELP! CUDA Programming and Performance	16	10615	December 8, 2009
Why these two scripts give different results? CUDA Programming and Performance	5	1863	June 19, 2012
Reduction kernel issues with data larger than 16M floats CUDA Programming and Performance	5	1031	June 6, 2017
Result of reduction in GPU do not match with the CPU's, also GPU's result vary with blocksize Legacy PGI Compilers	4	979	June 23, 2020
Precision error - CUDA kernel Vs CPU Reduction kernel with fabs and differnce operations on float CUDA Programming and Performance	3	2097	May 6, 2011
Reduction questions(newbie-ish) CUDA Programming and Performance	7	1890	January 14, 2009
Reduction sample in SDK CUDA Programming and Performance	4	3713	February 23, 2010
Reduction random errors Reduction kernel turns weird values CUDA Programming and Performance	2	833	February 7, 2011
CUDA reduction precision issues CUDA Programming and Performance cuda	1	454	December 13, 2022

float reduction, cpu and cuda answers differ

Related topics