A strange bug in CUDA computation

I wrote a small CUDA program to check the GPU's computation precision against a CPU result, giving both the same input data. But when the input size is 130560 (or a multiple of that number), the result is very strange. Here is my code:

mytest.cu:

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <cutil.h>

// includes, kernels
#include <mytest_kernel.cu>

// Integer division, rounding up
int iDivUp(int a, int b)
{
    return ((a % b) != 0) ? (a / b + 1) : (a / b);
}

////////////////////////////////////////////////////////////////////////////////
// declaration, forward
////////////////////////////////////////////////////////////////////////////////
void runTest(int argc, char **argv);

extern "C"
void computeGold(unsigned int *h_Data, unsigned int *h_Result, int dataN);

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
    runTest(argc, argv);
    CUT_EXIT(argc, argv);
}

////////////////////////////////////////////////////////////////////////////////
//! Run a simple test for CUDA
////////////////////////////////////////////////////////////////////////////////
void runTest(int argc, char **argv)
{
    CUT_DEVICE_INIT();

    int i;
    const int dataN    = 130560;
    const int dataSize = dataN * sizeof(unsigned int);
    const int BIN      = 256;
    const int BIN_Size = BIN * sizeof(unsigned int);

    unsigned int *h_Data   = (unsigned int *)malloc(dataSize);
    unsigned int *h_Result = (unsigned int *)malloc(BIN_Size);
    memset(h_Result, 0, BIN_Size);

    // Fill the input with random byte values
    srand(2007);
    for (i = 0; i < dataN; i++)
        h_Data[i] = rand() % 256;

    // Build a 256-bin histogram of the input...
    for (i = 0; i < dataN; i++)
    {
        unsigned int data1 = h_Data[i] & 0xff;
        h_Result[data1]++;
    }

    // ...and turn it into a cumulative histogram
    for (i = 1; i < BIN; i++)
        h_Result[i] += h_Result[i - 1];

    unsigned int *d_Result, *d_Data;
    CUDA_SAFE_CALL( cudaMalloc((void **)&d_Data, dataSize) );
    CUDA_SAFE_CALL( cudaMalloc((void **)&d_Result, BIN_Size) );
    CUDA_SAFE_CALL( cudaMemcpy(d_Data, h_Data, dataSize, cudaMemcpyHostToDevice) );
    CUDA_SAFE_CALL( cudaMemcpy(d_Result, h_Result, BIN_Size, cudaMemcpyHostToDevice) );

    // One thread per input element, 256 threads per block
    int blocks = iDivUp(dataN, 256);
    testKernel<<<blocks, 256>>>(d_Data, d_Result, dataN);
    CUT_CHECK_ERROR("testKernel execution failed.\n");

    unsigned int *h_Data_d = (unsigned int *)malloc(dataSize);
    CUDA_SAFE_CALL( cudaMemcpy(h_Data_d, d_Data, dataSize, cudaMemcpyDeviceToHost) );

    // CPU reference; overwrites h_Data with the gold result
    computeGold(h_Data, h_Result, dataN);

    // Compare the GPU output against the CPU reference
    int sum = 0;
    for (i = 0; i < dataN; i++)
    {
        printf("%d", h_Data[i] - h_Data_d[i]);
        sum += h_Data[i] - h_Data_d[i];
    }
    printf("the different data Number is: %d\n", sum);
}

Then mytest_kernel.cu:

#ifndef _MYTEST_KERNEL_H_
#define _MYTEST_KERNEL_H_

__global__ void
testKernel(unsigned int *d_Data, unsigned int *d_Result, int dataN)
{
    const int globalTid = blockDim.x * blockIdx.x + threadIdx.x;
    const int tid = threadIdx.x;

    // Each block caches the 256 normalized cumulative-histogram values
    // in shared memory (blockDim.x is always 256 here)
    __shared__ float s_divide[256];
    s_divide[tid] = (float)d_Result[tid] / (float)dataN;
    __syncthreads();

    // Remap each input value through the normalized cumulative histogram,
    // in place; the float result is truncated on assignment
    if (globalTid < dataN)
    {
        unsigned int data = d_Data[globalTid] & 0xff;
        d_Data[globalTid] = s_divide[data] * 255.0f;
    }
    __syncthreads();
}

#endif // _MYTEST_KERNEL_H_

And mytest_gold.cpp:

// export C interface
extern "C"
void computeGold(unsigned int *h_Data, unsigned int *h_Result, int dataN);

void
computeGold(unsigned int *h_Data, unsigned int *h_Result, int dataN)
{
    // Normalize the cumulative histogram
    float h_divide[256];
    for (int i = 0; i < 256; i++)
        h_divide[i] = (float)h_Result[i] / (float)dataN;

    // Remap each input value, in place, truncating the float result
    for (int i = 0; i < dataN; i++)
    {
        unsigned int data = h_Data[i] & 0xff;
        h_Data[i] = h_divide[data] * 255.0f;
    }
}

Finally, I compare the output of the two computing methods and find that the results have some tiny differences.

Is this a bug in CUDA, or have I made a mistake somewhere?

Thanks for any reply.
PuProject_BUG.rar (348 KB)

Tiny differences are okay, since the GPU and CPU handle floating-point numbers differently. The CPU's internal representation of floating-point numbers provides higher precision (80 bits), while the GPU uses only 32 bits of precision.
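For example, a difference of a single ulp in the quotient is enough to change the truncated integer result. Here is a minimal sketch (the values are illustrative, but note that 130560 = 512 × 255, so a quotient like 512/130560 multiplied by 255.0f lands exactly on an integer boundary, where truncation is most sensitive):

#include <stdio.h>
#include <math.h>

int main(void)
{
    // Correctly rounded single-precision quotient (the float nearest 1/255)
    float exact = 512.0f / 130560.0f;
    // The same quotient, one ulp lower -- e.g. what a less precise division might return
    float off = nextafterf(exact, 0.0f);

    printf("%u\n", (unsigned int)(exact * 255.0f));  // prints 1
    printf("%u\n", (unsigned int)(off * 255.0f));    // prints 0
    return 0;
}

So a result that should be exactly k can come out as k-1 if the division is off by even one ulp.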

Yes, tiny differences would be acceptable if they appeared for every data size and type. But if I change the input to another size, the difference disappears, and that is what puzzles me. In this program, if I change the size to any number bigger or smaller than 130560, the difference disappears. And even if there are tiny differences, they should not have such a visible influence on the computed result.

Hi,
I am glad to see someone else having at least similar problems to mine. I have a bit of CUDA code that occasionally computes slightly different results than the CPU when my input size gets larger than some value (4M, I think…). Like hakuna's code, mine works perfectly in the emulator. If someone could find the problem in hakuna's code, that would be great (I tried, but I'm as confused there as I am with my own ;))!

Hmm, that's strange.

Can you please change your code so that sum accumulates absolute differences, i.e. sum += abs(h_Data[i] - h_Data_d[i])? Differences may be both positive and negative, and my guess is that at some point they simply cancel each other out.
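In code, something like this (a sketch of the modified comparison loop from runTest; the casts are added because the buffers are unsigned):

    // Accumulate absolute differences so positive and negative errors cannot cancel
    int sum = 0;
    for (i = 0; i < dataN; i++)
        sum += abs((int)h_Data[i] - (int)h_Data_d[i]);
    printf("the different data Number is: %d\n", sum);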

I checked the divided values and found that they are sometimes correct but sometimes show a tiny difference. In my test program, the GPU result is always smaller by 1 than the CPU result, so the sum is negative; if I change the formula to sum += abs(h_Data[i] - h_Data_d[i]), the result does not change. I have checked my code many times and cannot find what causes the difference. In emudebug mode, the result is correct.
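Next I will try comparing the 256 quotients themselves, bit for bit, before any truncation happens. A sketch (assuming a hypothetical debug buffer d_Divide of 256 floats that the kernel fills with its s_divide values):

    // Copy the GPU-side quotients back and compare raw bit patterns
    float h_divide_gpu[256];
    CUDA_SAFE_CALL( cudaMemcpy(h_divide_gpu, d_Divide, 256 * sizeof(float),
                               cudaMemcpyDeviceToHost) );
    for (int i = 0; i < 256; i++)
    {
        float cpu = (float)h_Result[i] / (float)dataN;   // CPU-side quotient
        unsigned int g, c;
        memcpy(&g, &h_divide_gpu[i], sizeof(g));
        memcpy(&c, &cpu, sizeof(c));
        if (g != c)
            printf("bin %d: gpu=%08x cpu=%08x\n", i, g, c);
    }

If the bit patterns differ, the discrepancy comes from the division itself rather than from the final multiply-and-truncate.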