Calculation Rounding

Hi, I’m very new to CUDA and have a relatively simple question that I can’t seem to figure out.

In my kernel function I’m calculating something like this:

dataIndex = ((512*512)*zCoord) + (512*yCoord) + xCoord;

where xCoord, yCoord and zCoord range from, say, 50 to 400.

When I look at the values this calculation produces on the device, they are truncated and are not exactly the same as when the calculation is done on the CPU, e.g.

96053975 on the CPU compared to 9.6054e+007 on the GPU.

Could anyone tell me why this is?

I am working on an 8800 GTS (G80 architecture) card. Any help would be much appreciated.
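For reference, here is a minimal host-side sketch of the index calculation in plain integer arithmetic. The helper name flat_index is made up for illustration, and so is the assumption that all three coordinates stay below 512 (the post only fixes the x and y strides). It is plain C++ rather than device code, so it can be compiled without a GPU.

```cpp
#include <cassert>

// Hypothetical helper (not from the original post): flattens (x, y, z)
// coordinates into a linear index with an x extent and y extent of 512,
// using only integer arithmetic.
inline int flat_index(int x, int y, int z)
{
    const int dim = 512;
    // Assumption for this sketch: every coordinate lies in [0, 512).
    assert(x >= 0 && x < dim);
    assert(y >= 0 && y < dim);
    assert(z >= 0 && z < dim);
    return (dim * dim) * z + dim * y + x;
}
```

Under these assumptions, flat_index(215, 213, 366) evaluates to 96053975. That triple is just one decomposition that reproduces the CPU value quoted in the question. The largest possible index, flat_index(511, 511, 511), is 134217727, which still fits easily in a 32-bit int.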

These variables are all integers, right? I’m confused why your answer on the GPU is reported as a float.

Show us the full code. The declarations of the variables involved are likely to be particularly important. My first guess is that dataIndex is a float variable, and we’re seeing rounding. Possibly even rounding by the print statement, rather than anything numerical.
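To illustrate the print-statement guess: a float printed with a general format such as %g keeps only six significant digits by default. That alone is enough to turn an eight-digit value into something like 9.6054e+007. The sketch below is host-side C++. The helper names format_general and format_fixed are made up for this example, and the exponent width (e+07 versus e+007) depends on the C runtime.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: formats a float the way "%g" does, i.e. with
// six significant digits by default.
inline std::string format_general(float value)
{
    char buf[64];
    int n = std::snprintf(buf, sizeof buf, "%g", value);
    assert(n > 0 && n < static_cast<int>(sizeof buf));
    return std::string(buf, n);
}

// Hypothetical helper: formats the same float with every integer digit shown.
inline std::string format_fixed(float value)
{
    char buf[64];
    int n = std::snprintf(buf, sizeof buf, "%.0f", value);
    assert(n > 0 && n < static_cast<int>(sizeof buf));
    return std::string(buf, n);
}
```

Here format_fixed(96053976.0f) gives 96053976, while format_general of the same value gives 9.6054e+07 (9.6054e+007 with older Microsoft runtimes). Printing with %d for ints or %.0f for floats is a quick way to rule out the formatting.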

Yeah, basically all the original values were floats but are cast to ints, so the values of xCoord, yCoord and zCoord are now ints.

I checked the values of the variables after they were cast to ints, and they are reported as ints.

However, the array that the dataIndex value is passed into is a float array, and I’m starting to think this may be the problem. But why would that be the case, if the values come back complete when I do exactly the same thing on the CPU?

If this expression is somehow getting cast to a float by the compiler, then the problem is that on the GPU it will be done in single precision, which only has 24 bits of mantissa. Your final value exceeds what 24 bits can represent exactly (2^24 = 16777216), so you will start to lose digits. On the CPU, it will most likely use double precision, which has 53 bits of mantissa. Integers have 31 bits of mantissa (in a manner of speaking), which is plenty for this calculation.
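The precision argument can be checked on the host. The sketch below (C++, with hypothetical function names) pushes an integer through a float and through a double and returns what comes back. It assumes IEEE 754 single and double precision, which the static_asserts verify at compile time.

```cpp
#include <cassert>
#include <limits>

static_assert(std::numeric_limits<float>::digits == 24,
              "this sketch expects a 24-bit float significand");
static_assert(std::numeric_limits<double>::digits == 53,
              "this sketch expects a 53-bit double significand");

// Hypothetical helper: round-trips an int through single precision.
inline long long through_float(int value)
{
    float f = static_cast<float>(value);   // rounds to the nearest float
    return static_cast<long long>(f);
}

// Hypothetical helper: round-trips an int through double precision.
inline long long through_double(int value)
{
    double d = static_cast<double>(value); // exact for every 32-bit int
    return static_cast<long long>(d);
}

// Hypothetical self-check using the value from the question.
inline void check_round_trips()
{
    assert(through_float(16777216) == 16777216);  // 2^24 is still exact
    assert(through_float(96053975) == 96053976);  // nearest float is 1 away
    assert(through_double(96053975) == 96053975); // double keeps every digit
}
```

Above 2^24, neighbouring floats are more than 1 apart (8 apart near 96 million), so 96053975 comes back as 96053976, while the double round trip is exact.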

So, yes, we need a little more context to understand why the compiler is computing your index expression with floating point, rather than integers.

The code is

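__global__ void calculateInitialDensity(float *v, float *initialD, float *densities, float dx, float dy, float dz, int as)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int index = idy * as + idx;

    int xCoord; // data x index coord
    int yCoord; // data y index coord
    int zCoord; // data z index coord

    float currX; // node X value
    float currY; // node Y value
    float currZ; // node Z value

    int dataIndex = 0;
    int check = index % 3;

    // Only the first thread of each (x, y, z) triple does the lookup.
    // The index + 2 < as test keeps threads past the end of v from
    // reading out of bounds (as is the number of floats in v).
    if (check == 0 && index + 2 < as) {
        // 0.5f keeps the arithmetic in single precision.
        currX = (v[index] / dx) + 0.5f;
        currY = (v[index + 1] / dy) + 0.5f;
        currZ = (v[index + 2] / dz) + 0.5f;

        xCoord = (int)currX;
        yCoord = (int)currY;
        zCoord = (int)currZ;

        // All operands are ints, so this is integer arithmetic.
        dataIndex = ((512 * 512) * zCoord) + (512 * yCoord) + xCoord;

        initialD[index / 3] = densities[dataIndex];
    }
}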

// Returns initD, which is a float*, so the return type is float* rather than int*.
extern "C" float *initialDensityCalc(float *vArray, float *densities, float *initD, int vSize, int dSize, int nodes, float dx, float dy, float dz)
{
    float *verticesArray_device;
    float *initialDensityArray_device;
    float *densities_device;

    // Allocate device buffers.
    CUDA_SAFE_CALL(cudaMalloc((void**)&verticesArray_device, vSize * sizeof(float)));
    CUDA_SAFE_CALL(cudaMalloc((void**)&initialDensityArray_device, nodes * sizeof(float)));
    CUDA_SAFE_CALL(cudaMalloc((void**)&densities_device, dSize * sizeof(float)));

    // Copy the inputs to the device.
    CUDA_SAFE_CALL(cudaMemcpy(verticesArray_device, vArray, vSize * sizeof(float), cudaMemcpyHostToDevice));
    CUDA_SAFE_CALL(cudaMemcpy(densities_device, densities, dSize * sizeof(float), cudaMemcpyHostToDevice));

    printf("before block size\n");

    // One thread per float in vArray, rounded up to whole blocks.
    int N = vSize;
    int block_size = 256;
    int n_blocks = N / block_size + (N % block_size == 0 ? 0 : 1);

    printf("calling kernel, number of blocks is %d\n", n_blocks);

    calculateInitialDensity<<<n_blocks, block_size>>>(verticesArray_device, initialDensityArray_device, densities_device, dx, dy, dz, vSize);

    CUDA_SAFE_CALL(cudaThreadSynchronize());
    CUT_CHECK_ERROR("Kernel execution failed\n");
    checkCUDAError1("kernel invocation");

    printf("this is after kernel call\n");

    // Copy the result back to the host.
    CUDA_SAFE_CALL(cudaMemcpy(initD, initialDensityArray_device, sizeof(float) * nodes, cudaMemcpyDeviceToHost));
    checkCUDAError1("memcpy");

    // Free all three device buffers (densities_device was never freed before).
    CUDA_SAFE_CALL(cudaFree(initialDensityArray_device));
    CUDA_SAFE_CALL(cudaFree(verticesArray_device));
    CUDA_SAFE_CALL(cudaFree(densities_device));

    printf("returning something\n");

    return initD;
}