Problems with doubles on GTX280 Emu works, float works, double on device fails.

Hi all,

I am implementing some basic functions for sparse matrices. In order to have one memory transfer for a CSR formatted matrix I use a struct of pointers.

After getting strange results I narrowed the problem down to copying double data.

The following test case works for me in emulation mode (single and double precision) in single precision on the device but not as it is presented here on the device.

#include <stdio.h> 

#define dim 7

typedef struct {   //[int,int,int,dim*double,dim*int]

	void* memory;	

	int* Int1;	  

	int* Int2;	

	int* Int3;	

	double* dblVector;

	int* intVector;	 

} myStruct;

__host__ void setPointers(myStruct* A)

{

	(*A).Int1 = (int*)(*A).memory;

	(*A).Int2   = (*A).Int1+1;

	(*A).Int3   = (*A).Int2+1;

	(*A).dblVector	= (double*) ((*A).Int3+1);

	(*A).intVector	= (int*) ((*A).dblVector+dim);

}

__global__ void spmv_gpu(int* intVector, double* dblVector, double* dblResult, int* intResult)

{

	dblResult[threadIdx.x] = dblVector[threadIdx.x];

	intResult[threadIdx.x] = intVector[threadIdx.x];

}

int main(int argc,char* argv[])

{

	cudaSetDevice(0);

	int i;

//Create a struct on the device

	myStruct A_d;

	cudaMalloc((void**)&(A_d.memory), sizeof(double)*dim+sizeof(int)*(3+dim));

	setPointers(&A_d);

//Create a struct on the host

	myStruct A_h;

	cudaMallocHost((void**)&(A_h.memory), sizeof(double)*dim+sizeof(int)*(3+dim));

	setPointers(&A_h);

	

//...fill it with dummy data

	*A_h.Int1=11;	*A_h.Int2=21;	*A_h.Int3=31;

	for (i=0; i<dim;i++) A_h.dblVector[i]=sqrt(i);

	for ( i=0; i<dim;i++) A_h.intVector[i]=(i+4)*10+1;

	

//print this data

	for(i = 0;i < dim; i++){

	 	   printf("%d: %e\n",A_h.intVector[i], A_h.dblVector[i]);

	  }

	printf("\n");

//copy data to device	

	cudaMemcpy(A_d.memory, A_h.memory, sizeof(double)*dim+sizeof(int)*(3+dim), cudaMemcpyHostToDevice);

//copy contents of dummy vectors to other areas of memory 			

	double* dblResult_d;

	int* intResult_d;

	cudaMalloc((void**)&dblResult_d, sizeof(double)*dim);

	cudaMalloc((void**)&intResult_d, sizeof(int)*dim);

	spmv_gpu<<<1,dim>>>(A_d.intVector, A_d.dblVector, dblResult_d, intResult_d);

//read the results back

	double dblResult[dim]={0.0};

	int intResult[dim]={0};	

	cudaMemcpy(dblResult, dblResult_d, sizeof(double)*dim, cudaMemcpyDeviceToHost);

	cudaMemcpy(intResult, intResult_d, sizeof(int)*dim, cudaMemcpyDeviceToHost);

	

//print the results

 	for(i = 0;i < dim; i++){

			 printf("%d: %e\n", intResult[i], dblResult[i]);

	  }

	cudaFreeHost(A_h.memory);

	cudaFree(A_d.memory);

	cudaFree(dblResult_d);

	cudaFree(intResult_d);

}

The result on both my ASUS ENGTX280 is

41: 0.000000e+00

51: 1.000000e+00

61: 1.414214e+00

71: 1.732051e+00

81: 2.000000e+00

91: 2.236068e+00

101: 2.449490e+00

41: 1.531604e-322

51: 0.000000e+00

61: 5.308593e+185

71: -4.434596e+194

81: 5.303601e-315

91: -9.458696e-176

101: 3.732320e-212

So the integer data is copied correctly, double is not. Am I doing something wrong?

I’m using driver version 177.73 on OpenSuse 10.2 (kernel 2.6.26.5) and am compiling with -arch sm_13 and -m64.

Thanks in advance for any help.

–Tobias

Edits:

  • PTX looks fine, it loads f64 from the correct address and stores it also to f64.

  • It works if I leave out one of the three integers, let’s say Int2, set the pointer to Int3 new and transfer 4 bytes less to and from the device. Strange?!

did you compile with the -arch sm_13 option?

Of course, I did. :)

I tested now with 2 and 4 integers preceding the double vector and it worked.
In the version with 3 ints, the first double is at 0xc and produces wrong results, the 0x0 and 0x***8 above work fine.

Could it be, that double values are only accessible if their addresses are divisible by 8?

Edit: I checked 1 and 5 preceding ints meanwhile and got errors again. Only even numbers of preceding ints seem to work.

doubles have to be 8-byte aligned, just like in host code.

For Linux it’s 4-bytes only, this is why the emulation worked all the time.

Thanks, the problem will now be easy to solve.