Code works with floats but not doubles

Hello guys,

I ran into something strange and was wondering if anyone could offer an explanation. I wrote a simple program that initializes an array in host memory, A = [1,2,3,4,5,6,7,8,9,10], and then calls a kernel on the device to double every element, giving A = [2,4,6,8,10,12,14,16,18,20].

Now, if all of the variables, pointers, and arrays are of type double, the code seems to run, but the kernel changes nothing. If everything is of type float, on the other hand, the program works correctly:

The code is:

[codebox]#include <stdio.h>
#include "cuda/cuda.h"

#define BLOCK_SIZE 256
#define SIZE 10

__global__ void ones(float *BxyzG);

int main()
{
    float *Bxyz, BxyzA[10];
    float *BxyzG;
    int i = 0;

    for (i = 0; i < SIZE; i++)
    {
        BxyzA[i] = (float)(i + 1);
    }

    /* Note: this buffer is leaked, since Bxyz is immediately
       repointed at the stack array BxyzA on the next line. */
    Bxyz = (float *) malloc(SIZE * sizeof(float));
    Bxyz = BxyzA;

    for (i = 0; i < SIZE; i++)
    {
        printf(" Dereferencing Host Pointer: Bxyz[%d] = %f \n", i, *(Bxyz + i));
    }

    cudaSetDevice(0);
    cudaMalloc((void **)&BxyzG, sizeof(float) * SIZE);
    cudaMemcpy(BxyzG, Bxyz, sizeof(float) * SIZE, cudaMemcpyHostToDevice);

    dim3 dimBlock(BLOCK_SIZE);
    /* Ceiling division: one extra block when SIZE is not a multiple of BLOCK_SIZE */
    dim3 dimGrid((SIZE / dimBlock.x) + (!(SIZE % dimBlock.x) ? 0 : 1));
    printf("dimBlock.x: %d dimGrid.x: %d\n", dimBlock.x, dimGrid.x);

    ones<<<dimGrid, dimBlock>>>(BxyzG);

    cudaMemcpy(Bxyz, BxyzG, sizeof(float) * SIZE, cudaMemcpyDeviceToHost);

    for (i = 0; i < SIZE; i++)
    {
        printf("Bxyz[%d] = %f ", i, Bxyz[i]);
    }

    return 0;
}

__global__ void ones(float *BxyzG)
{
    int i = threadIdx.x;
    if (i < SIZE)
    {
        BxyzG[i] *= 2;
    }
}[/codebox]
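One thing that would make the failure visible instead of silent: kernel launches are asynchronous and return no status themselves, so the runtime's error state has to be queried explicitly. The following is a minimal sketch (my own, not from the original post) of the same experiment with error checks added; on a build without double support I'd expect the error query to report something like "invalid device function" rather than leaving the data quietly unchanged:

```cuda
#include <stdio.h>
#include "cuda/cuda.h"

#define SIZE 10

__global__ void twice(double *a)
{
    int i = threadIdx.x;
    if (i < SIZE) a[i] *= 2;
}

int main()
{
    double h[SIZE], *d;
    int i;
    for (i = 0; i < SIZE; i++) h[i] = i + 1;

    cudaMalloc((void **)&d, SIZE * sizeof(double));
    cudaMemcpy(d, h, SIZE * sizeof(double), cudaMemcpyHostToDevice);

    twice<<<1, 256>>>(d);

    /* The launch returns no status; ask the runtime for it. */
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch error: %s\n", cudaGetErrorString(err));

    /* Wait for the kernel so execution errors are also reported. */
    err = cudaThreadSynchronize();
    if (err != cudaSuccess)
        printf("execution error: %s\n", cudaGetErrorString(err));

    cudaMemcpy(h, d, SIZE * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return 0;
}
```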

RESULT WHEN THERE ARE DOUBLES:

[codebox][brose@ucla-grendel Desktop]$ ./a.out
 Dereferencing Host Pointer: Bxyz[0] = 1.000000
 Dereferencing Host Pointer: Bxyz[1] = 2.000000
 Dereferencing Host Pointer: Bxyz[2] = 3.000000
 Dereferencing Host Pointer: Bxyz[3] = 4.000000
 Dereferencing Host Pointer: Bxyz[4] = 5.000000
 Dereferencing Host Pointer: Bxyz[5] = 6.000000
 Dereferencing Host Pointer: Bxyz[6] = 7.000000
 Dereferencing Host Pointer: Bxyz[7] = 8.000000
 Dereferencing Host Pointer: Bxyz[8] = 9.000000
 Dereferencing Host Pointer: Bxyz[9] = 10.000000
dimBlock.x: 256 dimGrid.x: 1
Bxyz[0] = 1.000000 Bxyz[1] = 2.000000 Bxyz[2] = 3.000000 Bxyz[3] = 4.000000 Bxyz[4] = 5.000000 Bxyz[5] = 6.000000 Bxyz[6] = 7.000000 Bxyz[7] = 8.000000 Bxyz[8] = 9.000000 Bxyz[9] = 10.000000 [/codebox]

THIS IS WRONG: the array comes back unchanged. But when I literally use "replace with" to change every double to float:

[codebox] Dereferencing Host Pointer: Bxyz[0] = 1.000000
 Dereferencing Host Pointer: Bxyz[1] = 2.000000
 Dereferencing Host Pointer: Bxyz[2] = 3.000000
 Dereferencing Host Pointer: Bxyz[3] = 4.000000
 Dereferencing Host Pointer: Bxyz[4] = 5.000000
 Dereferencing Host Pointer: Bxyz[5] = 6.000000
 Dereferencing Host Pointer: Bxyz[6] = 7.000000
 Dereferencing Host Pointer: Bxyz[7] = 8.000000
 Dereferencing Host Pointer: Bxyz[8] = 9.000000
 Dereferencing Host Pointer: Bxyz[9] = 10.000000
dimBlock.x: 256 dimGrid.x: 1
Bxyz[0] = 2.000000 Bxyz[1] = 4.000000 Bxyz[2] = 6.000000 Bxyz[3] = 8.000000 Bxyz[4] = 10.000000 Bxyz[5] = 12.000000 Bxyz[6] = 14.000000 Bxyz[7] = 16.000000 Bxyz[8] = 18.000000 Bxyz[9] = 20.000000 [/codebox]

This works and I have no idea why. I am using Tesla C870.

Tesla C870 has compute capability 1.0 and does not support doubles.

N.
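You can confirm this at runtime with cudaGetDeviceProperties. A quick sketch (hardware double-precision support first appeared in compute capability 1.3, which the major/minor check below encodes):

```cuda
#include <stdio.h>
#include "cuda/cuda.h"

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  /* query device 0 */

    /* Doubles require compute capability 1.3 or higher */
    int has_double = (prop.major > 1) ||
                     (prop.major == 1 && prop.minor >= 3);

    printf("%s: compute %d.%d, double precision %s\n",
           prop.name, prop.major, prop.minor,
           has_double ? "supported" : "NOT supported");
    return 0;
}
```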

Even if that's the case, the doubles should simply be demoted to floats and execution should continue as expected, according to the documentation. I'm seeing this behavior on my cluster node as well, which has Tesla T10 processors. That card definitely has compute capability 1.3, so I really doubt the problem is that simple.

[codebox]There are 4 devices supporting CUDA

Device 0: "Tesla T10 Processor"
  Major revision number: 1
  Minor revision number: 3
  Total amount of global memory: 4294705152 bytes
  Number of multiprocessors: 30
  Number of cores: 240
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 16384 bytes
  Total number of registers available per block: 16384
  Warp size: 32
  Maximum number of threads per block: 512
  Maximum sizes of each dimension of a block: 512 x 512 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
  Maximum memory pitch: 262144 bytes
  Texture alignment: 256 bytes
  Clock rate: 1.44 GHz
  Concurrent copy and execution: Yes

Devices 1-3: "Tesla T10 Processor" [identical properties snipped]

Test PASSED
[/codebox]

I know that double literals are demoted to floats (e.g. 1.0 becomes 1.0f), but I'm not sure that this applies to arrays, where the effective stride between elements is 8 bytes instead of 4.
Other than that, I don't see anything wrong with the code, so I'd expect it to run fine on the Tesla T10 series. I'll check it on my GTX 285 tomorrow.

Did you compile it with “-arch=sm_13”?

N.

Adding the arch option solves the error for me on the cluster node.
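For reference, the two builds would look something like this (the .cu file name is a placeholder; without the flag, nvcc targets sm_10 by default and demotes doubles to float):

```shell
# Default target is sm_10: double-precision code is demoted to float
nvcc double_test.cu -o a.out

# Target compute capability 1.3 so real double-precision code is generated
nvcc -arch=sm_13 double_test.cu -o a.out
```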