Code works with floats but not doubles

Hello guys,

I ran into something strange and was wondering if anyone could offer an explanation. I wrote a simple program that initializes an array in host memory, A = [1,2,3,4,5,6,7,8,9,10], and then calls a kernel on the device to double every element, giving A = [2,4,6,8,10,12,14,16,18,20].

Now, if all of the variables, pointers, and arrays are of type double, the code seems to run, but the kernel changes nothing. If everything is of type float, on the other hand, the program works correctly:

The code is:

[codebox]#include <stdio.h>
#include "cuda/cuda.h"

#define BLOCK_SIZE 256
#define SIZE 10

__global__ void ones(float *BxyzG);

int main()
{
    float *Bxyz, BxyzA[10];
    float *BxyzG;
    int i = 0;

    for (i = 0; i < SIZE; i++)
    {
        BxyzA[i] = (float)(i + 1);
    }

    /* Note: this buffer is leaked, since Bxyz is immediately
       repointed at the stack array BxyzA on the next line. */
    Bxyz = (float *) malloc(SIZE * sizeof(float));
    Bxyz = BxyzA;

    for (i = 0; i < SIZE; i++)
    {
        printf(" Dereferencing Host Pointer: Bxyz[%d] = %f \n", i, *(Bxyz + i));
    }

    cudaSetDevice(0);
    cudaMalloc((void **)&BxyzG, sizeof(float) * SIZE);
    cudaMemcpy(BxyzG, Bxyz, sizeof(float) * SIZE, cudaMemcpyHostToDevice);

    dim3 dimBlock(BLOCK_SIZE);
    /* Ceiling division: one extra block when SIZE is not a multiple of BLOCK_SIZE */
    dim3 dimGrid((SIZE / dimBlock.x) + (!(SIZE % dimBlock.x) ? 0 : 1));
    printf("dimBlock.x: %d dimGrid.x: %d\n", dimBlock.x, dimGrid.x);

    ones<<<dimGrid, dimBlock>>>(BxyzG);

    cudaMemcpy(Bxyz, BxyzG, sizeof(float) * SIZE, cudaMemcpyDeviceToHost);

    for (i = 0; i < SIZE; i++)
    {
        printf("Bxyz[%d] = %f ", i, Bxyz[i]);
    }

    return 0;
}

__global__ void ones(float *BxyzG)
{
    int i = threadIdx.x;
    if (i < SIZE)
    {
        BxyzG[i] *= 2;
    }
}[/codebox]
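One thing that would make the failure visible instead of silent: kernel launches are asynchronous and return no status themselves, so the runtime's error state has to be queried explicitly. The following is a minimal sketch (my own, not from the original post) of the same experiment with error checks added; on a build without double support I'd expect the error query to report something like "invalid device function" rather than leaving the data quietly unchanged:

```cuda
#include <stdio.h>
#include "cuda/cuda.h"

#define SIZE 10

__global__ void twice(double *a)
{
    int i = threadIdx.x;
    if (i < SIZE) a[i] *= 2;
}

int main()
{
    double h[SIZE], *d;
    int i;
    for (i = 0; i < SIZE; i++) h[i] = i + 1;

    cudaMalloc((void **)&d, SIZE * sizeof(double));
    cudaMemcpy(d, h, SIZE * sizeof(double), cudaMemcpyHostToDevice);

    twice<<<1, 256>>>(d);

    /* The launch returns no status; ask the runtime for it. */
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch error: %s\n", cudaGetErrorString(err));

    /* Wait for the kernel so execution errors are also reported. */
    err = cudaThreadSynchronize();
    if (err != cudaSuccess)
        printf("execution error: %s\n", cudaGetErrorString(err));

    cudaMemcpy(h, d, SIZE * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return 0;
}
```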

RESULT WHEN THERE ARE DOUBLES:

[codebox][brose@ucla-grendel Desktop]$ ./a.out
 Dereferencing Host Pointer: Bxyz[0] = 1.000000
 Dereferencing Host Pointer: Bxyz[1] = 2.000000
 Dereferencing Host Pointer: Bxyz[2] = 3.000000
 Dereferencing Host Pointer: Bxyz[3] = 4.000000
 Dereferencing Host Pointer: Bxyz[4] = 5.000000
 Dereferencing Host Pointer: Bxyz[5] = 6.000000
 Dereferencing Host Pointer: Bxyz[6] = 7.000000
 Dereferencing Host Pointer: Bxyz[7] = 8.000000
 Dereferencing Host Pointer: Bxyz[8] = 9.000000
 Dereferencing Host Pointer: Bxyz[9] = 10.000000
dimBlock.x: 256 dimGrid.x: 1
Bxyz[0] = 1.000000 Bxyz[1] = 2.000000 Bxyz[2] = 3.000000 Bxyz[3] = 4.000000 Bxyz[4] = 5.000000 Bxyz[5] = 6.000000 Bxyz[6] = 7.000000 Bxyz[7] = 8.000000 Bxyz[8] = 9.000000 Bxyz[9] = 10.000000 [/codebox]

THIS IS WRONG: the array comes back unchanged. But when I literally use "replace with" to change every double to float:

[codebox] Dereferencing Host Pointer: Bxyz[0] = 1.000000
 Dereferencing Host Pointer: Bxyz[1] = 2.000000
 Dereferencing Host Pointer: Bxyz[2] = 3.000000
 Dereferencing Host Pointer: Bxyz[3] = 4.000000
 Dereferencing Host Pointer: Bxyz[4] = 5.000000
 Dereferencing Host Pointer: Bxyz[5] = 6.000000
 Dereferencing Host Pointer: Bxyz[6] = 7.000000
 Dereferencing Host Pointer: Bxyz[7] = 8.000000
 Dereferencing Host Pointer: Bxyz[8] = 9.000000
 Dereferencing Host Pointer: Bxyz[9] = 10.000000
dimBlock.x: 256 dimGrid.x: 1
Bxyz[0] = 2.000000 Bxyz[1] = 4.000000 Bxyz[2] = 6.000000 Bxyz[3] = 8.000000 Bxyz[4] = 10.000000 Bxyz[5] = 12.000000 Bxyz[6] = 14.000000 Bxyz[7] = 16.000000 Bxyz[8] = 18.000000 Bxyz[9] = 20.000000 [/codebox]

This works and I have no idea why. I am using Tesla C870.

Tesla C870 has compute capability 1.0 and does not support doubles.

N.
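You can confirm this at runtime with cudaGetDeviceProperties. A quick sketch (hardware double-precision support first appeared in compute capability 1.3, which the major/minor check below encodes):

```cuda
#include <stdio.h>
#include "cuda/cuda.h"

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  /* query device 0 */

    /* Doubles require compute capability 1.3 or higher */
    int has_double = (prop.major > 1) ||
                     (prop.major == 1 && prop.minor >= 3);

    printf("%s: compute %d.%d, double precision %s\n",
           prop.name, prop.major, prop.minor,
           has_double ? "supported" : "NOT supported");
    return 0;
}
```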

Even if that's the case, the doubles should simply be demoted to floats and execution should continue as expected, according to the documentation. I'm seeing this behavior on my cluster node as well, which has Tesla T10 processors. That card definitely has compute capability 1.3, so I really doubt the problem is that simple.

[codebox]There are 4 devices supporting CUDA

Device 0: "Tesla T10 Processor"
  Major revision number: 1
  Minor revision number: 3
  Total amount of global memory: 4294705152 bytes
  Number of multiprocessors: 30
  Number of cores: 240
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 16384 bytes
  Total number of registers available per block: 16384
  Warp size: 32
  Maximum number of threads per block: 512
  Maximum sizes of each dimension of a block: 512 x 512 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
  Maximum memory pitch: 262144 bytes
  Texture alignment: 256 bytes
  Clock rate: 1.44 GHz
  Concurrent copy and execution: Yes

Devices 1-3: "Tesla T10 Processor" [identical properties snipped]

Test PASSED
[/codebox]

I know that double literals are demoted to floats (e.g. 1.0 becomes 1.0f), but I'm not sure that this applies to arrays, where the effective stride between elements is 8 bytes instead of 4.
Other than that, I don't see anything wrong with the code, so I'd expect it to run fine on the Tesla T10 series. I'll check it on my GTX 285 tomorrow.

Did you compile it with “-arch=sm_13”?

N.

Adding the arch option solves the error for me on the cluster node.
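For reference, the two builds would look something like this (the .cu file name is a placeholder; without the flag, nvcc targets sm_10 by default and demotes doubles to float):

```shell
# Default target is sm_10: double-precision code is demoted to float
nvcc double_test.cu -o a.out

# Target compute capability 1.3 so real double-precision code is generated
nvcc -arch=sm_13 double_test.cu -o a.out
```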