I have a kernel that works with float, but stops working when I convert it to double. The problem seems to be that in a kernel I can declare a float numeric constant, but not a double numeric constant; nvcc chokes on the latter.
In particular, here is a float version:
#include <stdio.h>
#include <assert.h>
#include <cuda.h>
void incrementArrayOnHost(float *a, int N)
{
int i;
for (i=0; i < N; i++) a[i] = a[i] + 1.0f;
}
__global__ void incrementArrayOnDevice(float *a, int N)
{
int idx = blockIdx.x*blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] + 1.0f;
}
int main() {
float *a_h, *b_h;
float *a_d;
int N = 10;
int i, M;
int blockSize=4;
int nBlocks;
M = N * sizeof(float);
/* allocate host arrays */
a_h = (float *) malloc(M);
b_h = (float *) malloc(M);
/* allocate device array */
cudaMalloc((void **) &a_d, M);
/* fill up the host arrays */
for (i=0; i<N; i++) {
a_h[i] = i + 10.0f; b_h[i] = 0.0f;
}
/* send from the host to the device */
cudaMemcpy(a_d, a_h, M, cudaMemcpyHostToDevice);
incrementArrayOnHost(a_h, N);
nBlocks = N / blockSize;
if (nBlocks * blockSize < N) {
nBlocks++;
}
incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, N);
/* get the data back from the device */
cudaMemcpy(b_h, a_d, M, cudaMemcpyDeviceToHost);
/* check result */
for (i=0; i<N; i++) { /* assert(a_h[i] == b_h[i]); */
printf("%g\t%g\t%g\n", a_h[i], b_h[i], a_h[i] - b_h[i]);
}
/* cleanup */
free(a_h);
free(b_h);
cudaFree(a_d);
}
Note that it is a minor modification of an example from Dr. Dobbs. This compiles and runs fine.
Now if we change all the declarations in the whole source from float to double, and change the 1.0f to 1.0l in incrementArrayOnHost, but leave the numeric constant as 1.0f in incrementArrayOnDevice:
void incrementArrayOnHost(double *a, int N)
{
int i;
for (i=0; i < N; i++) a[i] = a[i] + 1.0l;
}
__global__ void incrementArrayOnDevice(double *a, int N)
{
int idx = blockIdx.x*blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] + (double) 1.0f;
}
then the program compiles and runs, but with this output:
11 10 0.999998
12 11 0.999998
13 12 0.999998
14 13 0.999998
15 14 0.999998
16 15 0.999998
17 16 0.999996
18 17 0.999996
19 18 0.999996
20 19 0.999996
which shows that the 1.0f constant is not being promoted to a double 1.0. Removing the explicit cast to double gives the same result.
However if, as I would expect to be correct, we change the numeric constant to 1.0l:
__global__ void incrementArrayOnDevice(double *a, int N)
{
int idx = blockIdx.x*blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] + 1.0l;
}
then this is the result of attempting to compile:
CUDA-1:~/src$ nvcc doublearray.cu -o doublearray
/tmp/tmpxft_00001065_00000000-7_doublearray.cpp3.i(0): ### Compiler Error (user routine ‘_Z22incrementArrayOnDevicePdi’) during Global Optimization – Offline value numbering phase:
Invalid machine type FQ in Targ_Is_Integral
nvopencc INTERNAL ERROR: /usr/local/cuda/open64/lib//be returned non-zero status 1
I've poked around the documentation for a while, looking for any example that declares a double-precision constant in a kernel, but I haven't found an answer (otherwise I would not be asking here).
Anyone know what the deal is?