Declaring double precision constants in kernel?

I have a kernel that works with float, but when I convert it to double it does not. The problem seems to be that in a kernel I can write a float numeric constant, but not a double numeric constant; nvcc has a problem with the latter.

In particular, here is a float version:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda.h>

void incrementArrayOnHost(float *a, int N)
{
int i;
for (i=0; i < N; i++) a[i] = a[i] + 1.0f;
}

__global__ void incrementArrayOnDevice(float *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] + 1.0f;
}

int main() {

float *a_h, *b_h;
float *a_d;

int N = 10;
int i, M;

int blockSize=4;
int nBlocks;

M = N * sizeof(float);

/* allocate host arrays */
a_h = (float *) malloc(M);
b_h = (float *) malloc(M);

/* allocate device array */
cudaMalloc((void **) &a_d, M);

/* fill up the host arrays */
for (i=0; i<N; i++) {
	a_h[i] = i + 10.0f; b_h[i] = 0.0f;
}

/* send from the host to the device */
cudaMemcpy(a_d, a_h, M, cudaMemcpyHostToDevice);

incrementArrayOnHost(a_h, N);

nBlocks = N / blockSize;
if (nBlocks * blockSize < N) {
	nBlocks++;
}

incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, N);

/* get the data back from the device */
cudaMemcpy(b_h, a_d, M, cudaMemcpyDeviceToHost);

/* check result */
for (i=0; i<N; i++) { /* assert(a_h[i] == b_h[i]); */
	printf("%g\t%g\t%g\n", a_h[i], b_h[i], a_h[i] - b_h[i]);
}

/* cleanup */
free(a_h);
free(b_h);
cudaFree(a_d);

}

Note that it is a minor modification of an example from Dr. Dobbs. This compiles and runs fine.

Now if we change all the declarations in the whole source from float to double, change the 1.0f to 1.0lf in incrementArrayOnHost, but leave the numeric constant as 1.0f in incrementArrayOnDevice:

void incrementArrayOnHost(double *a, int N)
{
int i;
for (i=0; i < N; i++) a[i] = a[i] + 1.0l;
}

__global__ void incrementArrayOnDevice(double *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] + (double) 1.0f;
}

then the program compiles and runs, but with this output:

11 10 0.999998
12 11 0.999998
13 12 0.999998
14 13 0.999998
15 14 0.999998
16 15 0.999998
17 16 0.999996
18 17 0.999996
19 18 0.999996
20 19 0.999996

which shows that the 1.0f constant is not getting converted to a full double precision 1.0. If we remove the cast to double, the result is the same.

However, if we change the numeric constant to 1.0lf, as I would expect to be correct:

__global__ void incrementArrayOnDevice(double *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] + 1.0l;
}

then this is the result of attempting to compile:

CUDA-1:~/src$ nvcc doublearray.cu -o doublearray
/tmp/tmpxft_00001065_00000000-7_doublearray.cpp3.i(0): ### Compiler Error (user routine '_Z22incrementArrayOnDevicePdi') during Global Optimization -- Offline value numbering phase:
Invalid machine type FQ in Targ_Is_Integral
nvopencc INTERNAL ERROR: /usr/local/cuda/open64/lib//be returned non-zero status 1

I’ve poked around the documentation for a while, looking for examples where double precision constants are declared in kernels, but I haven’t found the answer (otherwise I would not be asking here).

Anyone know what the deal is?

I don’t think “1.0lf” means anything. %lf means something in a printf format string, but a floating point constant with no suffix is already defined to be a double. (I tested with MSVC; it reports a bad suffix in standard C++ as well.)
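For reference, here is a minimal sketch of which suffixes the grammar actually accepts (host-side C, nothing CUDA-specific):

double a = 1.0;        /* no suffix: double */
float b = 1.0f;        /* f or F suffix: float */
long double c = 1.0l;  /* l or L suffix: long double */
/* double d = 1.0lf; */  /* "lf" is not a valid suffix; this line would not compile */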

Also you need to compile with -arch sm_13 and run on a GT200.

Sorry, that is a typo in my text. If you look at the code I am using 1.0l, which should be correct.

If you do try to compile with 1.0lf, you get a sensible error message:

error: expected a ";"

But the problem comes when 1.0l is in the code (as in the code in the post). Then the compiler crashes with the “INTERNAL ERROR”:

/tmp/tmpxft_0000134f_00000000-7_doublearray.cpp3.i(0): ### Compiler Error (user routine '_Z22incrementArrayOnDevicePdi') during Global Optimization -- Offline value numbering phase:
Invalid machine type FQ in Targ_Is_Integral
nvopencc INTERNAL ERROR: /usr/local/cuda/open64/lib//be returned non-zero status 1

When the compiler reports an INTERNAL ERROR, it’s almost always a bug.

Adding the compile flag -arch sm_13 doesn’t change things:

CUDA-1:~/src$ nvcc -arch sm_13 doublearray.cu -o doublearray

/tmp/tmpxft_00001375_00000000-7_doublearray.cpp3.i(0): ### Compiler Error (user routine '_Z22incrementArrayOnDevicePdi') during Global Optimization -- Offline value numbering phase:
Invalid machine type FQ in Targ_Is_Integral
nvopencc INTERNAL ERROR: /usr/local/cuda/open64/lib//be returned non-zero status 1

By the way, I am using a GTX 285, so it can run double precision code, if the compile ever succeeds.

Technically, 1.0L is a long double, which may or may not be any different from the ordinary double 1.0.
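A quick host-side check (just a sketch, nothing CUDA-specific) shows whether long double differs from double on your toolchain:

#include <stdio.h>

int main(void)
{
    /* on typical x86 Linux toolchains long double is the 80-bit extended type */
    printf("sizeof(double)      = %zu\n", sizeof(double));
    printf("sizeof(long double) = %zu\n", sizeof(long double));
    /* 1.0 is exactly representable at every width, so converting back loses nothing */
    printf("(double)1.0L == 1.0 ? %d\n", (double)1.0L == 1.0);
    return 0;
}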

Sounds like a bug either way. Does it work with 1.0?

That compiles, but gives the single precision version of 1.0 (which is not what I want). By the way, I have checked that the CPU code is computing the full double precision values, and it is, so it is the GPU that is not getting a double precision constant.

I tried using (double) 1.0, and this compiles and runs, using double precision on the CPU, but again the GPU uses single precision.

In the case where the compiler crashes (1.0l in the CUDA kernel), the error:

/tmp/tmpxft_00001d4a_00000000-7_doublearray.cpp3.i(0): ### Compiler Error (user routine '_Z22incrementArrayOnDevicePdi') during Global Optimization -- Offline value numbering phase:
Invalid machine type FQ in Targ_Is_Integral
nvopencc INTERNAL ERROR: /usr/local/cuda/open64/lib//be returned non-zero status 1

seems to come pretty far along in compilation. I would have guessed that converting numeric constant strings into values happens in the first pass, if not in the preprocessor. (On the other hand, it has been about 20 years since I last thought hard about how C compilers really work, and they have probably changed enough that I can’t rely on that intuition.) So I’m wondering whether, at an early stage of compilation, nvcc mishandles the literal and sets up the downstream failure: the optimizer can’t figure out what the earlier pass meant.

But so far, all the ways to get the thing to compile and run are ways that result in single precision constants on the device.

Could it be that nvcc takes a wrong turn at an earlier stage, while still parsing: perhaps it notes the ‘L’ and believes this is an integral long?

Anyway, the computational results of double x += 1.f, x += 1.0, and x += 1.L all happen to be identical. Optimizing away excess storage of a constant is not erroneous in my book.
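Here is a minimal host-side sketch of that point; since 1.0 is exactly representable as a float, the promoted constant contributes exactly the same value in all three cases:

#include <stdio.h>

int main(void)
{
    double a = 10.0, b = 10.0, c = 10.0;
    a += 1.f;    /* float constant, promoted to double before the add */
    b += 1.0;    /* plain double constant */
    c += 1.0L;   /* long double constant, result converted back to double */
    /* all three print 11 with no difference in the low bits */
    printf("%.17g %.17g %.17g\n", a, b, c);
    return 0;
}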

It would be pretty bad for the parser to recognize 1.0l as an integer.

It is not a case of optimizing excess storage. It is a case of losing the required precision and returning the wrong answer. The comparison to the CPU computed answers (which I checked are correct) shows the problem.

The constant that results on the GPU is approximately 1.8924e-06, which is a pretty rancid value for 1.0.

I’m pretty sure this comes from having a single precision constant stuck in the lower half of a double precision word: the exponents would be OK (they are 0), and this GPU value is pretty close to 2^(-19) = 1.9073e-06. If memory serves, there should be 53 bits of double precision mantissa, and dropping 32 of them leaves 21. I think the first bit in the single precision mantissa will be 0 since the number is positive, so that would get me to 2^(-20) as the result of sticking a single precision representation of 1.0 in the bottom 32 bits of a double precision representation. So I’m off by a bit somewhere, but that is what I think is happening.

EDIT: Maybe the half words (is that what we call 32 bit pieces on a 64 bit architecture?) get swapped. Is the compiler doing this because I am in a 64 bit environment and it is mistakenly using some 32 bit code for some reason?
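One rough way to check that guess, using nothing beyond the arrays already in the program, would be to dump bit patterns on the host and compare them with whatever the GPU actually added; something like this sketch:

#include <stdio.h>
#include <string.h>

static void dump_double(const char *label, double x)
{
    unsigned long long bits;
    memcpy(&bits, &x, sizeof bits); /* reinterpret the double's bytes as an integer */
    printf("%-22s %.17g  0x%016llx\n", label, x, bits);
}

int main(void)
{
    float f = 1.0f;
    unsigned int fbits;
    memcpy(&fbits, &f, sizeof fbits);
    printf("1.0f bit pattern: 0x%08x\n", fbits); /* 0x3f800000 */

    dump_double("1.0 as double", 1.0);           /* 0x3ff0000000000000 */
    dump_double("2^-19 for comparison", 1.0 / (1 << 19));
    /* in the real program, dump b_h[i] minus the value a_h[i] was filled with,
       to see exactly what constant the kernel added */
    return 0;
}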

The point is that this is not storage optimization; it’s just out-and-out wrong. I don’t know exactly what is wrong. I remember (and my memory is rusty here) that C compilers make all floating point constants double by default. I can see why nvcc might have decided to do something else, but in that case there should be some syntactic sugar for when I want to have a double precision constant in a kernel.

I think a workaround might be to declare the constant on the CPU and pass it to the GPU, but that feels a little silly.
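A minimal sketch of that workaround, just to show the shape of it (the extra inc parameter is my own invention, not part of the Dr. Dobbs example):

__global__ void incrementArrayOnDevice(double *a, int N, double inc)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    /* the increment arrives as a kernel argument, so no double literal is needed in device code */
    if (idx < N) a[idx] = a[idx] + inc;
}

/* ... and at the launch site: */
incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, N, 1.0);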

This is pretty weird. Now I have a version that works and uses 1.0 for both constants. It does require the -arch sm_13 flag, but there’s nothing wrong with that. This is the same source that didn’t work 35 minutes ago (I checked the timestamp on the source).

Well, there is still the issue of the compiler error, but if I can get the result I want, I don’t need to pursue that.