I have a kernel that works with float, but stops working when I convert it to double. The problem seems to be that in a kernel I can declare a float numeric constant, but not a double numeric constant; nvcc chokes on the latter.
In particular, here is a float version:
#include <stdio.h>
#include <assert.h>
#include <cuda.h>
void incrementArrayOnHost(float *a, int N)
{
int i;
for (i=0; i < N; i++) a[i] = a[i] + 1.0f;
}
__global__ void incrementArrayOnDevice(float *a, int N)
{
int idx = blockIdx.x*blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] + 1.0f;
}
int main() {
float *a_h, *b_h;
float *a_d;
int N = 10;
int i, M;
int blockSize=4;
int nBlocks;
M = N * sizeof(float);
/* allocate host arrays */
a_h = (float *) malloc(M);
b_h = (float *) malloc(M);
/* allocate device array */
cudaMalloc((void **) &a_d, M);
/* fill up the host arrays */
for (i=0; i<N; i++) {
a_h[i] = i + 10.0f; b_h[i] = 0.0f;
}
/* send from the host to the device */
cudaMemcpy(a_d, a_h, M, cudaMemcpyHostToDevice);
incrementArrayOnHost(a_h, N);
nBlocks = N / blockSize;
if (nBlocks * blockSize < N) {
nBlocks++;
}
incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, N);
/* get the data back from the device */
cudaMemcpy(b_h, a_d, M, cudaMemcpyDeviceToHost);
/* check result */
for (i=0; i<N; i++) { /* assert(a_h[i] == b_h[i]); */
printf("%g\t%g\t%g\n", a_h[i], b_h[i], a_h[i] - b_h[i]);
}
/* cleanup */
free(a_h);
free(b_h);
cudaFree(a_d);
}
Note that it is a minor modification of an example from Dr. Dobbs. This compiles and runs fine.
Now if we change all the declarations in the whole source from float to double, and change the 1.0f to 1.0l in incrementArrayOnHost, but leave the numeric constant as 1.0f in incrementArrayOnDevice:
void incrementArrayOnHost(double *a, int N)
{
int i;
for (i=0; i < N; i++) a[i] = a[i] + 1.0l;
}
__global__ void incrementArrayOnDevice(double *a, int N)
{
int idx = blockIdx.x*blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] + (double) 1.0f;
}
then the program compiles and runs, but with this output:
11 10 0.999998
12 11 0.999998
13 12 0.999998
14 13 0.999998
15 14 0.999998
16 15 0.999998
17 16 0.999996
18 17 0.999996
19 18 0.999996
20 19 0.999996
which shows that the 1.0f constant is not being promoted to a double 1.0. Removing the explicit cast to double gives the same result.
However if, as I would expect to be correct, we change the numeric constant to 1.0l:
__global__ void incrementArrayOnDevice(double *a, int N)
{
int idx = blockIdx.x*blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] + 1.0l;
}
then this is the result of attempting to compile:
CUDA-1:~/src$ nvcc doublearray.cu -o doublearray
/tmp/tmpxft_00001065_00000000-7_doublearray.cpp3.i(0): ### Compiler Error (user routine ‘_Z22incrementArrayOnDevicePdi’) during Global Optimization – Offline value numbering phase:
Invalid machine type FQ in Targ_Is_Integral
nvopencc INTERNAL ERROR: /usr/local/cuda/open64/lib//be returned non-zero status 1
I've poked around the documentation for a while, looking for any example that declares a double-precision constant in a kernel, but I haven't found an answer (otherwise I would not be asking here).
Anyone know what the deal is?