Support for long double data type in GPU code

I was curious whether the long double data type (128 bit float) was supported on my GPU, RTX 4070 with CUDA 12.5.
My understanding from some reading is that it’s supported only for host-side compatibility, such that a long double is allocated as 16 bytes but still treated as a double (8 bytes).
But I’m noticing something odd.

I declared an array of long double in GPU code as
long double xx[2] = {1.0L / 10.0L, 2.0L / 10.0L};

I stepped through the GPU code with cuda-gdb to just past the declaration of the variable, then displayed the addresses and contents of the array elements. I got this output from cuda-gdb:

99          long double xx[2] = { 1.0L / 10.0L, 2.0L / 10.0L };
(cuda-gdb) 
100         int x = idx % IMAGE_WIDTH;
(cuda-gdb) whatis xx
type = @local double [2]
(cuda-gdb) p &xx[0]
$1 = (@local double *) 0xfffcd0
(cuda-gdb) p &xx[1]
$2 = (@local double *) 0xfffce0
(cuda-gdb) x/8xw xx
0xfffcd0:       0x9999999a      0x3fb99999      0x9999999a      0x3fc99999
0xfffce0:       0x00000000      0x00000000      0x00000010      0x00000000

It looks like cuda-gdb’s interpretation of the debug symbols places array elements at 16-byte boundaries, matching host code, but the memory contents show they are actually stored as two contiguous 8-byte elements, not two contiguous 16-byte elements.
Am I missing something?
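
For reference, one way to cross-check this without relying on cuda-gdb’s interpretation of the debug symbols would be to print the size and the element addresses from inside the kernel itself. This is just a sketch of such a check (the probe kernel below is my own illustration, not part of the debugging session above):

#include <cstdio>

__global__ void probe()
{
  long double xx[2] = {1.0L / 10.0L, 2.0L / 10.0L};
  // Report what the compiled device code actually uses for the size and
  // the spacing of the elements, independent of the debugger's symbols.
  printf("device sizeof(long double) = %llu\n",
         (unsigned long long)sizeof(long double));
  printf("device &xx[0] = %p, &xx[1] = %p\n", (void *)&xx[0], (void *)&xx[1]);
}

int main()
{
  printf("host sizeof(long double) = %llu\n",
         (unsigned long long)sizeof(long double));
  probe<<<1, 1>>>();
  cudaDeviceSynchronize();
  return 0;
}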

The use of the long double type is not supported in device code.

Note that the C++ standard does not require that long double be mapped to a 128-bit floating-point type:

the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined.

Example mappings used for long double are (a quick way to check which one applies is sketched after this list):
(1) long double maps to double
(2) long double maps to an 80-bit IEEE-754 double extended precision format
(3) long double maps to a double-double (pair-precision) format
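
To see which of these mappings a given host toolchain uses, one can check the size and mantissa width it reports. This is a minimal host-only sketch, with the common results noted in the comments:

#include <cstdio>
#include <cfloat>

int main()
{
  // Typical results:
  //   sizeof 16, LDBL_MANT_DIG 64  -> 80-bit x87 extended, padded to 16 bytes
  //   sizeof  8, LDBL_MANT_DIG 53  -> long double is plain double
  //   sizeof 16, LDBL_MANT_DIG 113 -> IEEE-754 binary128
  //   sizeof 16, LDBL_MANT_DIG 106 -> double-double (pair precision)
  printf("sizeof(long double) = %zu, LDBL_MANT_DIG = %d\n",
         sizeof(long double), LDBL_MANT_DIG);
  return 0;
}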

GPUs have no hardware support for IEEE-754 binary128 format and operations thereon, so any potential support would have to be provided by a software library. If desired, you could file an enhancement request with NVIDIA via the bug reporting mechanism.
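
To give a rough idea of what the software route looks like, below is a minimal double-double (pair-precision) addition in device code, built on the classic error-free two-sum transform. It is only a sketch of the technique; real double-double libraries handle splitting, multiplication, and renormalization much more carefully:

#include <cstdio>

struct dd { double hi, lo; };  // value = hi + lo, with |lo| much smaller than |hi|

// Error-free transform: s + e equals a + b exactly (Knuth's two-sum).
__device__ void two_sum(double a, double b, double &s, double &e)
{
  s = a + b;
  double bb = s - a;
  e = (a - (s - bb)) + (b - bb);
}

// Simple double-double addition: roughly 106 bits of effective precision.
__device__ dd dd_add(dd x, dd y)
{
  double s, e;
  two_sum(x.hi, y.hi, s, e);
  e += x.lo + y.lo;
  double hi = s + e;            // renormalize so the low part stays small
  double lo = e - (hi - s);
  return {hi, lo};
}

__global__ void demo()
{
  dd a = {1.0 / 10.0, 0.0};     // crude dd values; a real library would also
  dd b = {2.0 / 10.0, 0.0};     // convert the decimal constants exactly
  dd c = dd_add(a, b);
  printf("hi = %.17g, lo = %.17g\n", c.hi, c.lo);
}

int main()
{
  demo<<<1, 1>>>();
  cudaDeviceSynchronize();
  return 0;
}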

I had concluded long double wasn’t actually a 128-bit float, based on some reading and on not seeing any significant change in runtime between double and long double. I understand that if I’m doing something unsupported, the system can do whatever it wants.
However, it seems odd that cuda-gdb has two different ideas about the mapping of something declared long double, so I thought that could be a bug.

Leaving cuda-gdb out of it: since there is no support for long double in CUDA device code, if you use it in device code anyway, I would suggest expecting UB. All bets are off.

What cuda-gdb does in that case, I don’t know. But the folks on the cuda-gdb forum may know. However, my guess would be that cuda-gdb is behaving in a sensible fashion, and what you are witnessing is some combination of the fact that long double is not supported in device code (as stated above) and the warning you get when you try to compile this test case:

$ cat t3.cu
#include <cstdio>
__global__ void k(long double *d){

  long double xx[2] = {1.0L / 10.0L, 2.0L / 10.0L};
  unsigned char *lxx = reinterpret_cast<unsigned char *>(xx);
  unsigned char *ld = reinterpret_cast<unsigned char *>(d);
  for (int i = 0; i < 32; i++)
    printf("%d, %d\n", (int)(lxx[i]), (int)(ld[i]));
}

int main(){

  long double xx[2] = {1.0L / 10.0L, 2.0L / 10.0L};
  long double *d;
  cudaMalloc(&d, 2*sizeof(long double));
  cudaMemcpy(d, xx, 2*sizeof(long double), cudaMemcpyHostToDevice);
  k<<<1,1>>>(d);
  cudaDeviceSynchronize();
  double xxd[4] = {1.0L / 10.0L, 2.0L / 10.0L, 0, 0};
  unsigned char *lxx = reinterpret_cast<unsigned char *>(xx);
  unsigned char *lxxd = reinterpret_cast<unsigned char *>(xxd);
  for (int i = 0; i < 32; i++)
    printf("%d,%d\n", (int)(lxxd[i]), (int)(lxx[i]));
}

$ nvcc -o t3 t3.cu
t3.cu(4): warning #20208-D: 'long double' is treated as 'double' in device code
    long double xx[2] = {1.0L / 10.0L, 2.0L / 10.0L};
    ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

t3.cu(4): warning #20208-D: 'long double' is treated as 'double' in device code

t3.cu(4): warning #20208-D: 'long double' is treated as 'double' in device code

$ compute-sanitizer ./t3
========= COMPUTE-SANITIZER
154, 205
153, 204
153, 204
153, 204
153, 204
153, 204
185, 204
63, 204
154, 251
153, 63
153, 52
153, 239
153, 217
153, 73
201, 21
63, 181
0, 205
0, 204
0, 204
0, 204
0, 204
0, 204
0, 204
0, 204
0, 252
0, 63
0, 0
0, 0
0, 0
0, 0
0, 0
0, 0
154,205
153,204
153,204
153,204
153,204
153,204
185,204
63,204
154,251
153,63
153,52
153,239
153,217
153,73
201,21
63,181
0,205
0,204
0,204
0,204
0,204
0,204
0,204
0,204
0,252
0,63
0,0
0,0
0,0
0,0
0,0
0,0
========= ERROR SUMMARY: 0 errors
$

(CUDA 12.5)

Of course, we already have the statement that it’s unsupported, and therefore if you try to use it, it might be UB, so writing a test case and then trying to infer something from it might be a fool’s errand. However, if we push past that, it seems to me that:

  • The behavior of that long double declaration in host code and in device code manifests differently.
  • The behavior of the declaration in device code might be indicated by the warning, whatever that warning means. It looks fairly literal to me. It appears to behave almost exactly as if you had replaced long double with double. (You could nitpick about the total size of the storage space, perhaps.)
  • I don’t see any evidence that cuda-gdb is doing anything untoward. It appears to be using the host definition of long double (quite sensible to me, since it’s unsupported in device code, but we might want to inspect a structure that happens to be carrying one) and interpreting the bytes it finds according to that. Since use of long double in device code is unsupported, that might actually be the most useful behavior for cuda-gdb to manifest (a small reproduction of this effect is sketched below).
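
Concretely: if storage that actually holds two contiguous 8-byte doubles is read back with the host’s 16-byte long double stride, element 0 decodes correctly and element 1 picks up whatever bytes happen to sit 16 bytes further on, which is essentially the picture the cuda-gdb session showed. A host-only sketch (my own illustration, not output from the test case above):

#include <cstdio>
#include <cstring>

int main()
{
  // What the device code appears to emit: two contiguous 8-byte doubles.
  double device_layout[2] = {1.0 / 10.0, 2.0 / 10.0};
  unsigned char stack_bytes[32] = {0};   // surrounding bytes are arbitrary
  memcpy(stack_bytes, device_layout, sizeof(device_layout));

  // What a debugger using the host's 16-byte 'long double' stride would read.
  for (int i = 0; i < 2; i++) {
    double as_seen;
    memcpy(&as_seen, stack_bytes + 16 * i, sizeof(as_seen));
    printf("element %d read at offset %d: %g\n", i, 16 * i, as_seen);
  }
  return 0;
}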

Perhaps it is a double[2] with each element aligned to 16 bytes, but initialized as a double[2] with 8-byte alignment? Or does cuda-gdb just assume the wrong alignment when indexing?

For whatever reason I don’t get a warning, even when I invoke nvcc with just the -c flag, some -I flags, and a .cu file extension. But undefined behavior is just that and not worth pursuing.

Thinking about this, maybe cuda-gdb gets messed up because the debug symbol information gets complicated by the different sizes of the variable on the host and the GPU; if I remember right, there’s only one debugger symbol table and no way to encode two different definitions of the same variable. It looks like nvcc is laying out the variable in storage and initializing it correctly as 8-byte floats; it’s just the debug symbols that get confused.

I brought this up as confusing behavior that could trip someone up, but since it’s specified as undefined behavior, just close this out…
