This half2 behavior needs attention

#include <iostream>
#include <stdio.h>
#include <math.h>
#include <cuda_fp16.h>
#include <iostream>
#include <cstdint>
#include <cstdlib>
#include <cstring>
using namespace std;

int nnz=6;

__global__ void check( half* x, int nnz){
    half2 *x2 = NULL;//(half2*)x;
    half2 zero = __float2half2_rn(0.0);
    half* ptr=NULL; 
    int res=0;
    half2 z= __float2half2_rn(25.0f);

    char* bytePtr = reinterpret_cast<char*>(x);
    bytePtr +=3*sizeof(half);  // adjust pointer
    ptr = reinterpret_cast<half*>(bytePtr);  
    half2 twin;

    for(int i=0;i<(1);i++){
        printf("low %f ", __half2float(__low2half(x2[i])));
        printf("high %f \n", __half2float(__high2half(x2[i])));
        half l = __low2half(x2[i]);
        half r = __high2half(x2[i]);
        //half2 twin = __halves2half2(l,r);
        //printf(" twin low %f -- ", __half2float(l));
        printf(" twin high %f\n ", __half2float(r));

int main()
    float x[nnz] = {1,2,3.0,4,5,6,};
    half *halfx = (half*)malloc(nnz*sizeof(half));

    for(int i=0;i<nnz;i++){
    half* dx;
    cudaMalloc((void**)&dx, nnz*sizeof(half));
    cudaMemcpy(dx, halfx, nnz*sizeof(half), cudaMemcpyHostToDevice);

    cudaError_t cudaStatus = cudaDeviceSynchronize();

    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize failed: %s\n", cudaGetErrorString(cudaStatus));


  • Here in this code, in the last 2 statements inside the kernel, if you run with both the printf, it shows error. But if you comment any one and leave the other. It works Fine.

  • I know this kind of aliasing of pointers in not recommended. But this time it is really required. What we did here is to point the half2 at the at the 4th half value manually. Problem is, then it seems to have weird behavior. It can not be written or updated (try doing something with x2, update or assign it to twin). Really need the discussion to make this work, or at least know how Nvidia actually implements the half2.

  • you can only seem to just read and print, but can not do anything even after extracting the half

char* bytePtr = reinterpret_cast<char*>(x);

assuming that the original x pointer passed to the kernel was allocated via e.g. cudaMalloc, then bytePtr is properly/naturally aligned.

bytePtr +=3*sizeof(half);  // adjust pointer

given that the original x pointer is properly/naturally aligned to half, this bytePtr is properly aligned to both its native type (char) and also half.

ptr = reinterpret_cast<half*>(bytePtr);  

ptr is properly/naturally aligned to its native type (half)


x2 is not properly/naturally aligned to half2 type, its native type. Therefore this pointer is incorrectly constructed and subsequent usage (i.e. dereferencing) is illegal in CUDA.

You’re getting confused by your own indexing (thinking about things with zero based indexing or one based indexing). In zero-based indexing (not the nomenclature you are using), the “4th” half value would be naturally aligned for a half2 pointer. But in one-based indexing (what you are actually using from a nomenclature standpoint, and what your code is actually doing) the 5th value would be the properly aligned one.

half values:      x   y  z  a b c d e f ...
zero-based:       0   1  2  3 4 5 6 7 8 ...
one-based:        1   2  3  4 5 6 7 8 9 ...
half2 aligned:    *      *    *   *   * ...

To avoid this, instead of this:

bytePtr +=3*sizeof(half);  // adjust pointer

you could have done this:

bytePtr +=2*sizeof(half);  // adjust pointer

or this:

bytePtr +=4*sizeof(half);  // adjust pointer
1 Like

Thank you for your reply
half indices(var x): 0 1 2 3 4 5 6 7 8
half2* y=(half2*)x makes (0,1), (2,3),(3,4), (5,6) as half2 pairs and can be accessed by y[0], y[1], y[2]…
I want to point a half2 pointer from index 3 or any odd index position in x, so (2,3) would be new half2, (4,5) would be new half2.

That is my question. How can I achieve it in Cuda, is it even possible?

Not possible in CUDA. I already linked to the relevant doc section indicating this. Dereferencing a half2 pointer (or any type pointer) in device code requires “natural” alignment of the pointer.

You would have to fall back to half pointer use, or else use the half2 pointer and pack/unpack adjacent quantities.

1 Like

Thank you again for such insight.
But what do you meant by " usehalf2 pointer and pack/unpack adjacent quantities."
it still will not give me half2 of index pairs (2,3), (4,5)?

I mean:

half2:     0       1       2
half:     (0,1)   (2,3)   (4,5)   

Suppose I have a half2 pointer *h2 pointing to element 0.

Suppose I want half elements 1 and 2, packed into a single half2 quantity.

You would retrieve half2 elements 0 and 1, unpack those two quantities into 4 half quantities, 0-3, and repack half quantities 1 and 2 into a new half2 quantity nh2:

half2 q1 = h2[0];
half2 q2 = h2[1];
half2 nh2 = {q1.y, q2.x};

But this would require additional memory allocation, right?
But I got your solution.

If nh2 is an “additional memory allocation” then yes it requires additional memory allocation.

I think this is not a sensible question without an exact use case, and I can’t extract an exact use case from your previous code.

you could do this:

half2 *h1 = ...; // properly defined
half2 *h2 = ...; // properly defined

h2[0] = {h1[0].y, h1[1].x};

Does that involve an extra memory allocation? I don’t know. You decide.

Thank you.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.