Compiler Error for Byte To Int Conversion

Hi,

I’m trying to take an 8-byte character array and convert it to a 64-bit integer. The code I have below compiles and works fine in plain (non-CUDA) C++, but does not compile under CUDA’s nvcc. If there is another way to do this conversion, please let me know; I know some functions I tried didn’t work simply because they were not device functions.

I receive this error when I compile the code:

1>### Assertion failure at line 1923 of ../../be/cg/cgemit.cxx:
1>### Compiler Error in file C:/Users/john/AppData/Local/Temp/tmpxft_00002238_00000000-9_DeviceFunctions.cpp3.i during Assembly phase:
1>### incorrect register class for operand 0

Here is the code I’m compiling. All I’m really doing is reversing the byte order (endianness) and then reading the swapped array back as an unsigned 64-bit integer.

__device__ void dConvertCAToInt64(unsigned __int64* intInt64Out, unsigned char* chrCAIn)
{
    unsigned char chrCAOutRev[8];
    unsigned __int64* p_intInt64 = (unsigned __int64*)&chrCAOutRev;
    chrCAOutRev[0] = chrCAIn[7];
    chrCAOutRev[1] = chrCAIn[6];
    chrCAOutRev[2] = chrCAIn[5];
    chrCAOutRev[3] = chrCAIn[4];
    chrCAOutRev[4] = chrCAIn[3];
    chrCAOutRev[5] = chrCAIn[2];
    chrCAOutRev[6] = chrCAIn[1];
    chrCAOutRev[7] = chrCAIn[0];
    *intInt64Out = *p_intInt64; // <---- This is the specific line that is causing the error!!!
}

Any info is appreciated.

Thanks.

Try doing the cast using the [font=“Courier New”]reinterpret_cast[/font] operator.

The [font=“Courier New”]__byte_perm()[/font] intrinsic comes in handy for that. Check appendix C.2.3 of the (CUDA 3.2) Programming Guide.
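For example, reversing the bytes of a single 32-bit word could be written like this ([font=“Courier New”]swap32[/font] is just a name made up for this sketch):

__device__ unsigned int swap32(unsigned int x)
{
    // Selector 0x0123 picks input bytes 3, 2, 1, 0 into result bytes 0..3,
    // i.e. it reverses the byte order of x.
    return __byte_perm(x, 0, 0x0123);
}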

This code might also be easier on the compiler if you used bit operations to build your int64, rather than type-punning:

*intInt64Out = (chrCAIn[0] << 56) | (chrCAIn[1] << 48) | (chrCAIn[2] << 40) | (chrCAIn[3] << 32)
             | (chrCAIn[4] << 24) | (chrCAIn[5] << 16) | (chrCAIn[6] <<  8) |  chrCAIn[7];

This may also have better performance, as your original function will write to an 8-byte array in local memory before reading it back again. Bit-shifting and bitwise OR can be done entirely in registers after the input data is read.

(Normally, a local array indexed with constants, as you have written it, could also be placed entirely into registers. However, I don’t think the instruction set offers any way to reinterpret_cast registers in this manner. It’s possible that this is why the compiler failed an assertion when it discovered it had backed itself into a corner.)

I like this concept, but I’ve never really worked with bit operators. I’ve been testing this and it is not acting as expected. I’m not running this code on the device, just on the CPU, because it is easier to debug…

unsigned char chrTest1[8] = { 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08 };
unsigned __int64* pintA = (unsigned __int64*)&chrTest1;
unsigned char chrTest2[8] = { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 };
unsigned __int64* pintB = (unsigned __int64*)&chrTest2;

if I do this: *pintB = chrTest1[0] << 8;

chrTest2 contains: 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 <- That is what I would expect

if I do this: *pintB = chrTest1[0] << 16;

chrTest2 contains: 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00 <- That is what I would expect

if I do this: *pintB = chrTest1[0] << 24;

chrTest2 contains: 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00 <- That is what I would expect

if I do this: *pintB = chrTest1[0] << 32;

chrTest2 contains: 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 <- Seems like the shift wraps around at 32 bits, even though both the char array and the int variables are 64 bits.

Note that

1. you need to cast your data to the larger type before doing the shift (see the corrected sketch below),

2. shifts by more bits than the word is wide are undefined, and

3. the CPU and the GPU behave differently in that case: the CPU wraps at 32 bits, the GPU doesn’t.
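Here is the earlier shift-based suggestion with the casts applied (a sketch; I haven’t run it on the device, but it should now be well-defined):

__device__ void dConvertCAToInt64(unsigned __int64* intInt64Out, unsigned char* chrCAIn)
{
    // Cast each byte to 64 bits *before* shifting, so shifts of 32 bits or more are defined.
    *intInt64Out = ((unsigned __int64)chrCAIn[0] << 56) |
                   ((unsigned __int64)chrCAIn[1] << 48) |
                   ((unsigned __int64)chrCAIn[2] << 40) |
                   ((unsigned __int64)chrCAIn[3] << 32) |
                   ((unsigned __int64)chrCAIn[4] << 24) |
                   ((unsigned __int64)chrCAIn[5] << 16) |
                   ((unsigned __int64)chrCAIn[6] <<  8) |
                    (unsigned __int64)chrCAIn[7];
}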

Byte-reversal can be expressed in any of a number of ways, e.g.

typedef unsigned long long uint64;
typedef unsigned int       uint32;

// Classic mask-and-shift byte reversal.
__device__ void rev1(uint64 *in, uint64 *out)
{
    *out = (*in & 0x00000000000000ffull) << 56 |
           (*in & 0x000000000000ff00ull) << 40 |
           (*in & 0x0000000000ff0000ull) << 24 |
           (*in & 0x00000000ff000000ull) <<  8 |
           (*in & 0x000000ff00000000ull) >>  8 |
           (*in & 0x0000ff0000000000ull) >> 24 |
           (*in & 0x00ff000000000000ull) >> 40 |
           (*in & 0xff00000000000000ull) >> 56;
}

// Byte reversal via uchar4 component shuffling.
__device__ void rev2(uint64 *in64, uint64 *out64)
{
    uchar4* in  = (uchar4*)in64;
    uchar4* out = (uchar4*)out64;

    out[0].x = in[1].w;
    out[0].y = in[1].z;
    out[0].z = in[1].y;
    out[0].w = in[1].x;
    out[1].x = in[0].w;
    out[1].y = in[0].z;
    out[1].z = in[0].y;
    out[1].w = in[0].x;
}

// Byte reversal using the __byte_perm() intrinsic on each 32-bit half.
__device__ void rev3(uint64 *in, uint64 *out)
{
    *out = (uint64)__byte_perm((uint32)*in, 0, 0x0123) << 32
           | __byte_perm((uint32)(*in >> 32), 0, 0x0123);
}

[font=“Courier New”]rev1()[/font] would be kind of a standard way of doing this in C. [font=“Courier New”]rev2()[/font] appears to give the shortest machine code on compute capability 1.x devices, while [font=“Courier New”]rev3()[/font] gives the shortest code on 2.x devices where [font=“Courier New”]__byte_perm()[/font] compiles to a single instruction.

EDIT: Note that [font=“Courier New”]rev2()[/font] does not allow using the same pointer for the [font=“Courier New”]in[/font] and [font=“Courier New”]out[/font] arguments (the output would overwrite input bytes before they are read), so it might be better to use the slightly longer code

__device__ void rev2a(uint64 *in64, uint64 *out64)
{
    uchar4* in  = (uchar4*)in64;
    uchar4* out = (uchar4*)out64;
    // Read the input completely before writing, so in64 == out64 is safe.
    uchar4 in0 = in[0], in1 = in[1];

    out[0].x = in1.w;
    out[0].y = in1.z;
    out[0].z = in1.y;
    out[0].w = in1.x;
    out[1].x = in0.w;
    out[1].y = in0.z;
    out[1].z = in0.y;
    out[1].w = in0.x;
}

which, however, compiles to the same number of instructions on 1.x devices and is even one instruction shorter on 2.x devices.

Thanks for this info. I think I have this working now.

How can I printf a 64-bit integer in device code? It seems to truncate the value somehow. I want to be sure my values are correct.

Thanks again.

unsigned long long var;
printf("%llx\n", var);

?

Don’t have a compute capability 2.x device at hand so I can’t check, but this should work.
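For completeness, a minimal self-contained test could look like the following (also untested here, for the reason above; note that device-side printf requires a compute capability 2.x device, and I’m assuming the CUDA 4.0 runtime for cudaDeviceSynchronize()):

#include <cstdio>

__global__ void printValue()
{
    unsigned long long var = 0x0102030405060708ull;
    printf("%llx\n", var);
}

int main()
{
    printValue<<<1, 1>>>();
    cudaDeviceSynchronize(); // flush the device-side printf buffer
    return 0;
}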

Thanks I’ll give that a shot.

Well, your function works; the problem I have is that anytime/anywhere I try to cast my char array to a uint64, I get the assertion-failure error. Since your examples take two uint64* parameters, I need to convert my char[8] into a uint64*, and anywhere I do that in the code, I trigger the assertion failure.

I need to somehow convert the array before sending it through the reversal function.

Thanks.

Ok, if you actually start from an array of bytes, you are probably better off with a variant of [font=“Courier New”]rev2()[/font]:

__device__ void dConvertCAToInt64(unsigned __int64* out64, unsigned char* in)
{
    uchar4* out = (uchar4*)out64;

    out[0].x = in[7];
    out[0].y = in[6];
    out[0].z = in[5];
    out[0].w = in[4];
    out[1].x = in[3];
    out[1].y = in[2];
    out[1].z = in[1];
    out[1].w = in[0];
}

Check, however, that the resulting byte order is what you expect.
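For instance, on a little-endian device (which, as far as I know, current GPUs are), input bytes { 0x01, 0x02, …, 0x08 } should come out as 0x0102030405060708:

unsigned char bytes[8] = { 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08 };
unsigned __int64 value;
dConvertCAToInt64(&value, bytes); // value should now be 0x0102030405060708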

I’m a bit confused about your example: out is defined twice (in the parameters and as a uchar4*).

Thanks.

Oops, yes. I’ve corrected the mistake in the post.

Unless I’m not seeing your update, I’m still confused: out is defined twice, as the uint64 in the parameters and as a uchar4* in the function. Also, out64 is not defined. These are the compiler errors I’m getting.

Thanks.

The parameter name was supposed to be out64, so that it can be used in setting out. I could fix this in my own post, but not in the copy of it in your post…

Thanks. That still gives the assertion failure.

Here is what I ended up doing for now; not the best performance-wise…

__device__ void dConvertCAToInt64(uint64* out, unsigned char* in)
{
    // Cast each byte to 64 bits first so the repeated multiplications by 256 don't overflow int.
    *out = 0;
    *out = *out + ((uint64)in[0] * 256 * 256 * 256 * 256 * 256 * 256 * 256);
    *out = *out + ((uint64)in[1] * 256 * 256 * 256 * 256 * 256 * 256);
    *out = *out + ((uint64)in[2] * 256 * 256 * 256 * 256 * 256);
    *out = *out + ((uint64)in[3] * 256 * 256 * 256 * 256);
    *out = *out + ((uint64)in[4] * 256 * 256 * 256);
    *out = *out + ((uint64)in[5] * 256 * 256);
    *out = *out + ((uint64)in[6] * 256);
    *out = *out + in[7];
}

This ought to be equivalent to a working bitshift-and-bitwise-OR solution (I guess mine didn’t have the appropriate typecasts). I think the compiler turns the multiplications by powers of 2 into left shifts for you anyway, so I imagine this isn’t so bad.

This kind of error message

1>### Assertion failure at line 1923 of ../../be/cg/cgemit.cxx:
1>### Compiler Error in file C:/Users/john/AppData/Local/Temp/tmpxft_00002238_00000000-9_DeviceFunctions.cpp3.i during Assembly phase:
1>### incorrect register class for operand 0

indicates a problem that is internal to the compiler. If this problem is reproducible with CUDA 4.0, it would be helpful if you could file a bug, attaching a self-contained repro case, so the CUDA compiler team can have a look at it (and possibly suggest a workaround). Thank you for your help.