Fetch 4 bytes from global memory

I need to save memory, so I created a 24-bit integer type made of 3 unsigned chars:

[codebox]

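class Int24
{
protected:
    unsigned char m_Internal[3];  // little-endian: byte 0 is the least significant

public:
    // Reassemble the three bytes into a 32-bit int (upper 8 bits are zero).
    __host__ __device__ operator int() const
    {
        return (m_Internal[2] << 16) | (m_Internal[1] << 8) | m_Internal[0];
    }
};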

[/codebox]

In kernel code I read an Int24 object as an integer, and it works perfectly. However, when I look at the generated assembly (PTX), I see:

[codebox]

ld.global.u8 %r32, [%rd12+0];

ld.global.u8 %r33, [%rd12+1];

shl.b32 %r34, %r33, 8;

ld.global.u8 %r35, [%rd12+2];

[/codebox]

That is 3 memory reads, instead of the single 4-byte read I would get with a normal integer.

Is there a way to make the kernel read 4 bytes from global memory at once? Then I could just mask off the extra byte.

I tried the following, but the kernel failed with an “unspecified launch failure”, even though the same code works in ordinary C++.

[codebox]

int z = *((int *)&y[iii]);

z&=16777215;

[/codebox]

(y is an array of Int24.)

Thanks in advance

Your last code sample does not follow the alignment rules: elements of a packed Int24 array sit at 3-byte strides, so most of them are not 4-byte aligned. It also reads 1 extra byte, which can crash the kernel if that byte falls in a memory page you have not allocated and are not supposed to read from. Also, according to the docs, the compiler should still emit 3 loads for your class because it can’t guarantee that the address is aligned, and if it didn’t, the hardware would take the penalty anyway.
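To make the alignment point concrete, here is a minimal sketch of the arithmetic. The helper names `int24_byte_offset` and `int24_is_word_aligned` are hypothetical, not from the original post. It assumes the array is tightly packed (3 bytes per element) and that its base address is at least 4-byte aligned.

```cpp
#include <cstddef>

// Hypothetical helper: byte offset of element i in a tightly packed
// array of 3-byte Int24 objects.
constexpr std::size_t int24_byte_offset(std::size_t i)
{
    return 3 * i;
}

// Hypothetical helper: true if a 4-byte load starting at element i would be
// naturally aligned, assuming the array base itself is 4-byte aligned.
constexpr bool int24_is_word_aligned(std::size_t i)
{
    return int24_byte_offset(i) % 4 == 0;
}

// Only every fourth element starts on a 4-byte boundary.
static_assert(int24_is_word_aligned(0), "element 0 starts at byte 0");
static_assert(!int24_is_word_aligned(1), "element 1 starts at byte 3");
static_assert(int24_is_word_aligned(4), "element 4 starts at byte 12");
```

Under these assumptions, a cast such as `*((int *)&y[iii])` is aligned only when `iii` is a multiple of 4, and for the last element of the array the 4-byte load also touches one byte beyond the allocation.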

You could easily write a function that fetches 4 of your Int24 values using three 4-byte loads, followed by shifts and masks. That way you get nice coalesced loads of your packed data.
