Your last code sample doesn't follow the alignment rules, and it reads 1 byte too many, which can crash the kernel if that extra byte falls in a memory page you haven't allocated and aren't supposed to read from. Also, according to the docs, the compiler should still issue 3 loads because it can't guarantee the address is aligned; and if it doesn't, the hardware should take the penalty anyway.
You could easily write a function that fetches 4 of your int24 values using three 4-byte loads (followed by shifts/masks). That way you get nice coalesced loads of your packed data.
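Here is a rough sketch of that idea, written as plain C++ so it can be checked on the host. The names `Int24x4`, `sign_extend24` and `unpack4_int24` are made up for illustration, and the code assumes the values are signed, packed little-endian, and that each group of four starts at a 4-byte-aligned address (i.e. on a 12-byte boundary of an aligned buffer). In a kernel you would presumably mark the functions `__device__` and have neighbouring threads read neighbouring 32-bit words.

```cpp
#include <cstdint>

// Hypothetical container for four unpacked 24-bit values.
struct Int24x4 {
    int32_t v[4];
};

// Sign-extend the low 24 bits of x to a 32-bit signed integer.
inline int32_t sign_extend24(uint32_t x) {
    x &= 0x00FFFFFFu;
    // Flip the sign bit, then subtract its weight: portable two's-complement trick.
    return static_cast<int32_t>(x ^ 0x00800000u) - 0x00800000;
}

// Unpack four little-endian int24 values from three aligned 32-bit words.
// Assumption: `words` points at a 4-byte-aligned group of 12 bytes.
inline Int24x4 unpack4_int24(const uint32_t* words) {
    // Three aligned 4-byte loads, nothing read past the 12 bytes.
    const uint32_t w0 = words[0];
    const uint32_t w1 = words[1];
    const uint32_t w2 = words[2];

    Int24x4 out;
    out.v[0] = sign_extend24(w0);                       // bytes 0..2
    out.v[1] = sign_extend24((w0 >> 24) | (w1 << 8));   // bytes 3..5
    out.v[2] = sign_extend24((w1 >> 16) | (w2 << 16));  // bytes 6..8
    out.v[3] = sign_extend24(w2 >> 8);                  // bytes 9..11
    return out;
}
```

Since every load is a whole aligned word inside the 12-byte group, this never touches the extra byte past the end of the data. If your values are unsigned, you would drop the sign extension and just mask with `0x00FFFFFF`.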