Bitfields are always complex and dangerous tools to use! You’ll likely run into problems.
As a guess, your problem is not CUDA related. It's the assumption that an array of structs will be packed tightly even when each struct is an odd size smaller than a word.
This is extremely unlikely to work in practice… think of how the compiler would have to access an element: it'd have to do integer math to figure out the phase of the struct, read from either one or two words based on that phase, patch the results back together with shifts to re-align them, and only THEN access the value. Ugh. So what a compiler will most likely do instead is pad the struct to a whole word length, so that every array element starts on a word boundary.
Again, not a CUDA specific issue. The C standard says that bit fields are packed into “storage units” whose size is implementation-defined, and they are likely 4-byte words in this case.
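A quick way to check what your compiler actually does is to look at `sizeof`. This is only a minimal sketch: the struct `Sample24` and the helper names are made up for illustration, and the exact size is implementation-defined (on common compilers I'd expect 4 bytes, not 3).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical 24-bit record; the name and field are illustrative only.
struct Sample24 {
    std::uint32_t value : 24;  // 24 bits requested; the storage unit is up to the compiler
};

// Bytes an array of n elements really occupies (n times the struct size).
constexpr std::size_t actual_bytes(std::size_t n) { return n * sizeof(Sample24); }

// Bytes the same array would need if 24-bit elements were packed back to back.
constexpr std::size_t tight_bytes(std::size_t n) { return n * 3; }
```

If `actual_bytes(n)` comes out larger than `tight_bytes(n)`, the array is padded and any code that assumes a 3-byte stride will read the wrong bytes.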
I’d say skip the complexity and ugliness of bitfields entirely. If tight packing is really crucial, work with a char array and assemble the data yourself with byte reads and shifts.
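Something along these lines is what I mean by assembling the data yourself. It's a sketch under my own assumptions: the names `store24` and `load24` are invented, and I've picked a little-endian byte order with 3 bytes per element.

```cpp
#include <cstddef>
#include <cstdint>

// Write the low 24 bits of `value` as element `index` of a tightly packed
// byte buffer (3 bytes per element, least significant byte first).
inline void store24(std::uint8_t* buf, std::size_t index, std::uint32_t value) {
    std::uint8_t* p = buf + 3 * index;
    p[0] = static_cast<std::uint8_t>(value & 0xFFu);
    p[1] = static_cast<std::uint8_t>((value >> 8) & 0xFFu);
    p[2] = static_cast<std::uint8_t>((value >> 16) & 0xFFu);
}

// Read element `index` back by combining its three bytes with shifts.
inline std::uint32_t load24(const std::uint8_t* buf, std::size_t index) {
    const std::uint8_t* p = buf + 3 * index;
    return static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16);
}
```

Because the layout is spelled out byte by byte, it doesn't depend on how any particular compiler lays out bitfields. The price is three byte-sized accesses per element.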
But also remember that in CUDA, 32-bit words are the preferred unit of memory access, not chars.
If you just use an int array, you may have 25% wasted space (8 unused bits out of every 32), but it will be easy, clean, portable, and efficient.
I’ll reiterate what SPWorley said… you don’t want to do 24-bit integers on CUDA. I believe the smallest ‘atomic’ size (I hesitate to use that word, since in CUDA it has a totally different meaning) is the 32-bit integer. Also, keep in mind that you’ll (probably) get better performance by ‘wasting’ that extra space, because your memory reads will coalesce better.
If you absolutely want/need to use the 24-bit ints, I might go with the approach of using a 96-bit struct (3x32-bit integers), and then breaking out the 4x24-bit integers once the data has been loaded inside the thread.
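Here's a rough sketch of that 96-bit approach. Everything named here (`Packed4x24`, `pack4x24`, `unpack4x24`) is made up for illustration, and the bit layout is my own assumption: the four values are laid end to end starting at the low bits of the first word. It's written as plain C++; in a kernel you'd presumably mark the functions `__device__` and call `unpack4x24` after the thread has loaded its three words.

```cpp
#include <cstdint>

// Three 32-bit words holding four 24-bit values (96 bits total).
struct Packed4x24 {
    std::uint32_t w[3];
};

// Pack four values (only the low 24 bits of each are kept).
inline Packed4x24 pack4x24(const std::uint32_t v[4]) {
    const std::uint32_t m = 0xFFFFFFu;
    Packed4x24 p;
    p.w[0] = (v[0] & m) | ((v[1] & m) << 24);          // v0, then low 8 bits of v1
    p.w[1] = ((v[1] & m) >> 8) | ((v[2] & m) << 16);   // high 16 of v1, low 16 of v2
    p.w[2] = ((v[2] & m) >> 16) | ((v[3] & m) << 8);   // high 8 of v2, then v3
    return p;
}

// Break the four 24-bit values back out of the three words.
inline void unpack4x24(const Packed4x24& p, std::uint32_t out[4]) {
    const std::uint32_t m = 0xFFFFFFu;
    out[0] = p.w[0] & m;
    out[1] = ((p.w[0] >> 24) | (p.w[1] << 8)) & m;
    out[2] = ((p.w[1] >> 16) | (p.w[2] << 16)) & m;
    out[3] = p.w[2] >> 8;
}
```

The point of this layout is that all memory traffic stays in whole 32-bit words, and the shifting and masking happens in registers once the data is inside the thread.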