Additional requirements for coalesced reads/writes: structs must map to built-in type

I didn’t see this explicitly in the CUDA Programming Guide, so I thought I’d post it here. It seems that if you want global memory reads/writes of a struct to be coalesced, the fields of the struct must map exactly to one of the built-in (vectorized) types. So, for example, none of the following structs gets coalesced reads or writes, even if it is 16-byte-aligned and half-warp-aligned.

typedef __align__(16) struct
{
    char a,b,c,d;
    char e,f,g,h;
    char i,j,k,l;
    char m,n,o,p;
} t1;

typedef __align__(16) struct
{
    char a,b,c,d;
    int e;
    int f;
    int g;
} t2;

typedef __align__(16) struct
{
    char a,b;
    short c;
    int d;
    int e;
} t3;

typedef __align__(16) struct
{
    short a,b;
    int c;
    int d;
    int e;
} t4;

typedef __align__(16) struct
{
    short a,b;
    int c;
    int d;
} t5;

This isn’t too surprising, but it should really be explicit somewhere in the documentation. Have I got the rule correct, or is it more subtle?
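For anyone who wants to double-check the "right size and alignment" part, here is a rough host-side sketch in plain C++. It only confirms that two of the structs above occupy the same aligned 16-byte slot as a uint4-like type, so size and alignment alone do not explain the difference. `U4`, `T1`, `T5` and `same_footprint_as_u4` are names made up for this sketch, `alignas(16)` stands in for `__align__(16)`, and it assumes a 2-byte short and a 4-byte int.

```cpp
#include <cstddef>

// Host-side stand-ins: alignas(16) plays the role of __align__(16), and U4 is a
// hypothetical substitute for the built-in uint4. Assumes 2-byte short, 4-byte int.
struct alignas(16) U4 { unsigned int x, y, z, w; };
struct alignas(16) T1 { char a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p; };
struct alignas(16) T5 { short a, b; int c; int d; };

// True when T occupies exactly one aligned 16-byte slot, as U4 does.
template <typename T>
constexpr bool same_footprint_as_u4() {
    return sizeof(T) == sizeof(U4) && alignof(T) == alignof(U4);
}

static_assert(same_footprint_as_u4<T1>(), "t1: 16 chars fill the slot exactly");
static_assert(same_footprint_as_u4<T5>(), "t5: 12 bytes of fields, padded to 16");
static_assert(offsetof(T5, d) == 8, "t5: last field starts at byte 8");
```

If these checks pass, the footprint matches; what differs is only how the fields are declared.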

You can test it yourself with the attached modification of MisterAnderson42’s bandwidth test code.
bw_test_struct.zip (1.55 KB)

Have you tried reading them as float4*?

Reading/writing speed can be fine as float4, but the casting and byte manipulations are problematic. I’ve had to be careful about converting, for example, between ints and chars. Byte order is different on x86 host and the device, I think. I’m sure it can be done properly, but requires a bit of care.

Of course, my point was just to make sure people know the limitation, not to say that it’s insurmountable.

To expand on that a bit, I seem to recall that casting works consistently between emulation and device mode if you go between uchar4 and int, but not if you use an array of chars.
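On the byte-order worry, a small host-side sketch may help separate the two ways of getting at the bytes. `Bytes4` is a made-up stand-in for uchar4, and all function names are hypothetical. Reinterpreting the storage gives a result that depends on the machine's byte order, while shifting and masking does not. x86 is little-endian, and as far as I know CUDA devices are too, but that is an assumption worth testing rather than trusting.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the built-in uchar4.
struct Bytes4 { unsigned char x, y, z, w; };
static_assert(sizeof(Bytes4) == 4, "four packed bytes expected");

// Reinterpret the storage of v: which byte lands in x depends on byte order.
inline Bytes4 reinterpret_bytes(std::uint32_t v) {
    Bytes4 b;
    std::memcpy(&b, &v, sizeof b);
    return b;
}

// Shift and mask: x is always the least significant byte, on any machine.
inline Bytes4 split_bytes(std::uint32_t v) {
    Bytes4 b;
    b.x = static_cast<unsigned char>(v & 0xFFu);
    b.y = static_cast<unsigned char>((v >> 8) & 0xFFu);
    b.z = static_cast<unsigned char>((v >> 16) & 0xFFu);
    b.w = static_cast<unsigned char>((v >> 24) & 0xFFu);
    return b;
}

// Little-endian means the least significant byte is stored first.
inline bool is_little_endian() {
    return reinterpret_bytes(1u).x == 1;
}
```

On a little-endian machine the two functions agree, so running the same comparison on host and device would show whether byte order is really the source of the mismatch.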

So I have tested out doing the reads as a built-in type and then manually casting and assigning the struct fields. It works just fine, and coalesces properly. Given that, I think this should be considered a missing compiler feature (I won’t go so far as to call it a bug) rather than a hardware limitation. It would be great if the compiler was doing this for us behind the scenes, so we could read any properly-aligned structs of the right size.

BTW, I did the read as a uint4 rather than a float4; the float4 version crashed the compiler. To get the chars, I cast one of the fields of the uint4 to a uchar4 and did the manual assignments. If shorts and chars are involved, you can cast one field of the uint4 to a ushort2, then cast one field of the ushort2 to a uchar4. I’m not sure if there is some underlying issue with going between integer and floating point types. In principle there shouldn’t be, I think, but it may just be some issue in the current compiler code.

Have you tried looking at the .ptx code generated from your source? It’s possible that the compiler generates code that writes your structure of 16 elements in several steps, as fields become available. This would break coalescing. Please confirm if that’s what’s happening.

Paulius

I’m not completely sure what you mean by “as fields become available”. I’m attaching the .ptx below. It is indeed breaking the reads/writes into several steps, which breaks coalescing, as you say. The question at this point is whether the compiler could do something a bit smarter. (BTW, I’ve gone ahead and filed a bug/feature request for this as well).
bw_test_struct.ptx.gz (5.07 KB)

Thanks for filing the bug. I’ve run into a case where I had to coax the compiler into not breaking up a uchar4 store. I’ll see about linking the two bug reports.

Here’s what I meant by the fields becoming available. Some of the fields in a structure will have their values computed before others. Say, fields a,b,c,d get computed first. Thus, sometimes the compiler decides to store the fields that are ready, while computing the results for the remaining ones. This would be a good optimization if coalescing wasn’t being broken - registers are being conserved and writes are being overlapped with arithmetic.

Paulius