Performance problem with an aligned structure


I’d like to have a data structure holding four doubles (on a GTX280 card). CUDA has no built-in double4 vector type, so I have to implement it myself. I use the following alignment code to define the structure type GPU_qd:

struct __align__(16) GPU_qd
{
    double2 d1;
    double2 d2;
};

I used the profiler to check the accesses, and it shows they are still coalesced. But the performance does not seem very good. Copying 4 M GPU_qd elements takes about 55 ms (no other operations, just a kernel reading from one array and writing to another, rather than a memcpy call), whereas copying 16 M doubles takes only about 27 ms. Both arrays are the same size in bytes, so I expected similar timings. Does anyone know what the problem is? I also wonder why NVIDIA only supports up to double2 rather than double4, when float2 and float4 are both supported.
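For anyone who wants to reproduce the comparison, here is a rough sketch of the kind of test described above. It is not the original benchmark code: the kernel name copyKernel, the helper timeCopy, the block size of 512 and the single warm-up launch are all assumptions made for illustration, and it assumes a reasonably recent toolkit (an older one targeting GT200 would need a double-precision-capable architecture flag). Error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Four doubles packed as two double2 halves, 16-byte aligned as in the post.
struct __align__(16) GPU_qd {
    double2 d1;
    double2 d2;
};

// Element-wise copy, one element per thread. T is double or GPU_qd here.
// (Hypothetical kernel; the original post does not show its kernel.)
template <typename T>
__global__ void copyKernel(const T* in, T* out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Times a single launch of copyKernel<T> over n elements, in milliseconds.
template <typename T>
float timeCopy(size_t n) {
    T* in = nullptr;
    T* out = nullptr;
    cudaMalloc(&in, n * sizeof(T));
    cudaMalloc(&out, n * sizeof(T));
    cudaMemset(in, 0, n * sizeof(T));

    const int threads = 512;  // assumed block size
    const unsigned blocks = (unsigned)((n + threads - 1) / threads);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    copyKernel<T><<<blocks, threads>>>(in, out, n);  // warm-up launch, not timed
    cudaEventRecord(start);
    copyKernel<T><<<blocks, threads>>>(in, out, n);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return ms;
}

int main() {
    // Both cases move the same number of bytes: 4 M * 32 B == 16 M * 8 B.
    const size_t nQd = (size_t)1 << 22;      // 4 M GPU_qd elements
    const size_t nDouble = (size_t)1 << 24;  // 16 M doubles

    printf("GPU_qd copy: %.3f ms\n", timeCopy<GPU_qd>(nQd));
    printf("double copy: %.3f ms\n", timeCopy<double>(nDouble));
    return 0;
}
```

The only point of the sketch is that both launches move the same number of bytes, so any timing difference comes from how each thread's accesses map onto memory transactions.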



Okay… I found one problem: the number of coalesced accesses seems to have doubled. So I think the built-in vector types do have some optimization. I really hope CUDA will support double4.

You will only see coalesced accesses on GT200; the memory access rules have changed there. There is no such thing as an uncoalesced access anymore.

I believe the memory controller supports memory transfers of up to 128 bits (= 4 floats / 2 doubles), so that could be the reason double4 is not supported.
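If the 128-bit limit is the cause, a 32-byte GPU_qd would be read as two separate 16-byte loads, and for each of those loads neighbouring threads touch addresses 32 bytes apart instead of 16. One possible workaround, sketched here only as an assumption and not tested on a GTX280, is to store the two halves in separate double2 arrays (a structure-of-arrays layout) so that each load is contiguous across threads. The names QdArrays and copyQdSoA are made up for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical structure-of-arrays layout: element i of the logical
// four-double array is the pair (d1[i], d2[i]).
struct QdArrays {
    double2* d1;
    double2* d2;
};

// Each thread issues two 16-byte loads and two 16-byte stores; within each
// access, consecutive threads touch consecutive 16-byte words.
__global__ void copyQdSoA(QdArrays in, QdArrays out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        out.d1[i] = in.d1[i];
        out.d2[i] = in.d2[i];
    }
}

int main() {
    const size_t n = (size_t)1 << 20;  // arbitrary test size
    const size_t bytes = n * sizeof(double2);

    // Host data with a distinct value in every component.
    std::vector<double2> h1(n), h2(n);
    for (size_t i = 0; i < n; ++i) {
        h1[i] = make_double2(4.0 * i, 4.0 * i + 1.0);
        h2[i] = make_double2(4.0 * i + 2.0, 4.0 * i + 3.0);
    }

    QdArrays in, out;
    cudaMalloc(&in.d1, bytes);
    cudaMalloc(&in.d2, bytes);
    cudaMalloc(&out.d1, bytes);
    cudaMalloc(&out.d2, bytes);
    cudaMemcpy(in.d1, h1.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(in.d2, h2.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;  // assumed block size
    const unsigned blocks = (unsigned)((n + threads - 1) / threads);
    copyQdSoA<<<blocks, threads>>>(in, out, n);

    // The blocking cudaMemcpy calls also wait for the kernel to finish.
    std::vector<double2> r1(n), r2(n);
    cudaMemcpy(r1.data(), out.d1, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(r2.data(), out.d2, bytes, cudaMemcpyDeviceToHost);

    // Compare every component against the host source data.
    bool ok = true;
    for (size_t i = 0; i < n; ++i) {
        if (r1[i].x != h1[i].x || r1[i].y != h1[i].y ||
            r2[i].x != h2[i].x || r2[i].y != h2[i].y) {
            ok = false;
            break;
        }
    }
    printf("%s\n", ok ? "copy verified" : "mismatch");

    cudaFree(in.d1);
    cudaFree(in.d2);
    cudaFree(out.d1);
    cudaFree(out.d2);
    return ok ? 0 : 1;
}
```

Whether this actually closes the gap to the plain double copy would have to be measured. The trade-off is that code using the data has to carry two pointers around instead of one array of structs.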