I’d like a data structure holding four doubles (on a GTX 280), but there is no built-in double4 vector type in CUDA, so I have to implement it myself. I define the aligned struct GPU_qd as follows:
struct __align__(16) GPU_qd {
    double x, y, z, w;
};
When I check the accesses with the Profiler, it reports them as coalesced, but the performance is still poor. Copying 4 M GPU_qd elements from one array to another with a kernel (no other operations, and using a kernel rather than cudaMemcpy) takes about 55 ms, whereas copying 16 M plain doubles takes only about 27 ms. The total amount of data is the same, so I expected similar times. Does anyone know what the problem is? I also wonder why NVIDIA only supports up to double2, while both float2 and float4 are supported.
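For reference, here is a minimal sketch of the two copy kernels I am comparing. The kernel and variable names are just for illustration; each kernel only reads one array and writes another, as described above:

```cuda
struct __align__(16) GPU_qd {
    double x, y, z, w;
};

// Copy kernel for the 4-double struct: 32 bytes per thread.
__global__ void copy_qd(const GPU_qd *in, GPU_qd *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Copy kernel for plain doubles: 8 bytes per thread.
__global__ void copy_double(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}
```

The first kernel is launched over 4 M elements and the second over 16 M, so both move the same 128 MB of data in each direction.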