Reduce memory accesses by combining two arrays into one

Hi, I have ported my algorithm to the GPU and it works great. However, I am still wondering if I can optimize the CUDA code to make it run faster. My algorithm is memory bound and I have already made sure my memory accesses are coalesced.
My algorithm accesses two float arrays of the same size N, A and B, and it always uses the same index in the calculation, say result[i] = A[i] + B[i].
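For context, a minimal sketch of the access pattern described above (kernel and variable names are illustrative, not from the original post):

```cuda
// Baseline: one thread per element, two separate 4-byte loads per thread.
__global__ void add_kernel(const float *A, const float *B, float *result, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        result[i] = A[i] + B[i];  // loads A[i] and B[i] from two distinct arrays
}
```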

Now, since CUDA coalesces accesses into 128-byte transactions, a thread should be able to fetch 2 floats in a single load. So can I store my data in a struct

struct {
    float A;
    float B;
} myDataStructure;

and halve the number of memory instructions? The problem is, as my A array gets updated, how do I update this structure fast? Is there any trick to quickly copy my A array into the struct's A component, or do I have to write another CUDA kernel just to perform this operation?
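A minimal sketch of the interleaved (array-of-structures) layout being asked about. One caveat worth noting: a plain struct of two floats only has 4-byte alignment, so the compiler will still issue two 4-byte loads unless the struct is forced to 8-byte alignment (or float2 is used, which already is):

```cuda
// __align__(8) lets the compiler emit a single 8-byte load per thread.
struct __align__(8) Pair { float A; float B; };

__global__ void add_aos(const Pair *data, float *result, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        Pair p = data[i];        // one 8-byte load instead of two 4-byte loads
        result[i] = p.A + p.B;
    }
}
```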


float2 would be the easiest, I think. Take a look here for further performance info:
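A sketch of what the float2 version could look like (names are illustrative); float2 is a built-in CUDA vector type with 8-byte alignment, so each thread's load compiles to a single 64-bit memory instruction:

```cuda
// Same computation as the two-array kernel, but reading one float2 per thread.
__global__ void add_float2(const float2 *data, float *result, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float2 v = data[i];      // one 64-bit load instead of two 32-bit loads
        result[i] = v.x + v.y;   // .x plays the role of A[i], .y of B[i]
    }
}
```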


It seems using float2 only gives a very marginal bandwidth boost compared with using float. That is contrary to my expectation: I would have expected around a 2-fold gain, since each memory access instruction fetches twice the data. Well, I guess it is now bounded by the memory bus speed rather than by latency. Indeed, the results on the 295 and 275 show the achieved bandwidth is around 80% of the theoretical limit, so this kernel is not latency bound but bandwidth bound.

I am guessing the latency is hidden efficiently by thread scheduling. Maybe we should add some shared memory usage to limit the number of threads per block, or reduce the problem size, so that we can see the raw performance of reading float2 and float4?

Back to my problem. My kernel is also memory-access bound; right now it achieves a memory read throughput of 56 GB/s and a write throughput of 19 GB/s, giving me an overall throughput of 75 GB/s. So I doubt float2 will help me. Besides, my question is about how to update this float2 structure from 2 float arrays efficiently (actually only one array needs to be updated frequently). I am wondering if there is some special memcpy instruction or some smart way to achieve it.
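There is no dedicated strided-memcpy instruction for this, but one hedged sketch of how the update could be done is a tiny kernel that overwrites only the .x component of each element, leaving .y untouched (names are illustrative). For host-to-device updates, cudaMemcpy2D can also perform a strided copy (width 4 bytes, pitch 8 bytes), though such narrow strided copies tend to be slow:

```cuda
// Overwrite the .x (A) component of each float2 from the fresh A array;
// the .y (B) component is left untouched.
__global__ void update_x(float2 *data, const float *A_new, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        data[i].x = A_new[i];
}
```

Note the trade-off: this write pattern touches only half of each 8-byte element, so the update itself is not fully coalesced; whether the interleaved layout pays off depends on how often A changes relative to how often the main kernel runs.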