Hi, I have ported my algorithm to the GPU and it works great. However, I am still wondering whether I can optimize the CUDA code to make it run faster. My algorithm is memory bound, and I have already made sure that my memory accesses are coalesced.
My algorithm accesses two float arrays, A and B, of the same size N, and it always uses the same index into both in a calculation, say result[i] = A[i] + B[i].
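For reference, here is a minimal sketch of the kind of kernel I mean. It is not my real code: the names add_kernel and check_baseline, the test data, and the launch configuration of 256 threads per block are only placeholders for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Baseline: each thread issues two separate 4-byte loads and one 4-byte store.
__global__ void add_kernel(const float* A, const float* B, float* result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) result[i] = A[i] + B[i];
}

// Runs the kernel on n elements and returns how many results differ from the
// host-side sum, or -1 if the copy back from the device failed.
int check_baseline(int n)
{
    std::vector<float> hA(n), hB(n), hR(n);
    for (int i = 0; i < n; ++i) { hA[i] = 0.5f * i; hB[i] = 2.0f * i; }

    size_t bytes = n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dR = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dR, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // illustrative block size
    int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
    add_kernel<<<blocks, threads>>>(dA, dB, dR, n);

    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(hR.data(), dR, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dR);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (hR[i] != hA[i] + hB[i]) ++bad;
    return bad;
}

int main()
{
    printf("mismatches: %d\n", check_baseline(1 << 20));
    return 0;
}
```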
Now, CUDA can coalesce accesses into 128-byte transactions, which I take to mean that a thread can fetch 2 floats at once. So can I store my data in a struct that holds A[i] and B[i] side by side, and get half as many memory instructions? The problem is: when A is updated, how do I update this structure quickly? Are there any tricks for quickly copying my A array into the struct’s A component, or do I have to write another CUDA kernel just to perform this operation?
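To make the idea concrete, here is a rough sketch of what I have in mind, assuming the struct is simply CUDA's built-in float2 with x holding A[i] and y holding B[i]. The kernel names update_a, update_b, and add_interleaved, the helper check_interleaved, and the block size are all made up for illustration; update_a is the "extra kernel" I would like to avoid if there is a better way.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumed interleaved layout: AB[i].x = A[i], AB[i].y = B[i].

// Copies a (freshly updated) A into the x component of the interleaved array.
__global__ void update_a(const float* A, float2* AB, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) AB[i].x = A[i];
}

// Copies B into the y component of the interleaved array.
__global__ void update_b(const float* B, float2* AB, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) AB[i].y = B[i];
}

// Each thread reads both operands with a single 8-byte load.
__global__ void add_interleaved(const float2* AB, float* result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float2 v = AB[i];
        result[i] = v.x + v.y;
    }
}

// Builds the interleaved array, updates A once, recomputes the sum, and returns
// the number of mismatches against the host result (-1 on a failed copy back).
int check_interleaved(int n)
{
    std::vector<float> hA(n), hB(n), hR(n);
    for (int i = 0; i < n; ++i) { hA[i] = 0.5f * i; hB[i] = 2.0f * i; }

    size_t bytes = n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dR = nullptr;
    float2* dAB = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dR, bytes);
    cudaMalloc(&dAB, n * sizeof(float2));
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // illustrative block size
    int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
    update_a<<<blocks, threads>>>(dA, dAB, n);
    update_b<<<blocks, threads>>>(dB, dAB, n);

    // Simulate an update of A: only the x components need to be refreshed.
    for (int i = 0; i < n; ++i) hA[i] = 1.0f + i;
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    update_a<<<blocks, threads>>>(dA, dAB, n);

    add_interleaved<<<blocks, threads>>>(dAB, dR, n);
    cudaError_t err = cudaMemcpy(hR.data(), dR, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dR);
    cudaFree(dAB);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (hR[i] != hA[i] + hB[i]) ++bad;
    return bad;
}

int main()
{
    printf("mismatches: %d\n", check_interleaved(1 << 20));
    return 0;
}
```

I suspect cudaMemcpy2D with a destination pitch of sizeof(float2) and a width of sizeof(float) might also do this strided copy without a custom kernel, but I have not measured whether either approach actually pays off.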