I have a kernel that loads a pair of points, updates the positions and writes the points back to memory. Here’s the first version of the code…
float p1x, p1y, p1z;
float p2x, p2y, p2z;
p1x = d_points[ix1].x;
p1y = d_points[ix1].y;
p1z = d_points[ix1].z;
p2x = d_points[ix2].x;
p2y = d_points[ix2].y;
p2z = d_points[ix2].z;
... calculations ...
d_points[ix1].x = p1x;
d_points[ix1].y = p1y;
d_points[ix1].z = p1z;
d_points[ix2].x = p2x;
d_points[ix2].y = p2y;
d_points[ix2].z = p2z;
Obviously this method of copying is inefficient: the X, Y and Z coordinates are read one at a time from global memory, and because the points can’t be sorted, the memory accesses cannot be coalesced. So I tried using float4 instead, with the following code…
float4 p1;
float4 p2;
p1 = d_points[ix1];
p2 = d_points[ix2];
... calculations ...
d_points[ix1] = p1;
d_points[ix2] = p2;
With float4, CUDA can copy the full XYZ in one 16-byte access, and only 25% of the bandwidth is wasted (the unused w component). But this second version runs about 20% slower than the first, which is about the same performance I get if I don’t copy to registers at all and just calculate directly on global memory.
It seems like the float4 variables are being stored in global memory rather than in registers, and in fact changing p1 and p2 to float4[100] arrays makes no difference to performance.
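For reference, here is roughly what the float4 version looks like as a complete kernel. This is only a minimal sketch: the kernel name `updatePairs`, the index arrays `d_ix1` and `d_ix2`, the host helper `runUpdatePairsExample`, and the placeholder calculation are stand-ins I made up for illustration, and it assumes each point belongs to at most one pair so that no two threads write the same point.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread loads one pair of points, updates both,
// and writes them back. d_ix1/d_ix2 are assumed per-pair index arrays.
__global__ void updatePairs(float4* d_points, const int* d_ix1,
                            const int* d_ix2, int numPairs)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numPairs) return;

    int ix1 = d_ix1[t];
    int ix2 = d_ix2[t];

    // float4 is 16-byte aligned, so each point can be fetched with one
    // 128-bit load instead of three separate 32-bit loads.
    float4 p1 = d_points[ix1];
    float4 p2 = d_points[ix2];

    // Placeholder for "... calculations ...": move each point 10% of the
    // way toward the other one.
    float dx = p2.x - p1.x;
    float dy = p2.y - p1.y;
    float dz = p2.z - p1.z;
    p1.x += 0.1f * dx;  p1.y += 0.1f * dy;  p1.z += 0.1f * dz;
    p2.x -= 0.1f * dx;  p2.y -= 0.1f * dy;  p2.z -= 0.1f * dz;

    d_points[ix1] = p1;
    d_points[ix2] = p2;
}

// Host-side check with two points forming a single pair.
// Returns 0 if the kernel ran and produced the expected positions.
int runUpdatePairsExample()
{
    float4 h_points[2] = { make_float4(0.0f, 0.0f, 0.0f, 1.0f),
                           make_float4(10.0f, 0.0f, 0.0f, 1.0f) };
    int h_ix1 = 0, h_ix2 = 1;

    float4* d_points = nullptr;
    int* d_ix1 = nullptr;
    int* d_ix2 = nullptr;
    cudaMalloc(&d_points, sizeof(h_points));
    cudaMalloc(&d_ix1, sizeof(int));
    cudaMalloc(&d_ix2, sizeof(int));
    cudaMemcpy(d_points, h_points, sizeof(h_points), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ix1, &h_ix1, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ix2, &h_ix2, sizeof(int), cudaMemcpyHostToDevice);

    updatePairs<<<1, 32>>>(d_points, d_ix1, d_ix2, 1);
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t syncErr = cudaDeviceSynchronize();
    cudaMemcpy(h_points, d_points, sizeof(h_points), cudaMemcpyDeviceToHost);

    cudaFree(d_points);
    cudaFree(d_ix1);
    cudaFree(d_ix2);

    if (launchErr != cudaSuccess || syncErr != cudaSuccess) return 1;

    // The points started 10 apart on x, so each should have moved by 1.
    bool ok = fabsf(h_points[0].x - 1.0f) < 1e-5f &&
              fabsf(h_points[1].x - 9.0f) < 1e-5f;
    printf("p1.x = %f, p2.x = %f\n", h_points[0].x, h_points[1].x);
    return ok ? 0 : 1;
}
```

Calling `runUpdatePairsExample()` from `main` is enough to exercise it. Note that the single 128-bit access depends on `d_points` really being declared as `float4*`; a `float3` or a plain three-float struct does not have the 16-byte alignment needed for that.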
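One way to check where p1 and p2 actually end up is to ask the runtime how many registers and how much per-thread local memory the compiled kernel uses. The sketch below does that with `cudaFuncGetAttributes`; the kernel `scalePoints` and the helper `localBytesPerThread` are hypothetical stand-ins, not my real code. If the reported local memory size is nonzero, some per-thread data is living outside registers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: load a float4 into a local
// variable, modify it, and write it back.
__global__ void scalePoints(float4* d_points, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 p = d_points[i];
    p.x *= 2.0f;
    p.y *= 2.0f;
    p.z *= 2.0f;
    d_points[i] = p;
}

// Prints the kernel's resource usage and returns the number of bytes of
// local memory used per thread, or -1 if the query fails.
long localBytesPerThread()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scalePoints);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return -1;
    }

    printf("registers per thread:    %d\n", attr.numRegs);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return static_cast<long>(attr.localSizeBytes);
}

// The same numbers are also reported at compile time when building with
//   nvcc -Xptxas -v ...
// which makes ptxas print register and spill counts for each kernel.
```

I would expect a small kernel like this to report zero bytes of local memory in a build without -G, but the numbers for the real kernel are what matter.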
I have already turned off the -G (debug) option in the CUDA compiler, and also turned up the compute capability option from 20 to 35. Max Used Register is set to 0. Is there some other option I need to set to get the float4 variables into registers, or is there something else I’m missing?