float4 in a register?

I have a kernel that loads a pair of points, updates the positions and writes the points back to memory. Here’s the first version of the code…

float p1x, p1y, p1z;
float p2x, p2y, p2z;

p1x = d_points[ix1].x;
p1y = d_points[ix1].y;
p1z = d_points[ix1].z;
p2x = d_points[ix2].x;
p2y = d_points[ix2].y;
p2z = d_points[ix2].z;

... calculations ...

d_points[ix1].x = p1x;
d_points[ix1].y = p1y;
d_points[ix1].z = p1z;
d_points[ix2].x = p2x;
d_points[ix2].y = p2y;
d_points[ix2].z = p2z;

Now, obviously this method of copying is inefficient, because the X,Y,Z coordinates are read one at a time from global memory, and because the points can’t be sorted, the memory access cannot be coalesced. So, I tried using float4, with the following code…

float4 p1;
float4 p2;

p1 = d_points[ix1];
p2 = d_points[ix2];

... calculations ...

d_points[ix1] = p1;
d_points[ix2] = p2;

Then CUDA can copy the full XYZ in one access, and only 25% of the bandwidth is wasted. But this second version runs about 20% slower, which is about the same performance as if I don’t copy to registers at all and just calculate directly on global memory.

It seems like the float4 is being stored in global memory, and in fact changing p1 and p2 to float4[100] makes no difference to performance.

I have already turned off the -G (debug) option in the CUDA compiler, and also turned up the compute capability option from 20 to 35. Max Used Register is set to 0. Is there some other option I need to set to get float4 into register memory, or something else I’m missing?

It might be because float4 is 16 bytes while 3 floats would be 12 bytes. Technically, you’d have more memory to deal with if you used float4s.

Three separate floats result in 3 memory accesses, because they are loaded one at a time. However, a float4 is 128 bits, so it should fit into a single memory access, as noted in lots of examples on the net.
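For reference, a minimal sketch of the pattern being discussed. float4 is declared with 16-byte alignment, so an optimized build is allowed to turn each whole-struct copy into one 128-bit load or store, while float3 is only 4-byte aligned. The kernel name, the index arrays d_ix1 and d_ix2, the host helper and the placeholder calculation are all assumptions for illustration, not the original kernel.

```cuda
#include <vector>
#include <cuda_runtime.h>

// float4 carries a 16-byte alignment guarantee; float3 does not.
static_assert(sizeof(float4) == 16 && alignof(float4) == 16, "float4: 16 bytes, 16-byte aligned");
static_assert(sizeof(float3) == 12 && alignof(float3) == 4,  "float3: 12 bytes, 4-byte aligned");

// Hypothetical kernel: one thread per pair of points.
// Assumes no point appears in two pairs within one launch (otherwise the writes race).
__global__ void updatePairs(float4 *d_points, const int *d_ix1, const int *d_ix2, int nPairs)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nPairs) return;

    int ix1 = d_ix1[t];
    int ix2 = d_ix2[t];

    float4 p1 = d_points[ix1];   // whole-struct copy: eligible for one 128-bit load
    float4 p2 = d_points[ix2];

    // Placeholder for the real calculations: move both points halfway
    // toward the midpoint of the pair along x.
    float mx = 0.5f * (p1.x + p2.x);
    p1.x = 0.5f * (p1.x + mx);
    p2.x = 0.5f * (p2.x + mx);

    d_points[ix1] = p1;          // whole-struct copy: eligible for one 128-bit store
    d_points[ix2] = p2;
}

// Hypothetical host helper: runs the kernel once and returns the updated points.
// Error checking is omitted to keep the sketch short.
std::vector<float4> runUpdatePairs(std::vector<float4> pts,
                                   const std::vector<int> &ix1,
                                   const std::vector<int> &ix2)
{
    int nPairs = static_cast<int>(ix1.size());
    float4 *d_points = nullptr;
    int *d_ix1 = nullptr, *d_ix2 = nullptr;

    cudaMalloc(&d_points, pts.size() * sizeof(float4));
    cudaMalloc(&d_ix1, nPairs * sizeof(int));
    cudaMalloc(&d_ix2, nPairs * sizeof(int));
    cudaMemcpy(d_points, pts.data(), pts.size() * sizeof(float4), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ix1, ix1.data(), nPairs * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ix2, ix2.data(), nPairs * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (nPairs + threads - 1) / threads;
    updatePairs<<<blocks, threads>>>(d_points, d_ix1, d_ix2, nPairs);

    // This copy waits for the kernel to finish before reading the results.
    cudaMemcpy(pts.data(), d_points, pts.size() * sizeof(float4), cudaMemcpyDeviceToHost);
    cudaFree(d_points);
    cudaFree(d_ix1);
    cudaFree(d_ix2);
    return pts;
}
```

Whether the compiler really emits a single 128-bit instruction can be checked by dumping the SASS of an optimized build and looking at the width of the global load and store instructions.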

I just did some profiling on the float4 kernel, and it is throwing up more Global Memory Access issues than the old one…

Global load L2 Transactions/Access = 31.8, Ideal Transactions/Access = 4 [3684 L2 transactions for 116 total executions]

This message occurs 16 times - four times each for lines 4, 5, 9 and 10 in the 2nd code example above. It’s like the compiler is expanding it into four separate operations, like this…

p1.x = d_points[ix1].x;
p1.y = d_points[ix1].y;
p1.z = d_points[ix1].z;
p1.w = d_points[ix1].w;

Hmm, yes, I just expanded it like this in the code, and it made no difference. Taking out the w made it a little faster, but still a long way behind the original float code.

“Three separate floats results in 3 memory accesses”

i would think this is conditional on your declaration of d_points, used as d_points[ix1].x
3 memory accesses per thread would be ideal, and i am not sure whether your implementation would attain the ideal

here:

d_points[ix1] = p1;
d_points[ix2] = p2;

the ideal is 2 memory accesses per thread, and is again conditional on the way ix2 is calculated
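To make the dependence on the index calculation concrete, here is a rough host-side model that counts how many distinct memory segments one warp touches for a given set of indices. The 128-byte segment size used in the usage note, the 32-thread warp and both function names are assumptions for illustration; nothing here is read from the hardware or the profiler.

```cuda
#include <cstddef>
#include <set>
#include <vector>
#include <cuda_runtime.h>   // only needed for sizeof(float4)

// Counts the distinct segments touched when each thread t reads elemBytes
// bytes starting at element index ix[t]. segmentBytes is an assumed
// transaction granularity, passed in by the caller.
std::size_t segmentsTouched(const std::vector<int> &ix,
                            std::size_t elemBytes,
                            std::size_t segmentBytes)
{
    std::set<std::size_t> segments;
    for (int i : ix) {
        std::size_t first = static_cast<std::size_t>(i) * elemBytes;
        std::size_t last  = first + elemBytes - 1;
        // An element may straddle a segment boundary, so walk its whole byte range.
        for (std::size_t s = first / segmentBytes; s <= last / segmentBytes; ++s)
            segments.insert(s);
    }
    return segments.size();
}

// Indices for one warp of 32 threads where thread t accesses element t * stride.
std::vector<int> stridedIndices(int stride)
{
    std::vector<int> ix(32);
    for (int t = 0; t < 32; ++t)
        ix[t] = t * stride;
    return ix;
}
```

With sizeof(float4) as the element size and 128-byte segments, segmentsTouched(stridedIndices(1), sizeof(float4), 128) gives 4, while a stride of 8 puts every thread in its own segment and gives 32. Under these assumptions, a measured 31.8 against an ideal of 4 is what nearly fully scattered indices would look like.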

Final update - did more testing with compiler debug info turned off (had it accidentally turned back on before - my mistake), and the float4 (one 128-bit memory access per point) is now faster than individual floats (three 32-bit accesses per point), by about 20%. With debug info turned on, it’s the other way around. Weird.
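The debug-build effect is less odd than it looks: by default nvcc's -G option turns off optimization of device code, so timings taken with it say little about the release build. Below is a hedged sketch of how the two variants could be compared with CUDA events. The kernels, the +1.0f update and the helper names are made up for illustration, and the access pattern here is contiguous, unlike the scattered indices in the original kernel, so the absolute numbers will differ.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Build note (file name is hypothetical): nvcc -O3 -lineinfo bench.cu -o bench
// Leave out -G when timing; by default it disables device-code optimization.

typedef void (*PointKernel)(float4 *, int);

// Variant A: written as three 32-bit loads and three 32-bit stores per point.
// The compiler remains free to merge or reorder these.
__global__ void shiftScalar(float4 *d_points, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = d_points[i].x;
    float y = d_points[i].y;
    float z = d_points[i].z;
    d_points[i].x = x + 1.0f;
    d_points[i].y = y + 1.0f;
    d_points[i].z = z + 1.0f;
}

// Variant B: whole float4 copied in and out, w left untouched.
__global__ void shiftVector(float4 *d_points, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 p = d_points[i];
    p.x += 1.0f;
    p.y += 1.0f;
    p.z += 1.0f;
    d_points[i] = p;
}

// Times `reps` launches of one kernel with CUDA events; returns milliseconds.
float timeKernel(PointKernel kernel, float4 *d_points, int n, int reps)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    kernel<<<blocks, threads>>>(d_points, n);      // warm-up launch, not timed

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        kernel<<<blocks, threads>>>(d_points, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                    // wait until all timed launches finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Runs both variants on zero-initialised buffers, prints the timings and
// checks that the first point ends up identical. Error checking is omitted.
bool variantsAgree(int n, int reps)
{
    float4 *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_a, n * sizeof(float4));
    cudaMalloc(&d_b, n * sizeof(float4));
    cudaMemset(d_a, 0, n * sizeof(float4));
    cudaMemset(d_b, 0, n * sizeof(float4));

    float msScalar = timeKernel(shiftScalar, d_a, n, reps);
    float msVector = timeKernel(shiftVector, d_b, n, reps);
    printf("scalar: %.3f ms, float4: %.3f ms\n", msScalar, msVector);

    float4 a, b;
    cudaMemcpy(&a, d_a, sizeof(float4), cudaMemcpyDeviceToHost);
    cudaMemcpy(&b, d_b, sizeof(float4), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return a.x == b.x && a.y == b.y && a.z == b.z;
}
```

Calling variantsAgree(1 << 20, 100) once from a build with -G and once from a build without it would show how much of the difference comes from the compiler settings rather than from the memory access pattern.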

Anyway, resolved now, thank you.