float4 in a register?

surfdabbler · February 5, 2015, 5:15am

I have a kernel that loads a pair of points, updates the positions and writes the points back to memory. Here’s the first version of the code…

float p1x, p1y, p1z;
float p2x, p2y, p2z;

p1x = d_points[ix1].x;
p1y = d_points[ix1].y;
p1z = d_points[ix1].z;
p2x = d_points[ix2].x;
p2y = d_points[ix2].y;
p2z = d_points[ix2].z;

... calculations ...

d_points[ix1].x = p1x;
d_points[ix1].y = p1y;
d_points[ix1].z = p1z;
d_points[ix2].x = p2x;
d_points[ix2].y = p2y;
d_points[ix2].z = p2z;

Now, obviously this method of copying is inefficient: the X, Y and Z coordinates are read one at a time from global memory, and because the points can’t be sorted, the memory accesses cannot be coalesced. So I tried using float4, with the following code…

float4 p1;
float4 p2;

p1 = d_points[ix1];
p2 = d_points[ix2];

... calculations ...

d_points[ix1] = p1;
d_points[ix2] = p2;

Then CUDA can copy the full XYZ in one access, and only 25% of the bandwidth is wasted (the unused w component). But this second version runs about 20% slower, which is about the same performance I get if I don’t copy to registers at all and just calculate directly on global memory.
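For reference, here is a minimal, self-contained sketch of the float4 pattern described above. The kernel name `relaxPairs`, the "move 10% toward the midpoint" arithmetic, and the disjoint pair indexing (`2*i`, `2*i + 1`) are all assumptions made for illustration; the original kernel uses scattered indices and different calculations. In an optimized build, each whole-struct copy can compile to a single 16-byte load or store, and `p1` and `p2` are ordinary locals that the compiler is free to keep in registers.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hypothetical kernel: each thread owns one pair of points and pulls both
// 10% of the way toward their midpoint (a stand-in for "... calculations ...").
__global__ void relaxPairs(float4* d_points, int numPairs)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPairs) return;

    // Assumed indexing for the demo: disjoint pairs, so no two threads
    // ever touch the same point.
    int ix1 = 2 * i;
    int ix2 = 2 * i + 1;

    // Whole-struct copies into local variables.
    float4 p1 = d_points[ix1];
    float4 p2 = d_points[ix2];

    float mx = 0.5f * (p1.x + p2.x);
    float my = 0.5f * (p1.y + p2.y);
    float mz = 0.5f * (p1.z + p2.z);
    p1.x += 0.1f * (mx - p1.x);  p2.x += 0.1f * (mx - p2.x);
    p1.y += 0.1f * (my - p1.y);  p2.y += 0.1f * (my - p2.y);
    p1.z += 0.1f * (mz - p1.z);  p2.z += 0.1f * (mz - p2.z);

    // Whole-struct copies back to global memory.
    d_points[ix1] = p1;
    d_points[ix2] = p2;
}

// Host-side check: returns true if the kernel ran and the x coordinates
// moved as expected (point k starts at x = k).
bool runRelaxPairsDemo()
{
    const int numPairs = 256;
    const int numPoints = 2 * numPairs;
    const size_t bytes = numPoints * sizeof(float4);

    std::vector<float4> h(numPoints);
    for (int k = 0; k < numPoints; ++k)
        h[k] = make_float4(static_cast<float>(k), 0.0f, 0.0f, 0.0f);

    float4* d_points = nullptr;
    if (cudaMalloc(&d_points, bytes) != cudaSuccess) return false;
    cudaMemcpy(d_points, h.data(), bytes, cudaMemcpyHostToDevice);

    relaxPairs<<<(numPairs + 127) / 128, 128>>>(d_points, numPairs);

    cudaError_t err = cudaMemcpy(h.data(), d_points, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_points);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < numPairs; ++i) {
        if (std::fabs(h[2 * i].x     - (2 * i + 0.05f)) > 1e-3f) return false;
        if (std::fabs(h[2 * i + 1].x - (2 * i + 0.95f)) > 1e-3f) return false;
    }
    return true;
}
```

Calling `runRelaxPairsDemo()` from `main` and timing the kernel in both a debug (`-G`) and a release build is one way to compare the two configurations on the same code.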

It seems like the float4 is being stored in global memory, and in fact changing p1 and p2 to float4[100] makes no difference to performance.

I have already turned off the -G (debug) option in the CUDA compiler, and also turned up the compute capability option from 20 to 35. Max Used Register is set to 0. Is there some other option I need to set to get float4 into register memory, or something else I’m missing?
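One way to check where a kernel's local variables end up is to ask the runtime how many registers and how much per-thread local memory the compiled kernel uses. The sketch below uses `cudaFuncGetAttributes`; the kernel `copyPoint` and the helper `reportKernelResources` are hypothetical names for illustration. A non-zero local memory size suggests that some locals were spilled or placed on the stack, which is typical of a `-G` build because that flag turns off device-code optimizations. Compiling with `nvcc -Xptxas -v` prints similar information at build time.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel with a float4 local, similar in shape to the one discussed.
__global__ void copyPoint(float4* d_points, int ix1, int ix2)
{
    float4 p1 = d_points[ix1];
    p1.x += 1.0f;
    d_points[ix2] = p1;
}

// Prints the per-thread register count and local memory size of copyPoint.
// Returns false if the attributes could not be queried.
bool reportKernelResources()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, copyPoint);
    if (err != cudaSuccess) {
        std::printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    std::printf("registers per thread:    %d\n", attr.numRegs);
    std::printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return true;
}
```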

MutantJohn · February 5, 2015, 5:47am

It might be because float4 is 16 bytes while 3 floats would be 12 bytes. Technically, you’d have more memory to deal with if you used float4s.

surfdabbler · February 5, 2015, 6:14am

Three separate floats result in 3 memory accesses, because they are loaded one at a time. However, a float4 is 128 bits, so it should be possible to load it in a single memory access, as noted in lots of examples on the net.
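The size and alignment of the built-in vector types can be checked at compile time. This sketch assumes the standard CUDA headers, where float3 is 12 bytes with 4-byte alignment and float4 is 16 bytes with 16-byte alignment; the 16-byte alignment is what lets a float4 be moved with one 128-bit access, while a float3 generally has to be handled as separate 4-byte pieces. The helper `isAligned16` is a hypothetical name added for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Compile-time checks on the built-in vector types.
static_assert(sizeof(float3) == 12,  "float3 is three packed floats");
static_assert(alignof(float3) == 4,  "float3 is only 4-byte aligned");
static_assert(sizeof(float4) == 16,  "float4 is four floats");
static_assert(alignof(float4) == 16, "float4 is 16-byte aligned");

// Hypothetical helper: true if the address is a multiple of 16 bytes,
// i.e. a valid target for a single 128-bit access.
inline bool isAligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

Every element of a `float4` array satisfies `isAligned16`, provided the array itself starts on a 16-byte boundary, which is the case for memory returned by `cudaMalloc`.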

I just did some profiling on the float4 kernel, and it is throwing up more Global Memory Access issues than the old one…

Global load L2 Transactions/Access = 31.8, Ideal Transactions/Access = 4 [3684 L2 transactions for 116 total executions]

This message occurs 16 times, four times each for lines 4, 5, 9 and 10 in the 2nd code example above. It’s like the compiler is expanding it into four separate operations, like this…

p1.x = d_points[ix1].x;
p1.y = d_points[ix1].y;
p1.z = d_points[ix1].z;
p1.w = d_points[ix1].w;

Hmm, yes, I just expanded it like this in the code, and it made no difference. Taking out the w made it a little faster, but still a long way behind the original float code.
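One possible reading of the profiler numbers can be worked through on the host. The sketch below assumes that an L2 transaction is a 32-byte sector and that a warp has 32 threads; the thread does not say which GPU or profiler version was used, so treat both as assumptions. The helpers `countSectors` and `warpAddresses` are hypothetical names. Under these assumptions, a warp issuing perfectly contiguous 4-byte loads touches 4 sectors (the "ideal" figure), while a warp issuing 4-byte loads at widely scattered addresses touches about 32, which is close to the reported 31.8.

```cuda
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical helper: number of distinct 32-byte sectors touched when each
// thread reads accessBytes starting at its byte address in addrs.
int countSectors(const std::vector<std::uint64_t>& addrs, int accessBytes)
{
    const std::uint64_t kSectorBytes = 32;  // assumed transaction size
    std::set<std::uint64_t> sectors;
    for (std::uint64_t a : addrs) {
        std::uint64_t first = a / kSectorBytes;
        std::uint64_t last  = (a + accessBytes - 1) / kSectorBytes;
        for (std::uint64_t s = first; s <= last; ++s) sectors.insert(s);
    }
    return static_cast<int>(sectors.size());
}

// Hypothetical helper: byte addresses for one 32-thread warp, where
// thread t reads at t * strideBytes.
std::vector<std::uint64_t> warpAddresses(std::uint64_t strideBytes)
{
    std::vector<std::uint64_t> addrs(32);
    for (std::uint64_t t = 0; t < 32; ++t) addrs[t] = t * strideBytes;
    return addrs;
}
```

With these helpers, `countSectors(warpAddresses(4), 4)` gives 4 (contiguous floats), `countSectors(warpAddresses(1600), 4)` gives 32 (one scattered float per thread), and `countSectors(warpAddresses(1600), 16)` also gives 32 (one scattered float4 per thread). In this model a scattered float4 load fetches all four components for the sector cost of a single scattered float load.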

little_jimmy · February 5, 2015, 8:49am

“Three separate floats results in 3 memory accesses”

I would think this is conditional on your declaration of d_points, used as d_points[ix1].x
3 memory accesses per thread would be ideal, and I am not sure whether your implementation would attain the ideal

here:

d_points[ix1] = p1;
d_points[ix2] = p2;

the ideal is 2 memory accesses per thread, and is again conditional on the way ix2 is calculated

surfdabbler · February 5, 2015, 9:30am

Final update - did more testing with compiler debug info turned off (had it accidentally turned back on before - my mistake), and the float4 (one 128-bit memory access per point) is now faster than individual floats (three 32-bit accesses per point), by about 20%. With debug info turned on, it’s the other way around. Weird.

Anyway, resolved now, thank you.
