Using float4

chrismc · October 28, 2008, 9:32am

Expanding 1D code to 3D code, so making use of float4 would be useful for coalescing.

But if I only wish to manipulate the x,y and z component of float4 can I just refer to them, or do I also need to refer to the w component even though it is redundant?

For example, what happens with the following code?

float4	 pos;

pos.x+=2;

pos.y+=3;

pos.z+=1;

Would this be coalesced? If not would I need to add a statement referring to pos.w for coalescing?

And in order for the compiler to interpret the three (or four) statements as coalesced do they need to be sequential? i.e. as above or can statements be inserted in between, such as

float4	 pos;

int		  c;

pos.x+=2;

c+=10;

pos.y+=3;

pos.z+=1;

Also, the compiler tells me that operations such as adding two float4s cannot be done, such as

float4   pos1,pos2,pos3;

pos3 = pos1 + pos2;

and that if adding two float4s the addition must be written explicitly, such as

float4   pos1,pos2,pos3;

pos3.x = pos1.x + pos2.x;

pos3.y = pos1.y + pos2.y;

pos3.z = pos1.z + pos2.z;

If this is true when will float4 addition be available? I am using a C870.

Simon_Green · October 28, 2008, 11:26am

You can also use the “volatile” keyword to force the compiler to read all the components even though you only use some of them. e.g.:

volatile float4 pos = posArray[i];

pos.x += 2;

It shouldn’t matter which order you access the components in.

CUDA doesn’t provide operations on the vector types out of the box, but you can define them yourself or use the “cutil_math.h” header included in the SDK.

stencil · August 19, 2009, 3:25am

You can also use the “volatile” keyword to force the compiler to read all the components even though you only use some of them. e.g.:
volatile float4 pos = posArray[i];

pos.x += 2;
It shouldn’t matter which order you access the components in.

CUDA doesn’t provide operations on the vector types out of the box, but you can define them yourself or use the “cutil_math.h” header included in the SDK.

Would this cause every usage of pos to result in a memory access? For example will

volatile float4 pos = posArray[i];

pos.x = 2;

pos.y = 3;

pos.z = 4

Be 3 (or maybe even 4) coalesced memory accesses?

Edit: changed example to make it simpler

stencil · August 20, 2009, 7:43am

Would this cause every usage of pos to result in a memory access? For example will
volatile float4 pos = posArray[i];

pos.x = 2;

pos.y = 3;

pos.z = 4
Be 3 (or maybe even 4) coalesced memory accesses?

Edit: changed example to make it simpler

Upon further reflection, I realized that this question is dumb - I got confused by the volatile keyword. Since pos and posArray are separate variables, writing to pos will have no affect on posArray. If pos is in registers, then writing to pos just writes to those registers. pos being volatile just means that those writes cannot be optimized out - the value in the register has to be modified - but this does not involve memory.

2544995178 · April 3, 2024, 6:25am

my code load data from global mem to shm is slow, can i use vector load like this?
extern shared float sdata;
int total_id = threadIdx.x + blockIdx.x * blockDim.x;
int tid = threadIdx.x;
if (tid % 4 == 0) reinterpret_cast<float*>(sdata)[tid] = reinterpret_cast<float*>(in)[total_id]；
__syncthreads(); …
but sdata not right， some is 0, why? and is it any better using vector load like this?

njuffa · April 3, 2024, 7:20am

On the GPU, data must be naturally aligned. That means access to a float4 stored in memory must be 16-byte aligned. When data is accessed without the required alignment, the result is undefined. This means that simply casting a float* into a float4* and then dereferencing the latter will not deliver the desired result unless proper alignment of the float4* pointer is guaranteed (that is the job of the programmer).

FWIW, the badly formatted code in the question does not actually show the use of float4, which could be another reason that not all the data is being read, assuming this is a true copy of the actual code used.

Topic		Replies	Views
Coalesced memory access CUDA Programming and Performance	3	3085	January 20, 2009
Float4 must read adjacent element? Can we modify it for coalesced reading? CUDA Programming and Performance	7	853	May 11, 2022
Coalesced vs non-coalesced in reduction example Why float4 is not coalesced CUDA Programming and Performance	1	1393	October 13, 2008
Quick question about memory coalescence CUDA Programming and Performance	5	5672	May 5, 2008
Is it possible to avoid optimization? Compiler breaks memory coalescing. CUDA Programming and Performance	7	3139	March 14, 2008
Packed datatype in an union - 'volatile' keyword necessary ? CUDA Programming and Performance	3	1586	September 22, 2016
Reading from float4 corrupts memory CUDA Programming and Performance cuda	18	53	October 3, 2024
nvcc horribly breaking float4 read CUDA Programming and Performance	5	9653	June 25, 2008
Coalesced vs non-coalesced in reduction example Why float4-reads are not coalesced? CUDA Programming and Performance	10	4072	October 15, 2008
Does CUDA support vectorized instruction for += and atomicAdd? CUDA Programming and Performance	2	414	March 24, 2024

Using float4

Related topics