Expanding 1D code to 3D code, so making use of float4 would be useful for coalescing.
But if I only wish to manipulate the x,y and z component of float4 can I just refer to them, or do I also need to refer to the w component even though it is redundant?
For example, what happens with the following code?
float4 pos;
pos.x+=2;
pos.y+=3;
pos.z+=1;
Would this be coalesced? If not would I need to add a statement referring to pos.w for coalescing?
And in order for the compiler to interpret the three (or four) statements as coalesced do they need to be sequential? i.e. as above or can statements be inserted in between, such as
float4 pos;
int c;
pos.x+=2;
c+=10;
pos.y+=3;
pos.z+=1;
Also, the compiler tells me that operations such as adding two float4s cannot be done, such as
float4 pos1,pos2,pos3;
pos3 = pos1 + pos2;
and that if adding two float4s the addition must be written explicitly, such as
You can also use the “volatile” keyword to force the compiler to read all the components even though you only use some of them. e.g.:
volatile float4 pos = posArray[i];
pos.x += 2;
It shouldn’t matter which order you access the components in.
CUDA doesn’t provide operations on the vector types out of the box, but you can define them yourself or use the “cutil_math.h” header included in the SDK.
Upon further reflection, I realized that this question is dumb - I got confused by the volatile keyword. Since pos and posArray are separate variables, writing to pos will have no affect on posArray. If pos is in registers, then writing to pos just writes to those registers. pos being volatile just means that those writes cannot be optimized out - the value in the register has to be modified - but this does not involve memory.
my code load data from global mem to shm is slow, can i use vector load like this?
extern shared float sdata;
int total_id = threadIdx.x + blockIdx.x * blockDim.x;
int tid = threadIdx.x;
if (tid % 4 == 0) reinterpret_cast<float*>(sdata)[tid] = reinterpret_cast<float*>(in)[total_id];
__syncthreads(); …
but sdata not right, some is 0, why? and is it any better using vector load like this?
On the GPU, data must be naturally aligned. That means access to a float4 stored in memory must be 16-byte aligned. When data is accessed without the required alignment, the result is undefined. This means that simply casting a float* into a float4* and then dereferencing the latter will not deliver the desired result unless proper alignment of the float4* pointer is guaranteed (that is the job of the programmer).
FWIW, the badly formatted code in the question does not actually show the use of float4, which could be another reason that not all the data is being read, assuming this is a true copy of the actual code used.