float3 support details: which operations does float3 support?

Hi all,

Question 1: Does float3 support addition and similar operations?

float3 is listed, among others, as a “built-in vector type”. Is it possible to write something like

[codebox]
int main()
{
    float3 px  = make_float3(1.0, 2.0, 3.0);
    float3 px2 = make_float3(1.0, 2.0, 3.0);
    float3 px3 = px + px2;
}
[/codebox]

When I do so, I get the following error:

float3> nvcc float3.cu

float3.cu(53): error: no operator “+” matches these operands

        operand types are: float3 + float3

1 error detected in the compilation of “/tmp/tmpxft_000014bf_00000000-4_float3.cpp1.ii”.

Do I have to include something, did I make a mistake, or is it simply not possible?

Question 2

If the answer to Question 1 is yes and float3 supports some basic operations, where can I find details, for example a complete list of them and how exactly they work?

Question 3

If the answer to Question 1 is yes, does float3 + float3 differ in speed from

vecAdd<<<1,3>>>(a,b,c);

where vecAdd is something like

[codebox]
// launched with <<<...>>>, so it has to be __global__ rather than __device__
__global__ void vecAdd(const float* a, const float* b, float* out)
{
    int i = threadIdx.x;
    out[i] = a[i] + b[i];
}
[/codebox]

Thanks to everybody taking his/her precious time to answer these rookie questions!

System:

Cuda 2.2

Linux suse 11.1

GPU: GTX 275

Intel E6600@ 2GB RAM

Ad 1. No operators are defined for the floatN/intN vector types, but you can implement them yourself.

Ad 3. You should try to write bigger kernels. vecAdd<<<1,3>>>(a,b,c) incurs so much overhead for the kernel launch and for copying data between host and device that it would probably be much slower than doing the addition on the CPU.
For vector operations this small, you can use the CPU's SIMD instructions instead.

nvcc supports the operator overloading syntax from C++, so you can write functions like this:

__device__ float3 operator+(const float3 &a, const float3 &b) {
  return make_float3(a.x+b.x, a.y+b.y, a.z+b.z);
}
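One caveat: a function marked only `__device__` cannot be called from host code such as the `main()` in the original question. A minimal sketch, assuming you want the operator usable on both sides, marks it `__host__ __device__`. The scalar `operator*` and the `addPixels` helper are hypothetical additions for illustration; CUDA itself does not define them for `float3`.

```cuda
#include <cuda_runtime.h>  // float3, make_float3

// Component-wise addition, callable from both host and device code.
__host__ __device__ inline float3 operator+(const float3 &a, const float3 &b)
{
    return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);
}

// Hypothetical extra: scale a vector by a scalar.
__host__ __device__ inline float3 operator*(float s, const float3 &v)
{
    return make_float3(s * v.x, s * v.y, s * v.z);
}

// Mirrors the code from the original question; runs on the host.
inline float3 addPixels()
{
    float3 px  = make_float3(1.0f, 2.0f, 3.0f);
    float3 px2 = make_float3(1.0f, 2.0f, 3.0f);
    return px + px2;  // compiles once operator+ is visible to host code
}
```

As far as I know, the CUDA SDK samples also ship a header with a fuller set of such overloads (cutil_math.h in older SDKs, helper_math.h in later ones), which may be worth checking before writing your own.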

Hi again,

Thanks for the quick answers. So let's say you have to work with a million pixels, and each pixel consists of 3 values. Do you get better performance when you handle them in one big float vector like this

[codebox]
// one thread per float; n is the total number of floats
__global__ void vecAdd(const float* a, const float* b, float* res, int n)
{
    int i = (blockIdx.x*blockDim.x) + threadIdx.x;
    if (i < n)    // 5860*512 threads is slightly more than 3,000,000
        res[i] = a[i] + b[i];
}

vecAdd<<<5860,512>>>(a,b,c,3000000);
[/codebox]

or with a float3 vector and code something like this (not sure if this code works, but I think you get the idea)?

[codebox]
__device__ float3 operator+(const float3 &a, const float3 &b)
{
    return make_float3(a.x+b.x, a.y+b.y, a.z+b.z);
}

// one thread per pixel; n is the number of pixels
__global__ void ImageAdd(const float3* a, const float3* b, float3* out, int n)
{
    int i = (blockIdx.x*blockDim.x) + threadIdx.x;
    if (i < n)    // 1954*512 threads is slightly more than 1,000,000
        out[i] = a[i] + b[i];
}

ImageAdd<<<1954,512>>>(a,b,c,1000000);
[/codebox]

The best way to answer your question is to test both versions and see which one is faster.
You could also try float4 instead, because there is a hardware instruction for reading 4 aligned floats, while I believe there is none for 3.

On the other hand, in some more advanced algorithms where you need to load pixel data into shared memory, I would advise against using float4, as it is likely to cause bank conflicts!

If either method works for your problem, then I would give preference to the float version, rather than the float3. As was mentioned, there is no 96 bit load instruction, so the compiler has to split a float3 read into a 64 bit and 32 bit read instruction. On the GTX 200 series cards this wastes some bandwidth, but on the older (and the mobile) GPUs, the memory controller completely uncoalesces the memory transactions. This creates a large performance loss. There are workarounds, but if you can deal with the floats individually, then that is easier.
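For a plain element-wise addition, dealing with the floats individually can be sketched as follows. This is only an illustration, assuming the images are already stored as device arrays of `float3`; `blocksFor`, `addFlat`, and `addImages` are hypothetical names, and the reinterpretation relies on `float3` being three tightly packed floats (12 bytes).

```cuda
#include <cuda_runtime.h>

// Number of blocks needed to cover n elements (rounds up).
inline int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Adds two float arrays element by element; n is the number of floats.
__global__ void addFlat(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // the last block may be only partly used
        out[i] = a[i] + b[i];
}

// Hypothetical wrapper: d_a, d_b and d_out are device pointers
// to numPixels float3 values each.
void addImages(const float3 *d_a, const float3 *d_b, float3 *d_out, int numPixels)
{
    const int n = 3 * numPixels;  // treat each float3 as three separate floats
    const int threads = 512;
    addFlat<<<blocksFor(n, threads), threads>>>(
        reinterpret_cast<const float *>(d_a),
        reinterpret_cast<const float *>(d_b),
        reinterpret_cast<float *>(d_out), n);
}
```

Consecutive threads then read consecutive 32-bit words, which is the access pattern that coalesces. For a million pixels, `blocksFor(3000000, 512)` gives the 5860 blocks used earlier in the thread.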

Is there any specific advantage or disadvantage to using float3 instead of 3 floats? Or, more generally, what do the built-in vector types offer, other than that you can access their components via .x/y…

There is only an advantage (some might say a requirement) if you need a thread to access all three components (x,y,z). For example, if each float3 represents a 3D vector in space, and your kernel is performing some kind of rotation operation, you need to load all three components of the float3 into the same thread to do the math.

One might also consider using float4 with the last variable unused. That makes for a bigger memory footprint but is coalesced and may actually work faster.
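The padded float4 layout could look roughly like this. It is a sketch, not a tuned implementation: `padPixel` and `addPadded` are hypothetical names, and the `w` component is simply set to zero and ignored.

```cuda
#include <cuda_runtime.h>

// Pack a three-component pixel into a float4; w is padding and never used.
__host__ __device__ inline float4 padPixel(float3 p)
{
    return make_float4(p.x, p.y, p.z, 0.0f);
}

// One thread per pixel; each thread reads one aligned 16-byte float4 per input.
__global__ void addPadded(const float4 *a, const float4 *b, float4 *out, int numPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels) {
        float4 p = a[i];
        float4 q = b[i];
        out[i] = make_float4(p.x + q.x, p.y + q.y, p.z + q.z, 0.0f);
    }
}
```

The cost is 16 bytes per pixel instead of 12, so roughly a third more memory and bandwidth. Whether that beats the float3 or flat float versions is something to measure on the target GPU.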