float3 support details: which operations does float3 support?


  1. Question: does float3 support addition etc.?

float3 is, among others, mentioned as a "built-in vector type". Is it possible to write something like


int main()
{
    float3 px = make_float3(1.0f, 2.0f, 3.0f);
    float3 px2 = make_float3(1.0f, 2.0f, 3.0f);
    float3 px3 = px + px2;
}


When I do so, I get the following error report:

float3> nvcc float3.cu

float3.cu(53): error: no operator "+" matches these operands

        operand types are: float3 + float3

1 error detected in the compilation of "/tmp/tmpxft_000014bf_00000000-4_float3.cpp1.ii".

Do you have to include something, did I make a mistake, or is it simply not possible?


If Question 1 is answered with yes, and float3 supports some basic operations, where can I get details, for example a whole list of them and how they work exactly?


If Question 1 is answered with yes, does float3 + float3 differ in speed from a plain per-component kernel, where vecAdd is something like


__global__ void vecAdd(float* a, float* b, float* out)
{
    int i = threadIdx.x;
    out[i] = a[i] + b[i];
}


Thanks to everybody taking his/her precious time to answer these rookie questions!


CUDA 2.2

Linux SUSE 11.1

GPU: GTX 275

Intel E6600, 2 GB RAM

Ad 1. There are no operators defined for the float#/int# types; however, you should be able to implement them yourself.

Ad 3. You should try to write bigger kernels. vecAdd<<<1,3>>>(a,b,c) would introduce so much overhead for the launch and for passing parameters between device and host that it would probably be much, much slower!
For simple vector operations you can use SIMD instructions on the CPU instead.

nvcc supports the operator overloading syntax from C++, so you can write functions like this:

__device__ float3 operator+(const float3 &a, const float3 &b) {
  return make_float3(a.x+b.x, a.y+b.y, a.z+b.z);
}


Hi again,

thanks for the quick answers. So let's say you have to work with a million pixels, and each pixel consists of 3 values. Do you get better performance when you handle them in one big float vector like this:

[codebox]__global__ void vecAdd(float* a, float* b, float* res)
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    res[i] = a[i] + b[i];
}



or in a float3 vector, with code something like this (not sure if this code works, but I think you get the idea):

[codebox]__device__ float3 operator+(const float3 &a, const float3 &b)
{
    return make_float3(a.x+b.x, a.y+b.y, a.z+b.z);
}

__global__ void ImageAdd(const float3* a, const float3* b, float3* out)
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    out[i] = a[i] + b[i];
}




Best way to answer your question: do some testing and check which one is better.
You could also check out using float4 instead, because there is a hardware instruction for reading 4 aligned floats, while I believe there is none for 3.

On the other hand, in some more advanced algorithms where you need to load pixel data into shared memory, I would advise against using float4, as this is likely to cause bank conflicts!

If either method works for your problem, then I would give preference to the float version, rather than the float3. As was mentioned, there is no 96 bit load instruction, so the compiler has to split a float3 read into a 64 bit and 32 bit read instruction. On the GTX 200 series cards this wastes some bandwidth, but on the older (and the mobile) GPUs, the memory controller completely uncoalesces the memory transactions. This creates a large performance loss. There are workarounds, but if you can deal with the floats individually, then that is easier.

Is there any specific advantage/disadvantage to using float3 instead of 3 floats? Or, more generally, what do the built-in vectors offer except that you can access their components via .x/.y/.z?

There is only an advantage (some might say a requirement) if you need a thread to access all three components (x,y,z). For example, if each float3 represents a 3D vector in space, and your kernel is performing some kind of rotation operation, you need to load all three components of the float3 into the same thread to do the math.

One might also consider using float4 with the last component unused. That makes for a bigger memory footprint, but it is coalesced and may actually work faster.