I see that there are many vector data types declared in CUDA, like “float4”, “int4”, etc. But I don’t see (pardon my oversight, if any) any vector operations that can be applied to them.
For example, I tried this but could not get it to compile for the device:
extern __shared__ float4 prices[];
…
prices[i] = prices[i] * 2;
The compiler said “float4 * int” is not possible. So I tried this:
prices[i] = prices[i]*(2,2,2,2)
but that did not work either.
Does PTX support vector instructions? How can I take advantage of them?
The only advantage I see now is that I can access more global memory per warp, resulting in big speedups (I got 20X, as Mark pointed out in an earlier post).
Look in cutil_math.h for overloaded operators to use in code compiled with nvcc. Or you can create your own. nvcc supports some C++ features like operator overloading and templates.
I’m not sure if PTX has vector operations, but I don’t think so.
Interesting! So what purpose do these vector data types serve? Are there other NVIDIA cards that support vector processing? Are these data types meant for those cards?
So if all threads in a warp access “float4” elements instead of “float”, and assuming all such float4 elements are at consecutive addresses, is it possible to achieve a good performance gain using “float4”?
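Something like this is what I have in mind (a sketch only; the kernel name and parameters are made up for illustration):

```cuda
// Each thread loads one float4, i.e. a single 128-bit access.
// If thread k in a warp reads element (base + k), the warp's
// accesses hit consecutive 16-byte slots and can coalesce.
__global__ void scale4(float4 *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = data[i];                   // one 128-bit load
        v.x *= s; v.y *= s; v.z *= s; v.w *= s;
        data[i] = v;                          // one 128-bit store
    }
}
```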
Nice to know that. I tried a similar thing with float4 and did not get correct results. It resulted in an unusually long execution time and wrong results. I think I must have messed something up.
According to the manual, memory coalescing works best for 32-bit accesses. 64-bit and 128-bit coalesced accesses are not as effective, though they are still faster than non-coalesced accesses. (CUDA 1.1 manual, page 65 of 143, printed page 51, last paragraph.)
Coalesced 128-bit reads are only about a factor of 2 faster than non-coalesced reads. But that should still be much faster than processing 1 character at a time, so your speedup looks justifiable. Or maybe you should try reading 4 characters per thread and see what kind of speed you get. I think that would be an interesting thing to look at. If you ever do it, can you kindly post the results here? Thanks.
That is only 40.7 GiB/s; you can nearly double your kernel’s performance. Sarnath is correct that 128-bit coalesced reads are slow. See my testing in this post http://forums.nvidia.com/index.php?showtop…ndpost&p=290441 to confirm the manual’s claims.
Are you sure that this
in_pixel.x = g_idata [globalTid].x;
in_pixel.y = g_idata [globalTid].y;
in_pixel.z = g_idata [globalTid].z;
in_pixel.w = g_idata [globalTid].w;
results in a coalesced read? I, for one, wouldn’t trust the compiler to be that smart, and would write in_pixel = g_idata[globalTid]; instead.
Point 1) doesn’t matter anyway, given the testing in the post I mentioned. Create a simple 1D texture and bind g_idata to it. Then access in_pixel with
in_pixel = tex1Dfetch(tex, globalTid); and watch your performance nearly double.
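A minimal sketch of what that looks like, assuming g_idata holds n float4 elements (this uses the old texture-reference API from the CUDA versions discussed in this thread; the names tex and kernel are illustrative):

```cuda
// 1D texture reference over float4 elements, declared at file scope.
texture<float4, 1, cudaReadModeElementType> tex;

__global__ void kernel(float4 *g_odata, int n)
{
    int globalTid = blockIdx.x * blockDim.x + threadIdx.x;
    if (globalTid < n) {
        // Read through the texture cache instead of a plain global load.
        float4 in_pixel = tex1Dfetch(tex, globalTid);
        g_odata[globalTid] = in_pixel;
    }
}

// Host side, before launching the kernel:
//   cudaBindTexture(0, tex, g_idata, n * sizeof(float4));
```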
No problem, you’re welcome. I stumbled across the 128-bit coalesced read → tex1Dfetch read optimization a while back (it boosted my code’s overall performance by 6%), and it is unintuitive enough that I’ve tried to share it on the forums when these issues come up.
I understand that when performance is already as fast as you need, there is little reason to waste time optimizing. But lately I’ve been trying to squeeze every single percent of performance out of my code, so I can’t escape that mindset :) One particular unoptimized part of my code that was “fast enough” a few months ago, with only 1% overhead, now takes up almost 10%, so it actually is worth going back to optimize it.