Compiling this with e.g. nvcc --ptx results in code that performs three 32-bit reads, and each of these is also uncoalesced, giving simply horrible performance.
However, the compiler in principle seems to know how to do it right, because after commenting out the “t” assignment it emits a single ld.global.v4.f32!
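For reference, a minimal sketch of the kind of kernel that triggers this, reconstructed from the description (the names s, t, v and the exact kernel shape are my assumptions, not the original code):

```cuda
// Hypothetical reconstruction: a float4 load that nvcc splits into
// separate 32-bit reads as soon as "t" is assigned and kept alive.
__global__ void kernel(float4 const *s, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 v = s[i];   // expected: one ld.global.v4.f32
    float t = v.x;     // commenting this out restores the vector load
    if (v.w)           // v.w used in the condition
        out[i] = t;    // t stored so it cannot be optimized away
}
```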
Does anyone have an idea how to solve this? I tried making s volatile, too, but that results in:
error: class "float4" has no suitable copy constructor
I guess there is no way to make nvcc behave as a plain C compiler instead of getting in my way with C++ nonsense I certainly have no use for in a kernel?! (There is only the --host-compilation option, which has no effect here.)
Oops, I had a typo there: t is float, not float2. That does not really make a difference, though; neither does it matter whether I use make_float2 or (float2){}.
The only thing that makes a difference is whether v.w (it works just as well with v.z, or both, IIRC) is used only in the condition or is also stored (in a way that cannot be optimized away).
Actually it is even “funnier”: if you replace “if (v.w)” with “if (v.z+v.w)”, the compiler generates a ld.global.v4.f32 but only uses the y, z and w components, fetching the x component later in an extra read…
Today is one of those days where I again think C is not the way GPUs are meant to be programmed. Even the most stupid programmer you can find sometimes seems like progress compared to what compilers sometimes do :(
True, NVIDIA should add const volatile copy versions of the basic vector types to the standard headers[1]. The basic scalar types work with volatile just fine, but any compound types (like CUDA’s float2, float4 and int4, for instance) need these defined separately.
So, if you cannot change your float4 const * s into four separate arrays of floats (to save 25% bandwidth), then try this code. It worked for me at least:
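A sketch of the volatile-based workaround (the helper name and exact form are my assumptions; it relies on float4 having the const volatile copy constructor discussed below):

```cuda
// Sketch: force the load through a volatile-qualified pointer so the
// compiler cannot split the 128-bit read into separate 32-bit reads.
__device__ float4 loadVolatile(float4 const *p)
{
    return *(float4 const volatile *)p;
}
```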
[1] For folks at NVIDIA: for volatile to work properly, every basic vector type should have both const and const volatile copy constructors and copy assignments, i.e. <vector_types.h> should be changed as follows:
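A minimal sketch of what such additions could look like for float4 (this is my illustration, not NVIDIA’s actual header; analogous members would be needed for the other vector types):

```cuda
// Sketch of the proposed additions inside struct float4 in
// <vector_types.h>: copy operations taking volatile-qualified sources.
struct __builtin_align__(16) float4
{
    float x, y, z, w;

    float4() {}
    float4(float4 const &o) : x(o.x), y(o.y), z(o.z), w(o.w) {}
    // Added: copy construction from a volatile source.
    float4(float4 const volatile &o) : x(o.x), y(o.y), z(o.z), w(o.w) {}

    float4 &operator=(float4 const &o)
    { x = o.x; y = o.y; z = o.z; w = o.w; return *this; }
    // Added: copy assignment from a volatile source.
    float4 &operator=(float4 const volatile &o)
    { x = o.x; y = o.y; z = o.z; w = o.w; return *this; }
};
```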
Yes, it works fine like that, thanks; I just have/had my doubts about how reliable volatile is. I have used tex1Dfetch before, which gives similar results, but I do not like wasting texture cache on data that is used exactly once (I do have other reads that actually make good use of it).
I would have split it into separate arrays, but contrary to the simplified example I do use all elements of the float4, and in all cases, so that would just need more read instructions and save nothing.
To be honest, I would personally find it even nicer if the compiler either left reads and writes to global memory completely alone, or if there were functions for these accesses that the compiler is guaranteed not to touch (volatile will probably do that; it is just a bit non-obvious to my taste). Something that can change performance by an order of magnitude is IMO just not safe to leave to a compiler.
As a matter of fact, the above can already be implemented with the aid of volatile. But what NVIDIA could do is not only implement that in the CUDA library, but also provide something like these complementing functions for shared memory (to help with problems like this):
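A sketch of what such complementing functions might look like (the names loadShared/storeShared and their shape are my invention, built on the same volatile trick):

```cuda
// Hypothetical shared-memory access helpers that the compiler would be
// guaranteed not to split, merge or reorder; implemented here via
// component-wise volatile accesses.
__device__ float4 loadShared(float4 const volatile *p)
{
    float4 r;
    r.x = p->x; r.y = p->y; r.z = p->z; r.w = p->w;
    return r;
}

__device__ void storeShared(float4 volatile *p, float4 v)
{
    p->x = v.x; p->y = v.y; p->z = v.z; p->w = v.w;
}
```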