nvcc horribly breaking float4 read

A curious little example code:

void __global__ test(float2 *d, const float4 *s, float2 *t) {

  float4 v = s[threadIdx.x];

// *t = v.w;

  if (v.w)

    *d = (float2){v.x, v.y};

}

Compiling this with e.g. nvcc --ptx results in code that does 3 32 bit reads, and each of these are the also uncoalesced, giving simply horrible performance.

However, the compiler in principle seems to know how to do it right, because after commenting out the “t” assignment it does a single ld.global.v4.f32 !

Does anyone have an idea how to solve this? I tried making s volatile, too, but that results in:

error: class "float4" has no suitable copy constructor

I guess there is no way to make nvcc just a C compiler instead of getting in my way with C++ nonsense I certainly have no use for in a kernel?! (there is only the --host-compilation option that has no effect here).

There are threads on this subject. Trouble is I cannot find them.

As for your code, does it work when done like this?

void __global__ test(float2 *d, const float4 *s, float2 *t) {

 float4 v = s[threadIdx.x];

 *t = make_float2(v.z, v.w);

 if (v.w)

   *d = make_float2(v.x, v.y);

}

ups, I had a typo there, t is float not float2. Though that does not really make a difference, neither does it matter if I use make_float or (float2){}.
The only thing that makes a difference is whether v.w (works with v.z or both just as well though IIRC) is used only in the condition or if it is stored (in a way that cannot be optimized away).
Actually it is even “funnier” if you replace “if (v.w)” by “if (v.z+v.w)”, because then the compiler generates a ld.global.v4.f32, but only loads the y, z and w components and loads the x component later in an extra read…
Today is one of the days where I again think C is not the way GPUs are meant to be programmed. Even the most stupid programmer you can find sometimes seems like progress compared to what compilers sometimes do :(

True, NVIDIA should add the const volatile copy versions of the basic vector types to the standard headers[1] - the basic types work with volatile just right but any custom types (like CUDA’s float2, float4 and int4, for instance) do need them separately defined.

So, if you cannot change your float4 const * s into four separate arrays of floats (to save 25% bandwidth), then try this code. It worked for me at least:

__device__ float4 get_float4(float4 const volatile & v) {

	return make_float4(v.x, v.y, v.z, v.w);

}

void __global__ test(float2 *d, const float4 *s, float *t) {

    float4 v = get_float4(s[threadIdx.x]);

    if (v.w) *d = make_float2(v.x, v.y);

}

[1] For folks at NVIDIA: For volatile to work properly, every basic vector type should have both const and const volatile copy constructors and copy assignments, ie. <vector_types.h> should be changed as follows:

#define DECL __device__ __host__

struct __builtin_align__(16) float4 {

    float x, y, z, w;

#ifdef __cplusplus

        DECL float4(float4 const & r) : x(r.x), y(r.y), z(r.z), w(r.w) {}

        DECL float4(float4 const volatile & r) : x(r.x), y(r.y), z(r.z), w(r.w) {}

        DECL float4 & operator=(float4 const & r) { x=r.x; y=r.y; z=r.z; w=r.w; return *this; }

        DECL float4 & operator=(float4 const volatile & r) { x=r.x; y=r.y; z=r.z; w=r.w; return *this; }

#endif

};

/Pyry

[quote name=‘pyrtsa’ date=‘Jun 24 2008, 05:05 PM’]

So, if you cannot change your float4 const * s into four separate arrays of floats (to save 25% bandwidth), then try this code. It worked for me at least:

[quote]

Yes, it works fine like that, thanks, I just have/had my doubts how reliable volatile is. I have used tex1Dfetch before which gives similar results, but I do not like wasting texture cache for data that is used exactly once (I do have other reads that actually make good use of it).

I would have split it in separate arrays, but contrary to the simplified example I do use all elements of the float4 and even in all cases, so this would just need more read instructions and save nothing.

To be honest, I personally would find it even nicer if either the compiler left reads and writes to global memory completely alone or if there are functions to do these accesses that the compiler is guaranteed not to touch (volatile will probably do that, it is just a bit non-obvious to my tastes). Something that can change performance by an order of magnitude IMO is just not safe to leave to a compiler.

Good idea! It could be even something like this:

template <typename T> T load_global(T const * address);

Usage:

__global__ test(float * output, float4 const * input) {

    unsigned offset = blockIdx.x * blockDim.x + threadIdx.x;

    float4 x = load_global(input + offset);

    output[offset] = x.x + x.y + x.z + x.w;

}

As a matter of fact, the above can already be implemented with the aid of volatile. But what NVIDIA could do, is to not only implement that in the CUDA library but also implement something like these complementing functions for shared memory (for help in problems like this):

float    load_shared(float const * address);

int      load_shared(int const * address);

unsigned load_shared(unsigned const * address);

void store_shared(float * address, float value);

void store_shared(int * address, int value);

void store_shared(unsigned * address, unsigned value);

They all map well to their corresponding PTX instructions (ld.shared, st.shared etc) without confusion.

PS. I think you can rest assured that volatile will work as expected - it’s even documented in the PTX ISA since version 1.1.