nvcc horribly breaking float4 read

Reimar · June 24, 2008, 12:30pm

A curious little example code:

void __global__ test(float2 *d, const float4 *s, float2 *t) {

  float4 v = s[threadIdx.x];

// *t = v.w;

  if (v.w)

    *d = (float2){v.x, v.y};

}

Compiling this with e.g. nvcc --ptx results in code that does 3 32 bit reads, and each of these are the also uncoalesced, giving simply horrible performance.

However, the compiler in principle seems to know how to do it right, because after commenting out the “t” assignment it does a single ld.global.v4.f32 !

Does anyone have an idea how to solve this? I tried making s volatile, too, but that results in:

error: class "float4" has no suitable copy constructor

I guess there is no way to make nvcc just a C compiler instead of getting in my way with C++ nonsense I certainly have no use for in a kernel?! (there is only the --host-compilation option that has no effect here).

E.D_Riedijk · June 24, 2008, 1:16pm

There are threads on this subject. Trouble is I cannot find them.

As for your code, does it work when done like this?

void __global__ test(float2 *d, const float4 *s, float2 *t) {

 float4 v = s[threadIdx.x];

 *t = make_float2(v.z, v.w);

 if (v.w)

   *d = make_float2(v.x, v.y);

}

Reimar · June 24, 2008, 1:38pm

ups, I had a typo there, t is float not float2. Though that does not really make a difference, neither does it matter if I use make_float or (float2){}.
The only thing that makes a difference is whether v.w (works with v.z or both just as well though IIRC) is used only in the condition or if it is stored (in a way that cannot be optimized away).
Actually it is even “funnier” if you replace “if (v.w)” by “if (v.z+v.w)”, because then the compiler generates a ld.global.v4.f32, but only loads the y, z and w components and loads the x component later in an extra read…
Today is one of the days where I again think C is not the way GPUs are meant to be programmed. Even the most stupid programmer you can find sometimes seems like progress compared to what compilers sometimes do :(

pyrtsa · June 24, 2008, 4:05pm

True, NVIDIA should add the const volatile copy versions of the basic vector types to the standard headers[1] - the basic types work with volatile just right but any custom types (like CUDA’s float2, float4 and int4, for instance) do need them separately defined.

So, if you cannot change your float4 const * s into four separate arrays of floats (to save 25% bandwidth), then try this code. It worked for me at least:

__device__ float4 get_float4(float4 const volatile & v) {

	return make_float4(v.x, v.y, v.z, v.w);

}

void __global__ test(float2 *d, const float4 *s, float *t) {

    float4 v = get_float4(s[threadIdx.x]);

    if (v.w) *d = make_float2(v.x, v.y);

}

[1] For folks at NVIDIA: For volatile to work properly, every basic vector type should have both const and const volatile copy constructors and copy assignments, ie. <vector_types.h> should be changed as follows:

#define DECL __device__ __host__

struct __builtin_align__(16) float4 {

    float x, y, z, w;

#ifdef __cplusplus

        DECL float4(float4 const & r) : x(r.x), y(r.y), z(r.z), w(r.w) {}

        DECL float4(float4 const volatile & r) : x(r.x), y(r.y), z(r.z), w(r.w) {}

        DECL float4 & operator=(float4 const & r) { x=r.x; y=r.y; z=r.z; w=r.w; return *this; }

        DECL float4 & operator=(float4 const volatile & r) { x=r.x; y=r.y; z=r.z; w=r.w; return *this; }

#endif

};

/Pyry

Reimar · June 25, 2008, 6:40am

[quote name=‘pyrtsa’ date=‘Jun 24 2008, 05:05 PM’]

So, if you cannot change your float4 const * s into four separate arrays of floats (to save 25% bandwidth), then try this code. It worked for me at least:

[quote]

Yes, it works fine like that, thanks, I just have/had my doubts how reliable volatile is. I have used tex1Dfetch before which gives similar results, but I do not like wasting texture cache for data that is used exactly once (I do have other reads that actually make good use of it).

I would have split it in separate arrays, but contrary to the simplified example I do use all elements of the float4 and even in all cases, so this would just need more read instructions and save nothing.

To be honest, I personally would find it even nicer if either the compiler left reads and writes to global memory completely alone or if there are functions to do these accesses that the compiler is guaranteed not to touch (volatile will probably do that, it is just a bit non-obvious to my tastes). Something that can change performance by an order of magnitude IMO is just not safe to leave to a compiler.

pyrtsa · June 25, 2008, 11:05am

Good idea! It could be even something like this:

template <typename T> T load_global(T const * address);

Usage:

__global__ test(float * output, float4 const * input) {

    unsigned offset = blockIdx.x * blockDim.x + threadIdx.x;

    float4 x = load_global(input + offset);

    output[offset] = x.x + x.y + x.z + x.w;

}

As a matter of fact, the above can already be implemented with the aid of volatile. But what NVIDIA could do, is to not only implement that in the CUDA library but also implement something like these complementing functions for shared memory (for help in problems like this):

float    load_shared(float const * address);

int      load_shared(int const * address);

unsigned load_shared(unsigned const * address);

void store_shared(float * address, float value);

void store_shared(int * address, int value);

void store_shared(unsigned * address, unsigned value);

They all map well to their corresponding PTX instructions (ld.shared, st.shared etc) without confusion.

PS. I think you can rest assured that volatile will work as expected - it’s even documented in the PTX ISA since version 1.1.

Topic		Replies	Views
Coalesced vs non-coalesced in reduction example Why float4-reads are not coalesced? CUDA Programming and Performance	10	4061	October 15, 2008
Coalescing issue, presumably due to the CUDA Optimizer CUDA Programming and Performance	18	3156	December 9, 2009
Reading from float4 corrupts memory CUDA Programming and Performance cuda	18	47	October 3, 2024
pointer to shared memory compiler problems CUDA Programming and Performance	19	14534	June 7, 2008
Difference in Performance CUDA Programming and Performance	13	9738	August 20, 2008
BUG? nvcc compiler unnecessary splits 8 bytes into 2 4 byte ones CUDA Programming and Performance	5	2098	May 25, 2009
Improve performance using volatile CUDA Programming and Performance	8	7031	July 15, 2011
Boost Functional compile error CUDA Programming and Performance	8	3505	February 13, 2011
Float2 Coalescing CUDA Programming and Performance	12	10672	December 9, 2010
What causes ptx compiler to demote shared variable? CUDA Programming and Performance	2	1229	August 17, 2019

nvcc horribly breaking float4 read

Related topics