Optimization to LD.64 missing? Back-to-back LD instructions not coalesced automatically

Is there a good reason why the compiler doesn’t convert these back-to-back float loads into an LD.64 instruction?

From cuobjdump:

/*0020*/     /*0x00081c8580000000*/     LD R32, [R0];
/*0028*/     /*0x1007dc8580000000*/     LD R31, [R0+0x4];

Thanks.

Good question. Is the content of R0 guaranteed to be sufficiently aligned? Are you using any instructions that prevent R31 and R32 from being interchanged (texture instructions come to mind)?

tera, thanks a lot for that link! It’s at least good that the issue is known, but it apparently has been around for several months now. :(

My test code is just a trivial copy. It's compiled with:

/usr/local/cuda/bin/nvcc -gencode arch=compute_20,code="sm_20,compute_20" test.cu

and nvcc version:


nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Thu_Apr__5_02:05:07_PDT_2012
Cuda compilation tools, release 4.2, V0.2.1221


Should I file a bug report?

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHECK_ERROR( call) do {                                         \
    cudaError err = (call);                                                  \
    if( cudaSuccess != err) {                                                \
        fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",        \
                __FILE__, __LINE__, cudaGetErrorString( err) );              \
        exit(EXIT_FAILURE);                                                  \
    } } while (0)

//#define float double  // <-- uncomment to get LD.64 instructions

__device__ float src[1000];
__device__ float dst[1000];

__global__ void cpy()
{
    float x[32];
    for(int i = 0; i < 32; i++) x[i] = src[i + threadIdx.x * 32];
    for(int i = 0; i < 32; i++) dst[i + threadIdx.x * 32] = x[i];
}

int main()
{
    cpy<<<1, 1>>>();
    CUDA_CHECK_ERROR(cudaGetLastError());
    CUDA_CHECK_ERROR(cudaThreadSynchronize());
    return 0;
}

If you use float2 (instead of float) then LD.64 instructions are used, but if you use

typedef struct {float x, y;} f2t;

then LD.64 is NOT used! That is wrong! There should be no difference between float2 and f2t. So it looks like a hack: to fix the bug that tera linked to, someone apparently added a special case that ONLY covers built-in types (like float2) and absolutely nothing else. :(

In fact, like the original post suggests, even ordinary back-to-back float copies should be coalesced into LD.64, it seems.
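
In the meantime, widening the accesses by hand seems to work around it (a sketch, assuming src and dst keep 8-byte alignment, which should hold since the accesses start at the beginning of the __device__ arrays):

// Hand-vectorized variant of the copy kernel above: reinterpreting the
// float arrays as float2 hands the compiler the 8-byte alignment
// guarantee it needs to emit LD.64/ST.64.
__global__ void cpy64()
{
    float2 x[16];
    const float2 *s = reinterpret_cast<const float2 *>(src);
    float2       *d = reinterpret_cast<float2 *>(dst);
    for(int i = 0; i < 16; i++) x[i] = s[i + threadIdx.x * 16];
    for(int i = 0; i < 16; i++) d[i + threadIdx.x * 16] = x[i];
}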

Data on GPUs needs to be accessed using the natural alignment of the data, that is, words need to be aligned on a word boundary, double words need to be aligned on a double-word boundary, and so on. Unaligned accesses lead to undefined behavior.

In general, a struct with two floats guarantees 4-byte alignment, since the size of a float is 4 bytes. A float2, however, has guaranteed 8-byte alignment. This then allows the compiler to safely generate the wider (64-bit) load instruction for a float2, while the struct of two floats needs to be accessed using two 32-bit loads, unless the compiler can prove the required alignment for the wider access by other means (it sometimes can do so when spilling registers, for example).
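
To illustrate the difference, here is a quick sketch (not from the original posts; note that alignof and static_assert need a C++11-capable compiler, newer than the CUDA 4.2 toolchain used above):

#include <cuda_runtime.h>  // provides the built-in float2 type

typedef struct {float x, y;} f2t;

// The plain struct only inherits the 4-byte alignment of its float
// members, while the built-in float2 carries an explicit 8-byte
// alignment guarantee, enough for a single 64-bit load.
static_assert(alignof(f2t)    == 4, "struct of two floats: 4-byte alignment");
static_assert(alignof(float2) == 8, "built-in float2: 8-byte alignment");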

Programmers can force alignment of data with attributes, and as far as I know this is how the float2 type is implemented. See the following section in the CUDA C Programming Guide: 5.3.2.1.1 Size and Alignment Requirement.

Nvidia / njuffa, I stand corrected. Everything seems to work just great, exactly as you say. However, I did not find __attribute__((aligned(...))) in the guide. Is it documented somewhere?

Many thanks.

Please refresh your browser :-) __attribute__((aligned)) is how one would do this with gcc; in CUDA there is __align__. Within 10 minutes of posting I corrected my forum post to remove the erroneous portion and point at the relevant section in the Programming Guide. Please see the edited post above.
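
For instance, the declaration from that section would be spelled like this (a minimal sketch):

// CUDA's portable alignment specifier: __align__(8) maps to the right
// mechanism on every supported host compiler, unlike the gcc-specific
// __attribute__((aligned(8))).
struct __align__(8) f2t { float x, y; };

With this definition the f2t copies should compile to LD.64/ST.64 just like float2 does.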

Okay, sorry I missed that. But… __attribute__((aligned)) does apparently work! Here is the code:

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHECK_ERROR( call) do {                                         \
    cudaError err = (call);                                                  \
    if( cudaSuccess != err) {                                                \
        fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",        \
                __FILE__, __LINE__, cudaGetErrorString( err) );              \
        exit(EXIT_FAILURE);                                                  \
    } } while (0)

struct _f2t {float x, y;} __attribute__((aligned(8)));
typedef struct _f2t f2t;

#define TYPE f2t

__device__ TYPE src[1000];
__device__ TYPE dst[1000];

__global__ void cpy()
{
    TYPE x[2];
    for(int i = 0; i < 2; i++) x[i] = src[i + threadIdx.x * 2];
    for(int i = 0; i < 2; i++) dst[i + threadIdx.x * 2] = x[i];
}

int main()
{
    cpy<<<1, 1>>>();
    CUDA_CHECK_ERROR(cudaGetLastError());
    CUDA_CHECK_ERROR(cudaThreadSynchronize());
    return 0;
}

EDIT: well, I assume the cross-platform way to do it is the one the CUDA guide describes.

Correct, the CUDA __align__ attribute works cross-platform, so one would want to use that instead of any platform-specific solution.

Which header file do I need to include for the compiler to “see” the CUDA float2 definition?

cuda.h. Aren’t you already using it?