How to stop compiler putting structs in local? And suboptimising your program

osiris1 · June 7, 2007, 3:21am

After trying everything i could think of in 0.9 I am left with the problem of how to prevent the compiler putting a float4 struct in local when the whole point of using it was to get some level of coalescing while sequentially scanning a large matrix transpose. Even with a very big maxrregcount the resulting cubin still has the offending auto float4 in lmem. Simple copy examples seem to work but as soon as you do some computation on all individual elements off it goes to lmem.

The resulting code for the read does a v4.f32 read (which has 1/3 coalescing on 8800, 100% on 8600) then immediately 4 stores to local each with a stride of 16 bytes so only 1/4 coalesced then every reference to the elements of the float4 does a read with a stride of 16 bytes which is only 1/4 coalesced. The cubin sure looks like all these instructions are still there.

Even tried the “register” keyword but that is probably deleted by cpp these days.

Thanks,
Eric

Also I wonder why everyone gets 1/2 performance reading float4s compared to floats?

ed: This gets even stranger - if I put a vector store back to the original location with the same data at the end of the block of my code that used the 4 components then the store is done as a vector store from the $f registers that were picked up individually from local in the above segment. So there is absolutely no reason to put the data out to local - does not save a single register but the compiler insists upon doing it???

wumpus · June 7, 2007, 12:48pm

In my experiences with CUDA I also found that using float4/float3/… types wasn’t such a good idea. Sometimes they even result in buggy behaviour, and as the G80 is a scalar processor they do not give you any gain. You could maybe fix this by using an array of floats instead of a structure.

Cyril_Zeller · June 8, 2007, 5:31pm

Please, file a bug on the registered developer website with some repro.
Thanks!

osiris1 · June 10, 2007, 3:18am

In the process of preparing the repro I found the exact trigger for my problem so I will document it here rather than through a bug report as it may help another user immediately.

With decl “float4* pf4” ALL cases of read and increment of pf4 work EXCEPT “pf4++" which triggers the problem ("++pf4” “pf4[0]; ++pf4;” etc all work OK).

Is there a WOPT flag to nvopencc to turn off local? (+ something to ptxas to stop it doing anything) - if not any plans?

Thanks,

Eric

ed: the repro was trivial:

__device__ float4       c[4];

__global__ void

kernel()

{

    float4*     pf4;

    float4      f4;

   pf4 = c;

    f4 = *pf4++;

    pf4[0] = f4;

}

Compiles to:

$LBB1_kernel:

        mov.u32         $r1, c;                 //

        mov.u32         $r2, f4$0;              //

        ld.global.v4.u32        {$r3,$r4,$r5,$r6}, [$r1+0];     //

        st.local.u32    [$r2+0], $r3;           //  id:30 f4$0+0x0

        st.local.u32    [$r2+4], $r4;           //  id:30 f4$0+0x0

        st.local.u32    [$r2+8], $r5;           //  id:30 f4$0+0x0

        st.local.u32    [$r2+12], $r6;          //  id:30 f4$0+0x0

        .loc    12      11      0

        ld.local.f32    $f1, [f4$0+0];          //  id:22 f4$0+0x0

        ld.local.f32    $f2, [f4$0+4];          //  id:24 f4$0+0x4

        ld.local.f32    $f3, [f4$0+8];          //  id:26 f4$0+0x8

        ld.local.f32    $f4, [f4$0+12];         //  id:28 f4$0+0xc

        st.global.v4.f32        [c+16], {$f1,$f2,$f3,$f4};      //

        exit;                           //

with a cubin

       name = kernel

        lmem = 16

        smem = 0

        reg = 5