CUDA 4.1 RC2 Compiler Changes / "Broken" Code ? [updated]
  • Hi,

    I have run into several problems with the new CUDA compiler:

    The compiler seems to need more explicit casts.


    Example:

    uint64_t tmp = 0;
    for(uint32_t j=0; j < 64; j++)
        tmp |= (( cipherbits[j] & ( (uint64_t)1<<(keynr) ) )>>(keynr) )<<((uint64_t)j);

    This worked fine with the old compilers. All variables used are of type uint64_t or are cast to uint64_t.
    But now I have to use the following version, otherwise the code delivers unexpected results:

    uint64_t tmp = 0;
    for(uint32_t j=0; j < 64; j++)
        tmp |= (uint64_t)((uint64_t)( cipherbits[j] & (uint64_t)( (uint64_t)1<<(keynr) ) )>>(keynr) )<<((uint64_t)j);

    The root cause seems to lie in another part of the source, but I have not been able to locate it yet.
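For anyone who wants to reproduce the loop outside the project, here is a minimal host-side sketch of the same bit gathering. It is not the project's code: the function names and the `HOST_DEVICE` macro are made up for illustration, and it assumes `cipherbits` points to 64 `uint64_t` values and `keynr` is below 64. It places the original mask-then-shift form next to a shift-then-mask form that should give the same result with fewer casts.

```cpp
#include <stdint.h>

// Hypothetical helper macro: expands to CUDA qualifiers under nvcc and to
// nothing under a plain host compiler, so the same functions build in both.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Form used in the post: isolate bit `keynr` with a 64-bit mask, shift it
// down to bit 0, then shift it up to bit position j.
HOST_DEVICE inline uint64_t gather_masked(const uint64_t* cipherbits, uint32_t keynr) {
    uint64_t tmp = 0;
    for (uint32_t j = 0; j < 64; j++) {
        tmp |= ((cipherbits[j] & (uint64_t(1) << keynr)) >> keynr) << j;
    }
    return tmp;
}

// Alternative form: shift bit `keynr` down first, then mask with a 64-bit 1.
// Every intermediate value is already uint64_t, so no extra casts are needed.
HOST_DEVICE inline uint64_t gather_shifted(const uint64_t* cipherbits, uint32_t keynr) {
    uint64_t tmp = 0;
    for (uint32_t j = 0; j < 64; j++) {
        tmp |= ((cipherbits[j] >> keynr) & uint64_t(1)) << j;
    }
    return tmp;
}
```

Running both functions on the host and comparing them with the kernel output for the same input may help narrow down whether the loop itself or some other part of the kernel is affected.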

    The compiler seems broken in at least one case


    After looking at the PTX code generated by the compiler, I could not work out what the new nvcc is trying to do there. The function described in the problem below seems to be the one producing the unexpected results.
    The PTX source for this function has about 5x more lines than with nvcc 4.0.
    Here are some details from the PTX files (nvcc 4.0 first, then nvcc 4.1):

    .reg .u32 %r<10>;
    .reg .u64 %rd<11060>;


    .reg .s32 %r<901>;
    .reg .s64 %rl<10245>;

    Why does nvcc 4.1 need 90x more 32-bit registers than the old one? And why are the registers signed? We need unsigned; that seems to be the problem.
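On the signed-versus-unsigned question, here is a small host-side sketch of where signedness could matter for the operators in the loop above. It is not taken from the project and the helper names are hypothetical. `&` and `|` depend only on the bit pattern, while `>>` fills the vacated high bits differently for signed operands. The sketch assumes two's-complement integers and a compiler that implements signed `>>` as an arithmetic shift (implementation-defined before C++20, but the usual behaviour). Whether the `.s32`/`.s64` declarations in the 4.1 output actually change the shifts would have to be checked against the instruction suffixes in the PTX (`shr.u64` versus `shr.s64`).

```cpp
#include <stdint.h>
#include <string.h>

// Reinterpret a 64-bit pattern as signed without changing any bits.
inline int64_t as_signed(uint64_t u) {
    int64_t s;
    memcpy(&s, &u, sizeof s);
    return s;
}

// Reinterpret a signed 64-bit value as its raw bit pattern.
inline uint64_t as_unsigned(int64_t s) {
    uint64_t u;
    memcpy(&u, &s, sizeof u);
    return u;
}

// Logical right shift: vacated high bits are filled with zeros.
inline uint64_t shift_right_unsigned(uint64_t bits, uint32_t n) {
    return bits >> n;
}

// Right shift on the signed view of the same bits: on common compilers the
// vacated high bits are filled with copies of the sign bit.
inline uint64_t shift_right_signed(uint64_t bits, uint32_t n) {
    return as_unsigned(as_signed(bits) >> n);
}
```

The two shifts only disagree when the top bit of the value is set, which in the loop above would be the case for `keynr` equal to 63.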

    Also, the bar.sync calls seem to be messed up. There is one __syncthreads() call just before the final operation (a simple addition in specific threads). But with the new compiler, the last bar.sync is followed by more than 10,000 lines of code without any bra. So I cannot see where the code syncs the threads before the final operation.


    The same code needs more registers


    With the old compilers, the example kernel used 39 registers per thread. The kernel does heavy bitwise operations on uint64_t values.
    Now the same code uses all 63 registers and spills registers (nvcc 4.0 output first, then nvcc 4.1):

    ptxas info : Compiling entry function '_Z6phase1P9device' for 'sm_21'
    ptxas info : Function properties for _Z6phase1P9device
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas info : Used 39 registers, 36 bytes cmem[0], 12 bytes cmem[14]


    ptxas info : Compiling entry function '_Z6phase1P9device' for 'sm_21'
    ptxas info : Function properties for _Z6phase1P9device
    5768 bytes stack frame, 5768 bytes spill stores, 7192 bytes spill loads
    ptxas info : Used 63 registers, 40 bytes cmem[0], 40 bytes cmem[14]


    We have invested hours in resolving the register bottleneck in this function. Will there be a fix? Or is there a known workaround for the high register usage (compiler options, for example)?

    The new compiler is not usable for us at all and breaks our project. If there is no fix, this CUDA project will be dead after 6 months of development! Please help!



    Thomas
  • I have the same problem, and nobody from CUDA has given us a solution.