Strange behaviour: execution fails, probably a compiler bug

I’ve run into a strange problem with CUDA. After reducing my code I ended up with a small sample that reproduces the error (see the attachment for the full listing).

Swapping the order of two statements leads in one case to successful execution and in the other to a silent failure (no error is reported).

  base_jp_jf = base[jp*nfreq+jf];
  ce.x = temp*cosf(cd.y);

  ce.x = temp*cosf(cd.y);
  base_jp_jf = base[jp*nfreq+jf];

In emulation mode both variants work.

Any suggestions?

CUDA SDK/Toolkit v 1.1

NVIDIA GeForce 8800 GTS Driver version 6.14.11.6909

(The problem also reproduces on CUDA v1.0 with the corresponding driver.)
smallest_code.zip (5.32 KB)

Can you generate .cubin files for both orders and post them here? You can do this by specifying the “-cubin” option when compiling with nvcc.
I can’t do that myself because your code depends on DirectX stuff…

EDIT:
What I’d like to check is whether changing the order of operations changes register usage considerably, because if it does (which is possible) then it’s clear why your kernel fails to launch.

Thanks for the advice, AndreiB!

The produced binaries do differ. The working variant requires 19 registers versus 20 for the failing one.

I set -maxrregcount to 19. Both variants now run correctly, so this sample is fine.
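To illustrate why a single extra register can matter, here is a minimal host-side sketch of the register budget. It assumes a compute capability 1.0 device with 8192 registers per multiprocessor and a limit of 512 threads per block, and it follows the rule that a block's thread count is rounded up to a whole number of warps. The function names and the block size of 400 threads are hypothetical; the actual launch configuration is only in the attachment, and real hardware may allocate registers with a coarser granularity.

```cpp
#include <cstdint>

// Assumption: sm_10 exposes 8192 32-bit registers per multiprocessor.
const std::uint32_t kRegistersPerSM = 8192;
const std::uint32_t kWarpSize = 32;
// Assumption: hardware limit on threads per block for this generation.
const std::uint32_t kMaxThreadsPerBlock = 512;

// Round a block size up to a whole number of warps.
inline std::uint32_t roundUpToWarp(std::uint32_t threads) {
    return (threads + kWarpSize - 1) / kWarpSize * kWarpSize;
}

// True if a single block with this register usage fits in the register file.
// (Hypothetical helper; regsPerThread is the "reg" value from the .cubin.)
inline bool blockFitsRegisterFile(std::uint32_t regsPerThread,
                                  std::uint32_t threadsPerBlock) {
    return regsPerThread * roundUpToWarp(threadsPerBlock) <= kRegistersPerSM;
}

// Largest warp-aligned block size that still fits; regsPerThread must be > 0.
inline std::uint32_t maxThreadsPerBlock(std::uint32_t regsPerThread) {
    std::uint32_t fit = kRegistersPerSM / regsPerThread / kWarpSize * kWarpSize;
    return fit < kMaxThreadsPerBlock ? fit : kMaxThreadsPerBlock;
}
```

Under these assumptions, maxThreadsPerBlock(19) gives 416 and maxThreadsPerBlock(20) gives 384, so a hypothetical block of 400 threads would fit at 19 registers per thread but not at 20.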

However, there is still a problem with my full code: I now get an “execution timed out” error (I suspect the cause is one of the reasons listed at this link).

PS: Does anyone know whether these values are valid for CUDA (is there any overflow?):

architecture {sm_10}
abiversion {0}
consts  {
    name = __cudart_i2opi_f
    segname = const
    segnum = 0
    offset = 0
    bytes = 24
    mem  {
        0x3c439041 0xdb629599 0xf534ddc0 0xfc2757d1 
        0x4e441529 0xa2f9836e 
    }
}
code  {
    name = _Z11profileMoveP6float2iPiS1_PffffffiS2_S2_S2_ffiiiS2_S2_S2_S2_S2_S1_iS2_S2_
    lmem = 288
    smem = 128
    reg = 20
    bar = 0
    bincode  {
        0xd001f001 0x60c00780 0xd0800211 0x00400780 
        ... total 13184 bytes of code
    }
    const  {
        segname = const
        segnum = 1
        offset = 0
        bytes = 160
        mem  {
            0x000003ff 0x00000001 0xffffffff 0000000000 
            0x358637bd 0x7e800000 0x7f800000 0x473ba700 
            0x80000000 0x00000018 0x00010000 0x0000001f 
            0x3fc90000 0x39fd8000 0x34a88000 0x2e85a309 
            0xb94ca1f9 0xbe2aaaa3 0x37ccf5ce 0x3d2aaaa5 
            0xbf000000 0x3f800000 0x00000002 0x42d20000 
            0xc2d20000 0x3f317200 0x35bfbe8e 0x00000058 
            0x00000009 0xbf52c7ea 0xc0d21907 0x41e6bd60 
            0x419d92c8 0x7fffffff 0x3fc90fdb 0xbfc90fdb 
            0x38d1b717 0x3e800000 0xb70477b3 0xb60477b3 
        }
    }
}
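
One way to sanity-check the words in the constant segment above is to reinterpret each as an IEEE-754 single-precision float. The sketch below is a plain host-side helper, not part of the CUDA toolkit; the names wordToFloat and classifyWord are hypothetical. Some words in the dump are probably integers or bit masks rather than floats, so a NaN result does not necessarily mean anything is wrong.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit word from the dump as an IEEE-754 float.
inline float wordToFloat(std::uint32_t word) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    float value;
    std::memcpy(&value, &word, sizeof(value));  // well-defined type punning
    return value;
}

// Coarse classification of a word when read as a float.
inline const char* classifyWord(std::uint32_t word) {
    const float value = wordToFloat(word);
    if (std::isnan(value)) return "NaN pattern (possibly an integer mask)";
    if (std::isinf(value)) return "infinity";
    return "finite";
}
```

Read this way, 0x3f800000 is 1.0, 0xbf000000 is -0.5, 0x42d20000 is 105.0, and 0x7f800000 is +infinity. These are compile-time constants, so an infinity in the table is not by itself evidence that the kernel overflows at run time.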

//update: code reduction