Weird error and weird solution (I found)!

I have a Tesla C1060 on a Fedora 10 machine, with CUDA 2.1. To debug my code (I cannot use cuda-gdb because I hit an error caused by a bug already known to the CUDA developers), I store variables in global memory and read them back on the host.

In my code I call the following function "foo", which works correctly in emulation mode but not on the device. Trying to debug it, I started storing local variables into global memory to get them back to the host. Very quickly, to my surprise, I realized that storing the local variables in global memory affects the outcome of the function. The error is caused by the wrong evaluation of the expression "((q_off - n + i)&0x03) == 0x03": when "query_temp" is supposed to get the value of "Query_AA2" or "Query_AA3", it gets the value of "Query_AA1" instead (see code below). The "Sequence_XX" variables get the correct values.

I couldn't find the cause of the error, but the "hack" I found around it is to store the expression "((q_off - n + i)&0x03) == 0x03" somewhere in global memory, and then everything is fine. So I created a "d_Dummy" memory space where each thread stores that expression; this memory space is not used in any other way (it is not read back by the host or anything else).
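For reference, this is roughly how d_Dummy is set up on the host side (just a sketch: NUM_BLOCKS and THREADS_PER_BLOCK stand in for my actual launch configuration):

[codebox]// Sketch of the host-side setup for d_Dummy: one int per thread.
// NUM_BLOCKS and THREADS_PER_BLOCK are placeholders for my real launch configuration.
int* d_Dummy = 0;
size_t dummy_bytes = NUM_BLOCKS * THREADS_PER_BLOCK * sizeof(int);
cudaMalloc((void**)&d_Dummy, dummy_bytes);
// d_Dummy is then passed down to the kernel that calls foo();
// it is only ever written by the device, never read back.[/codebox]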

Any ideas what really causes my problem, and why this "dummy" store of the expression "((q_off - n + i)&0x03) == 0x03" makes the function work properly?

The "hack" suggests to me that the compiler performs some kind of optimization on the evaluation of "((q_off - n + i)&0x03) == 0x03", and when I introduce the store this optimization is not performed (I tried compiling with -O0 and it didn't help).

[codebox]__device__ int foo(const char* matrix, const int* subject, const int* query,
                   int s_off, int q_off, int dropoff, int maxscore,
                   int* left_disp, int d_Hits, const int* d_RepeatedQuery,
                   int* d_Dummy)
{
    int n, best_i;
    int score = maxscore;

    n = MIN(s_off, q_off);
    best_i = n + 1;

    int Query_AAI = 0, Sequence_AAI = 0;
    char Sequence_AA0 = 0, Sequence_AA1 = 0, Sequence_AA2 = 0, Sequence_AA3 = 0;
    char Query_AA0 = 0, Query_AA1 = 0, Query_AA2 = 0, Query_AA3 = 0;
    char query_temp = 0, sequence_temp = 0;

    int d_Dummy_Index = blockIdx.x*blockDim.x + threadIdx.x;

    for (int i = n; i >= 0; i--) {
        // Unpack the four residues of the packed subject word.
        Sequence_AAI = subject[ (s_off - n + i) >> 2 ];
        Sequence_AA0 = (Sequence_AAI & 0x000000FF) >>  0;
        Sequence_AA1 = (Sequence_AAI & 0x0000FF00) >>  8;
        Sequence_AA2 = (Sequence_AAI & 0x00FF0000) >> 16;
        Sequence_AA3 = (Sequence_AAI & 0xFF000000) >> 24;

        // Unpack the four residues of the packed query word.
        Query_AAI = query[ (q_off - n + i) >> 2 ];
        Query_AA0 = (Query_AAI & 0x000000FF) >>  0;
        Query_AA1 = (Query_AAI & 0x0000FF00) >>  8;
        Query_AA2 = (Query_AAI & 0x00FF0000) >> 16;
        Query_AA3 = (Query_AAI & 0xFF000000) >> 24;

        // Select the residue addressed by the low two bits of the offset.
        if      ( 0x0 == ((q_off - n + i) & 0x3) ) query_temp = Query_AA0;
        else if ( 0x1 == ((q_off - n + i) & 0x3) ) query_temp = Query_AA1;
        else if ( 0x2 == ((q_off - n + i) & 0x3) ) query_temp = Query_AA2;
        else if ( 0x3 == ((q_off - n + i) & 0x3) ) query_temp = Query_AA3;

        if      ( 0x0 == ((s_off - n + i) & 0x3) ) sequence_temp = Sequence_AA0;
        else if ( 0x1 == ((s_off - n + i) & 0x3) ) sequence_temp = Sequence_AA1;
        else if ( 0x2 == ((s_off - n + i) & 0x3) ) sequence_temp = Sequence_AA2;
        else if ( 0x3 == ((s_off - n + i) & 0x3) ) sequence_temp = Sequence_AA3;

        score += matrix[ query_temp*ALPHABET_SIZE + sequence_temp ];

        if (score > maxscore) {
            maxscore = score;
            best_i = i;
        }

        // This is a dummy store which gets rid of the bug of Query_AA2 and
        // Query_AA3 getting the value of Query_AA1. From trial and error I
        // found out that if I store "((q_off - n + i)&0x03) == 0x03" somewhere
        // in memory, Query_AA2 and Query_AA3 take the right value!! I don't know why.
        d_Dummy[ d_Dummy_Index ] = ((q_off - n + i) & 0x03) == 0x03;

        if ((maxscore - score) >= dropoff)
            break;
    }

    *left_disp = n - best_i + 1;

    return maxscore;
}

[/codebox]

Ok, this is a complete wild-goose chase, but can you try declaring Query_AAI and Sequence_AAI as unsigned integers and converting the subject and query arrays to unsigned as well:
so that

  • int Query_AAI = 0, Sequence_AAI = 0;
    =>
  • unsigned int Query_AAI = 0, Sequence_AAI = 0;

and

const int* subject, const int* query => const unsigned int* subject, const unsigned int* query

This is because I see a potential problem with the expression

Query_AA3 = (Query_AAI & 0xFF000000) >> 24;

since if Query_AAI is a signed integer, it is possible that (Query_AAI & 0xFF000000) has its most-significant bit set, making it a negative number, and the result of >> on a negative integer is implementation-defined!! That means it can give different results in the emulator and on the device, because the C standard does not specify whether >> on a signed value is a logical shift (do not preserve the sign bit) or an arithmetic shift (replicate the sign bit).
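A tiny host-side sketch of what I mean (not from your code, just an illustration of the shift behaviour):

[codebox]#include <stdio.h>

int main(void)
{
    /* Conversion of 0x80FF0000 to int is implementation-defined; on typical
       targets it simply gives the negative two's-complement value. */
    int          s = (int)0x80FF0000u;  /* sign bit set */
    unsigned int u = 0x80FF0000u;

    /* Most compilers do an arithmetic shift (sign bit replicated) for the
       signed value and a logical shift (zero fill) for the unsigned one;
       the signed case is implementation-defined. */
    printf("signed   >> 16: 0x%08X\n", (unsigned int)(s >> 16)); /* often 0xFFFF80FF */
    printf("unsigned >> 16: 0x%08X\n", u >> 16);                 /* always 0x000080FF */
    return 0;
}[/codebox]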

Also, it seems that the chars are signed (or maybe leaving the signedness of char unspecified makes it implementation-defined as well, which could cause more problems - i.e. you can lose your sign bit in (Query_AAI & 0x00FF0000) >> 16, and casting this always-positive number to a signed char can never produce a negative value, so instead of getting numbers in [-128, 127] you can only get [0, 127], for example...).

Normally, C compilers perform an arithmetic shift on signed integers and a logical shift on unsigned integers, so:
Always use unsigned integers when you are storing values in a bitfield! :)

Can you convert all the values stored in bitfields to unsigned values and rerun the program?

But keep in mind that this is only based on a hunch I had…

Oops, forgot to mention also:

  • char Sequence_AA0 = 0, Sequence_AA1 = 0, Sequence_AA2 = 0, Sequence_AA3 = 0;
  • char Query_AA0 = 0, Query_AA1 = 0, Query_AA2 = 0, Query_AA3 = 0;
  • char query_temp = 0, sequence_temp = 0;

=>

  • unsigned char Sequence_AA0 = 0, Sequence_AA1 = 0, Sequence_AA2 = 0, Sequence_AA3 = 0;
  • unsigned char Query_AA0 = 0, Query_AA1 = 0, Query_AA2 = 0, Query_AA3 = 0;
  • unsigned char query_temp = 0, sequence_temp = 0;
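Put together, the unpacking part would look something like this (just a sketch with my own little helper, not your actual kernel):

[codebox]/* Sketch only: unpack the four bytes of a packed word using unsigned types
   everywhere, as suggested above. */
__device__ void unpack_bytes(unsigned int word, unsigned char out[4])
{
    /* With unsigned operands, >> is guaranteed to be a logical shift, and the
       casts to unsigned char never have to represent a negative value. */
    out[0] = (unsigned char)((word & 0x000000FFu) >>  0);
    out[1] = (unsigned char)((word & 0x0000FF00u) >>  8);
    out[2] = (unsigned char)((word & 0x00FF0000u) >> 16);
    out[3] = (unsigned char)((word & 0xFF000000u) >> 24);
}[/codebox]

Inside foo you could call this on Query_AAI and Sequence_AAI and then index the result with (q_off - n + i) & 0x3 (or keep your if/else chains, same thing).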