Unexpected unroll issue: "#pragma unroll" changes output for the better

I have a small kernel that gives the correct answer if the inner loop is preceded by:

#pragma unroll

and an incorrect result otherwise.

The correct result can also be obtained by placing a __syncthreads() call at the end of the loop. This suggests to me a timing problem between threads… but I’m not writing to shared memory, so I don’t know where such a timing problem would be. I do read data from shared memory (and threads access the same locations at times), but I have a __syncthreads() call after the loading and before the access of that data.

My question is: does anyone see a problem with my inner loop (shown below) that would cause the value written out to g_Result (in global memory) to be wrong? I am keeping 4 running sums in that loop. If I keep only two (sum0 and sum1) I don’t see the problem. Any ideas?

I have 256 threads in a block and 1024 blocks.
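In case it helps, the launch looks roughly like this (a sketch only: the 16x16 block and 32x32 grid shapes are just one way to get those totals, and myKernel is a stand-in name):

    dim3 block(16, 16);                  // 256 threads per block (assumed shape)
    dim3 grid(32, 32);                   // 1024 blocks (assumed shape)
    myKernel<<<grid, block>>>(g_Result /*, ... other arguments ... */);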

This is a dumbed down version of the code that still exhibits the problem…

        __shared__ unsigned char dataA[32*32];
        __shared__ unsigned char dataB[64*96];

        // {some bunch of code to load the data into shared memory}

        __syncthreads();

        unsigned char A;
        unsigned char B;
        int diff;
        int sum0, sum1, sum2, sum3;

        sum0 = 0;
        sum1 = 0;
        sum2 = 0;
        sum3 = 0;

        rowStartA = 0;
        rowStartB = IMUL(threadIdx.y,96) + IMUL(threadIdx.x,4) + 0;

#pragma unroll 1  // changing this to "#pragma unroll" fixes the problem!

        for(int x = 0; x < 2; x++){
                A = dataA[rowStartA + x];
                B = dataB[rowStartB + x];
                diff = (A - B);
                sum0 += IMUL(diff, diff);

                B = dataB[rowStartB + x + 1];
                diff = (A - B);
                sum1 += IMUL(diff, diff);

                B = dataB[rowStartB + x + 2];
                diff = (A - B);
                sum2 += IMUL(diff, diff);

                B = dataB[rowStartB + x + 3];
                diff = (A - B);
                sum3 += IMUL(diff, diff);

                // __syncthreads();  // putting this in fixes the problem
        }

        g_Result[ty*dataW + tx + 0] = (float)sum0;
        g_Result[ty*dataW + tx + 1] = (float)sum1;
        g_Result[ty*dataW + tx + 2] = (float)sum2;
        g_Result[ty*dataW + tx + 3] = (float)sum3;
}

Thanks for your help,

Troy

Hi,

Have you tried inserting __syncthreads() outside the loop (a couple of lines below where you have it commented) to see if that works? If the strange behavior persists, maybe it's due to the instructions that follow. It looks to me as though your threads are writing to the same memory locations four times (depending on your block structure; I could be wrong). I'm assuming g_Result is in global memory.

In the writes above, the address of statement 2 for tx = t is the same as the address of statement 1 for tx = t+1, and so on for the rest of the statements and addresses. So what gets written to global memory is sensitive to the order in which the warps execute, and hence the output can differ with and without __syncthreads(). Maybe you wanted to put 4*tx instead of tx?
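To make the overlap concrete, here's a quick host-side sketch (assuming, as above, that tx steps by 1 between neighbouring threads; dataW is just a placeholder width):

    #include <stdio.h>

    int main(void)
    {
        const int dataW = 128;                    /* placeholder row width */
        const int ty = 0;
        for (int tx = 0; tx < 2; tx++) {          /* two neighbouring threads */
            printf("tx=%d writes indices:", tx);
            for (int k = 0; k < 4; k++)           /* the four output statements */
                printf(" %d", ty * dataW + tx + k);
            printf("\n");
        }
        /* Prints: tx=0 -> 0 1 2 3 and tx=1 -> 1 2 3 4, so three of the four
           addresses collide and the result depends on warp execution order. */
        return 0;
    }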

Just a suggestion…

PS. Assumption: ty and tx are thread indices

Anjul,

That was an excellent suggestion, and a great catch on the code… You had no way of knowing that my tx variable is the thread index * 4. I'm trying to have each thread work on 4 adjacent pixels to help with shared memory bank conflicts. Here's the code that calculates tx and ty:

        const int bx = (blockDim.x * blockIdx.x) * 4;
        const int by = (blockDim.y * blockIdx.y);
        const int tx = bx + threadIdx.x * 4;
        const int ty = by + threadIdx.y;

I tried your suggestion and placed four __syncthreads() calls outside the loop, one before each write to global memory, but it had no effect. Could the problem have something to do with my thread stride of 4 pixels? That seems somewhat unusual compared to the SDK examples, which work on a single pixel per thread.
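For what it's worth, with tx scaled by 4 as above, a quick host-side check of the output columns (first block and first few threads only) gives disjoint ranges, so the writes to g_Result shouldn't overlap between threads:

    #include <stdio.h>

    int main(void)
    {
        const int bx = 0;                          /* first block, for brevity */
        for (int tidx = 0; tidx < 4; tidx++) {     /* first few threads in x   */
            const int tx = bx + tidx * 4;          /* same formula as above    */
            printf("threadIdx.x=%d writes columns %d..%d\n", tidx, tx, tx + 3);
        }
        /* Prints 0..3, 4..7, 8..11, 12..15: disjoint ranges, no overlap in
           g_Result between threads of the same row. */
        return 0;
    }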

Thank you very much for your effort,

Troy

If all you want is to avoid shared memory bank conflicts, you don't need to work in groups of 4 pixels. In fact, it's actually better to use one pixel per thread. Think of it as follows:

Your present situation:

    Access    Thread 1    Thread 2    Thread 3    Thread 4    Thread 5
    1             0           4           8          12          16
    2             1           5           9          13          17
    3             2           6          10          14          18
    4             3           7          11          15          19

Can you see how threads 1 and 5 have bank conflicts? Now if you do it the following way, you have no conflicts:

    Access    Thread 1    Thread 2    Thread 3    Thread 4    Thread 5   ...   Thread 15 (last of the half-warp, no conflicts)
    1             0           1           2           3           4      ...      15
    2            17          18          19          20          21      ...      31
    :
    :

Bank conflicts only occur on simultaneous accesses, not successive ones. Everything is tuned so that stride-1 accesses are the fastest (hence the SDK samples).
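Here's a minimal device-side sketch of the two patterns, using a hypothetical int array and assuming 16 banks of 32-bit words (launch it with, say, one block of 64 threads):

    __global__ void accessPatterns(const int *in, int *out)
    {
        __shared__ int smem[64];                 // hypothetical shared array
        smem[threadIdx.x] = in[threadIdx.x];     // stride-1 load: no conflicts
        __syncthreads();

        // Stride-4 read: within a half-warp, threads 0 and 4 hit indices 0 and 16,
        // which map to the same bank, so the access is serialized (4-way conflict).
        int strided = smem[(threadIdx.x * 4) % 64];

        // Stride-1 read: every thread of the half-warp touches a different bank.
        int unit = smem[threadIdx.x];

        out[threadIdx.x] = strided + unit;
    }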

Still can't figure out your problem with __syncthreads(), though.