Possible compiler bug?

I’m seeing some strange behavior when I try to optimize the speed of my program a little.

The program is a simple linear convolution on an unsigned char image (based on the convolutionSeparable example from the SDK). In order to coalesce the memory accesses, each thread has to load and compute 4 pixels (instead of one, as in the provided example).

Here is the piece of code that performs the computation:


int coalescedresult = 0;  // this int aggregates 4 uchar values so they can be written to global memory together
unsigned char *concat = (unsigned char *)&coalescedresult;  // use the int as an array of 4 uchars

for (int i = 0; i < 4; i++) {  // process the 4 pixels
    short sum = 0;  // perform the convolution (horizontally here)
    unsigned char *temp = &tile[smemPos - FILTER_WIDTH + i];  // address of the (n-2)nd pixel

    sum  = *temp;      // the kernel is 1 4 6 4 1
    temp++;            // move to the next pixel, ...
    sum += *temp * 4;
    temp++;
    sum += *temp * 6;
    temp++;
    sum += *temp * 4;
    temp++;
    sum += *temp;

    concat[i] = (unsigned char)sum;  // store the result for this pixel in its slot of coalescedresult
}

((int *)result)[(rowStart + writePos) / 4] = coalescedresult;  // write the whole int to global memory

So, the problem is that this code works very well in EmuDebug mode, but on the GPU it only computes the first pixel of each block of 4 pixels.

After a day of research, I finally got it working simply by adding a #pragma unroll 1 before the loop. I guess the compiler tried to unroll the loop and failed, but I can’t understand why.

Any ideas?

Generally, if a piece of code works in EmuDebug mode but not on the GPU, timing issues are to blame. You said it only computes the first pixel in each block. Are the other pixels garbage, zeros, or some initialized values? My first thought would be a read/write ordering hazard between multiple warps or multiple thread blocks.

I have a very similar problem, so I’ll write here instead of opening a new post. I have some code (see attachment) that copies a 64-byte vector of unsigned char from local to global memory, one element at a time, in a very simple “for” loop.
This code works correctly if I use “#pragma unroll 1” on the 64-iteration “for” loop, but fails during execution if I let nvcc unroll it. I also discovered that I can shorten the loop and let the compiler unroll it: up to 59 iterations it works correctly, but if I compile the code with 60 iterations it fails (I actually only need the first 6–7 characters, but I would like to understand what is happening). The strange thing is that the “for” loop is inside a condition, so only one thread should execute it and there should be no thread conflicts. I tried to look at the PTX code, but I couldn’t tell what was going wrong; I only see that slightly more registers are used when I let the compiler unroll the loop beyond the 59th iteration.
I really suspect some strange behavior of CUDA…

to test the code:
$ nvcc cuda_md5.cu
$ printf toto | md5sum
$ ./a.out f71dbe52628a3f83a77ab494817525c6

If you let the compiler unroll the loop, you will find that for every hash the code exits too early and never copies the data back from local memory.

PS: the incriminating loop is at line 160 of the attached source file.
cuda_md5.txt (7.19 KB)