Possible compiler bug?

tim-timmy · March 20, 2008, 4:08pm

I’m seeing strange behavior when I try to speed up my program a little.

The program is a simple linear convolution on an unsigned char image (based on the convolutionSeparable example from the SDK). To coalesce the memory accesses, each thread loads and computes 4 pixels (instead of one in the provided example).

Here is the piece of code that performs the computation:

    int coalescedresult = 0; // this int aggregates 4 uchar values so they can be written to global memory together
    unsigned char *concat = (unsigned char *)&coalescedresult; // use the int as an array of 4 uchars

    for (int i = 0; i < 4; i++) { // process the 4 pixels
        short sum = 0; // perform the convolution (horizontally here)
        unsigned char *temp = &tile[smemPos - FILTER_WIDTH + i]; // address of the (n-2)th pixel
        sum = *temp; // the kernel is 1 4 6 4 1
        temp++; // move to the next pixel, ...
        sum += *temp * 4;
        temp++;
        sum += *temp * 6;
        temp++;
        sum += *temp * 4;
        temp++;
        sum += *temp;
        concat[i] = (unsigned char)sum; // write the result for pixel i to byte i of coalescedresult
    }

    ((int *)result)[(rowStart + writePos) / 4] = coalescedresult; // write the whole int to global memory

The problem is that this code works fine in EmuDebug mode, but on the GPU it only computes the first pixel of each block of 4 pixels.

After a day of investigation, I finally got it working simply by adding #pragma unroll 1 before my loop. I guess the compiler tried to unroll the loop and failed, but I can’t understand why.

Any ideas?
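For comparison, here is a minimal sketch of the same per-thread computation that builds the 32-bit word with shifts instead of writing through a pointer to a local int. It is not the poster's code and it does not explain the unrolling behavior reported above; it only removes the pointer cast, which is one thing worth ruling out. The helper name `convolve4Packed` and the `HOST_DEVICE` macro are hypothetical, and `src` is assumed to point at the pixel two positions to the left of the first output pixel. Byte i lands in bits 8*i, which matches `concat[i]` on a little-endian target.

```cpp
// Lets the same function build with a plain C++ compiler or with nvcc.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper (not from the original post): applies the 1 4 6 4 1
// kernel to 4 consecutive output pixels and packs the 4 results into one
// 32-bit word. Reads src[0] through src[7].
HOST_DEVICE unsigned int convolve4Packed(const unsigned char* src)
{
    unsigned int packed = 0;
    for (int i = 0; i < 4; i++) {
        // Window of 5 pixels centered on src[i + 2].
        int sum = src[i]
                + 4 * src[i + 1]
                + 6 * src[i + 2]
                + 4 * src[i + 3]
                + src[i + 4];
        // Keep only the low 8 bits, like the (unsigned char)sum cast in the
        // post (the sum is not divided by 16 there either).
        packed |= (unsigned int)(sum & 0xFF) << (8 * i);
    }
    return packed;
}
```

With eight input pixels all equal to 1, each sum is 16, so the function returns 0x10101010. In a kernel the call would look like `((unsigned int*)result)[(rowStart + writePos) / 4] = convolve4Packed(&tile[smemPos - FILTER_WIDTH]);`, assuming the same variables as in the post.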

chris22 · March 28, 2008, 7:03pm

Generally, if a piece of code works in EmuDebug mode but not on the GPU, timing issues are to blame. You said it only computes the first pixel in each block. Are the other pixels garbage, or are they 0 or some initialized value? My first thought would be read/write ordering hazards between multiple warps or multiple thread blocks.
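One way to answer the "garbage, 0, or initialized value" question is to fill the output buffer with a known sentinel byte before the kernel runs and then classify every output byte on the host against a CPU reference. The sketch below is only an illustration of that idea; `MismatchReport` and `classifyOutput` are hypothetical names, and the sentinel value is up to you.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side diagnostic (names are illustrative, not from the thread).
struct MismatchReport {
    std::size_t correct = 0;    // byte matches the CPU reference
    std::size_t untouched = 0;  // byte still holds the sentinel: never written
    std::size_t zero = 0;       // byte is 0 although the reference is not
    std::size_t other = 0;      // anything else: garbage or a possible race
};

// Compares the bytes copied back from the GPU with a CPU reference.
// A byte that equals the reference counts as correct even if it also
// happens to equal the sentinel.
MismatchReport classifyOutput(const std::vector<unsigned char>& gpu,
                              const std::vector<unsigned char>& reference,
                              unsigned char sentinel)
{
    MismatchReport report;
    const std::size_t n =
        gpu.size() < reference.size() ? gpu.size() : reference.size();
    for (std::size_t i = 0; i < n; i++) {
        if (gpu[i] == reference[i]) {
            ++report.correct;
        } else if (gpu[i] == sentinel) {
            ++report.untouched;
        } else if (gpu[i] == 0) {
            ++report.zero;
        } else {
            ++report.other;
        }
    }
    return report;
}
```

The device buffer can be pre-filled with `cudaMemset(d_result, 0xCD, bytes)` before the launch (`d_result` and `bytes` are placeholders). A high `untouched` count suggests the writes never happened, while a high `other` count points more toward an ordering hazard.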

luigi · April 17, 2008, 1:34pm

Hi,
I have a very similar problem, so I’m posting here instead of opening a new thread. I have code (see attachment) that copies a 64-byte vector of unsigned char from local to global memory, one element at a time, in a very simple “for” loop.
This code works correctly if I use “#pragma unroll 1” on the 64-iteration “for” loop, but fails during execution if I let nvcc unroll it. I also discovered that I can modify the loop and let the compiler unroll it up to 59 iterations and it works, but if I compile the code with 60 iterations it fails (I actually need only the first 6 or 7 characters, but I would like to understand what is happening). The strange thing is that the “for” loop is inside a condition, so only one thread should execute it and there should be no thread conflicts. I tried looking at the PTX code but couldn’t see what is going wrong. I only see that a few more registers are used when I let the compiler unroll the loop beyond 59 iterations.
I really suspect some strange behavior in CUDA…

To test the code:
$ nvcc cuda_md5.cu
$ printf toto|md5sum
$ ./a.out f71dbe52628a3f83a77ab494817525c6

If you let the compiler unroll the loop, you will find that for every hash the code exits too early and never copies the data back from local memory.

PS: the offending loop is at line 160 of the attached source file.
cuda_md5.txt (7.19 KB)
