errors in modulo incorrect results when % is used for a ciculair buffer

zehn_nl · July 20, 2011, 6:28pm

I have a Quadro FX 1700 with Toolkit 2.1.1635 and SDK 2.10 whichs runs on Ubuntu 8.04 64 bit

for which i made an erossion filter. This filter uses one thread per vertical line and works just fine. But it prodcues incorrect results if introduce a ciculair buffer with modulo. In my test case(white image, black border) it results in black dots in the output image where the whole image should be completly white.

__global__ void cudaErodeLineLocal(unsigned char* input, unsigned char* output, unsigned int width, unsigned int height)

{

	unsigned int x = IMUL(blockIdx.x, (blockDim.x - 2)) + threadIdx.x;

	//unsigned int y = IMUL(blockIdx.y, blockDim.y) + threadIdx.y;

	__shared__ unsigned char localMem[162*3];

	

	//prefetch

	localMem[blockDim.x*0+threadIdx.x] = input[0 * width + x];

	localMem[blockDim.x*1+threadIdx.x] = input[1 * width + x];

	

	for(unsigned int y = 1; y < height-1; y++  )

	{

		localMem[blockDim.x*((y+1)%3)+threadIdx.x] = input[(y+1)*width + x]; //next line

		

		if ((threadIdx.x == 0) || (threadIdx.x == 161))

		{

			if ( (x == 0) )

			{

				output[y*width + x] = 0;

			}

		}

		else

		{

		  output[y*width + x] = ((

			   localMem[blockDim.x * ((y-1)%3) + threadIdx.x-1] + localMem[blockDim.x * ((y-1)%3) + threadIdx.x] + localMem[blockDim.x * ((y-1)%3) + threadIdx.x+1] 

			 + localMem[blockDim.x * ((y  )%3) + threadIdx.x-1] + localMem[blockDim.x * ((y  )%3) + threadIdx.x] + localMem[blockDim.x * ((y  )%3) + threadIdx.x+1]

			 + localMem[blockDim.x * ((y+1)%3) + threadIdx.x-1] + localMem[blockDim.x * ((y+1)%3) + threadIdx.x] + localMem[blockDim.x * ((y+1)%3) + threadIdx.x+1])  == 2295 )? 255 : 0; 

		}

	}

}

placeing the localMem array in global memory doesn’t solve the problem.

The output varies if the modulo operator is changed, i’ve inlcuded some examples of %3 %32 %64

if the modulo operator is bigger as the image than is the output correct aswell

External Media

tera · July 20, 2011, 9:17pm

You need to synchronize the threads in the block to make sure the new data in localMem is already available:

__global__ void cudaErodeLineLocal(unsigned char* input, unsigned char* output, unsigned int width, unsigned int height)

{

        unsigned int x = IMUL(blockIdx.x, (blockDim.x - 2)) + threadIdx.x;

        //unsigned int y = IMUL(blockIdx.y, blockDim.y) + threadIdx.y;

        __shared__ unsigned char localMem[162*3];

//prefetch

        localMem[blockDim.x*0+threadIdx.x] = input[0 * width + x];

        localMem[blockDim.x*1+threadIdx.x] = input[1 * width + x];

for(unsigned int y = 1; y < height-1; y++  )

        {

localMem[blockDim.x*((y+1)%3)+threadIdx.x] = input[(y+1)*width + x]; //next line

                _syncthreads();

if ((threadIdx.x == 0) || (threadIdx.x == 161))

                {

                        if ( (x == 0) )

                        {

                                output[y*width + x] = 0;

                        }

                }

                else

                {

                  output[y*width + x] = ((

                           localMem[blockDim.x * ((y-1)%3) + threadIdx.x-1] + localMem[blockDim.x * ((y-1)%3) + threadIdx.x] + localMem[blockDim.x * ((y-1)%3) + threadIdx.x+1] 

                         + localMem[blockDim.x * ((y  )%3) + threadIdx.x-1] + localMem[blockDim.x * ((y  )%3) + threadIdx.x] + localMem[blockDim.x * ((y  )%3) + threadIdx.x+1]

                         + localMem[blockDim.x * ((y+1)%3) + threadIdx.x-1] + localMem[blockDim.x * ((y+1)%3) + threadIdx.x] + localMem[blockDim.x * ((y+1)%3) + threadIdx.x+1])  == 2295 )? 255 : 0; 

                }

                _syncthreads();

        }

}

Apart from that, modulo is an expensive operation and is best avoided. I don’t know if the compiler is smart enough to eliminate the modulo if you insert a [font=“Courier New”]#pragma unroll 3[/font] before the loop.

njuffa · July 20, 2011, 9:51pm

32-bit integer modulo is not necessarily a very expensive operation. The compiler includes optimizations for 32-bit division and modulo with a constant divisor. Depending on the value of the constant divisor and the signedness of the operands this results in anywhere from 1 to about 8 machine instructions, of memory serves. cuobjdump can be used to check what gets generated for a particular case.

On sm_2x platforms, i.e. Fermi-class GPUs, even 32-bit integer modulo with a variable divisor is not all that expensive, about 17 instructions of inline code, as I recall (again, cuobjdump will show exactly what is being generated for a particular case). In other words, on Fermi the relative cost of 32-bit modulo compared to say, 32-bit integer add, is comparable to what one would encounter on a CPU. Therefore I would not recommend that programmers go out of their way to avoid 32-bit modulo if they don’t do this in their corresponding CPU code.

If it turns out that some code is dominated by the cost of modulo operations it’s of course worth thinking about alternatives, but that’s true for all platforms I am familiar with (including x86, PowerPC, SPARC, ARM), and not particular to GPUs.

[Later:]

I set up a small test app which I compiled with the CUDA 4.0 toolchain. I count 17 instructions for 32-bit unsigned integer modulo with variable divisor, 20 instructions for the 32-bit signed integer modulo with variable divisor.

tera · July 20, 2011, 10:38pm

Oh, I didn’t want to imply that Nvidia’s modulo implementation is inefficient in any way (it’s not). It’s just that with CUDA I’m always in full optimization mode (if we don’t care about speed, why run it on the GPU at all?), and when I see a handful of machine instructions that is easily avoided, I’d like to point it out:

__global__ void cudaErodeLineLocal(unsigned char* input, unsigned char* output, unsigned int width, unsigned int height)

{

    unsigned int x = IMUL(blockIdx.x, (blockDim.x - 2)) + threadIdx.x;

    //unsigned int y = IMUL(blockIdx.y, blockDim.y) + threadIdx.y;

    __shared__ unsigned char localMem[162*3];

//prefetch

    localMem[blockDim.x*0+threadIdx.x] = input[0 * width + x];

    localMem[blockDim.x*1+threadIdx.x] = input[1 * width + x];

// outer loop: keeps (y3 % 3) invariant:

    for (unsigned int y3 = 1; y3 < height-1; y3 += 3)

    {

        // inner loop: unroll to allow evaluation of (y1 % 3) at compile time:

#pragma unroll

        for (unsigned int y1 = 0; y1 < 3; y1++)

        {

            unsigned int y = y3+y1;

if (y < height-1)

            {

localMem[blockDim.x*((y1+2)%3)+threadIdx.x] = input[(y+1)*width + x]; //next line

                __syncthreads();

if ((threadIdx.x == 0) || (threadIdx.x == 161))

                {

                    if ( (x == 0) )

                    {

                        output[y*width + x] = 0;

                    }

                }

                else

                {

                    output[y*width + x] = ((

                             localMem[blockDim.x * ((y1  )%3) + threadIdx.x-1] + localMem[blockDim.x * ((y1  )%3) + threadIdx.x] + localMem[blockDim.x * ((y1  )%3) + threadIdx.x+1]

                           + localMem[blockDim.x * ((y1+1)%3) + threadIdx.x-1] + localMem[blockDim.x * ((y1+1)%3) + threadIdx.x] + localMem[blockDim.x * ((y1+1)%3) + threadIdx.x+1]

                           + localMem[blockDim.x * ((y1+2)%3) + threadIdx.x-1] + localMem[blockDim.x * ((y1+2)%3) + threadIdx.x] + localMem[blockDim.x * ((y1+2)%3) + threadIdx.x+1])  == 2295 )? 255 : 0;

                }

                __syncthreads();

            }

        }

    }

}

Of course it’s everybody’s own call whether the speedup is worth the increase in code complexity.

zehn_nl · July 21, 2011, 7:43am

Thanks njuffa and tera for your reply,

__syncthreads();

solved the problem.

modulo 3 is replaced with several instructions (as njuffa wrote)
modulo 4 is recognized by the compiler and changed to a single and instruction.

i noticed the same when i did a bitwise and of the 9 inputs of the erosion filter, each bitwise-and on a unsigned char resulted in 3 operations. therefore Iâ€™ve changed it into an add and one compare.

Topic		Replies	Views
CUDA & modulos CUDA Programming and Performance	1	1669	February 28, 2010
Handling of uint and the modulo operator CUDA Programming and Performance	4	7221	March 25, 2010
error in modulo operation CUDA Programming and Performance	12	16258	September 20, 2009
Possible bug with unsigned 64 bit int modulo CUDA Programming and Performance	8	9566	July 14, 2009
Inst_fp_32 and inst_fp_64 metrics CUDA Programming and Performance	9	1730	April 7, 2018
Using Modulus in Cuda Can't find syntax for modulus in Cuda CUDA Programming and Performance	5	7623	July 5, 2011
Modulo Cost CUDA Programming and Performance	3	2280	April 3, 2009
Odd code exits kernel CUDA Programming and Performance	10	4242	July 16, 2008
Need help writing a CUDA kernel for image shifting CUDA Programming and Performance	16	1692	June 16, 2022
Compiler bug? CUDA Programming and Performance	2	1230	July 24, 2009

errors in modulo incorrect results when % is used for a ciculair buffer

Related topics