errors in modulo incorrect results when % is used for a ciculair buffer

I have a Quadro FX 1700 with Toolkit 2.1.1635 and SDK 2.10 whichs runs on Ubuntu 8.04 64 bit

for which i made an erossion filter. This filter uses one thread per vertical line and works just fine. But it prodcues incorrect results if introduce a ciculair buffer with modulo. In my test case(white image, black border) it results in black dots in the output image where the whole image should be completly white.

__global__ void cudaErodeLineLocal(unsigned char* input, unsigned char* output, unsigned int width, unsigned int height)

{

	unsigned int x = IMUL(blockIdx.x, (blockDim.x - 2)) + threadIdx.x;

	//unsigned int y = IMUL(blockIdx.y, blockDim.y) + threadIdx.y;

	__shared__ unsigned char localMem[162*3];

	

	//prefetch

	localMem[blockDim.x*0+threadIdx.x] = input[0 * width + x];

	localMem[blockDim.x*1+threadIdx.x] = input[1 * width + x];

	

	for(unsigned int y = 1; y < height-1; y++  )

	{

		localMem[blockDim.x*((y+1)%3)+threadIdx.x] = input[(y+1)*width + x]; //next line

		

		if ((threadIdx.x == 0) || (threadIdx.x == 161))

		{

			if ( (x == 0) )

			{

				output[y*width + x] = 0;

			}

		}

		else

		{

		  output[y*width + x] = ((

			   localMem[blockDim.x * ((y-1)%3) + threadIdx.x-1] + localMem[blockDim.x * ((y-1)%3) + threadIdx.x] + localMem[blockDim.x * ((y-1)%3) + threadIdx.x+1] 

			 + localMem[blockDim.x * ((y  )%3) + threadIdx.x-1] + localMem[blockDim.x * ((y  )%3) + threadIdx.x] + localMem[blockDim.x * ((y  )%3) + threadIdx.x+1]

			 + localMem[blockDim.x * ((y+1)%3) + threadIdx.x-1] + localMem[blockDim.x * ((y+1)%3) + threadIdx.x] + localMem[blockDim.x * ((y+1)%3) + threadIdx.x+1])  == 2295 )? 255 : 0; 

		}

	}

}

placeing the localMem array in global memory doesn’t solve the problem.

The output varies if the modulo operator is changed, i’ve inlcuded some examples of %3 %32 %64

if the modulo operator is bigger as the image than is the output correct aswell

You need to synchronize the threads in the block to make sure the new data in localMem is already available:

__global__ void cudaErodeLineLocal(unsigned char* input, unsigned char* output, unsigned int width, unsigned int height)

{

        unsigned int x = IMUL(blockIdx.x, (blockDim.x - 2)) + threadIdx.x;

        //unsigned int y = IMUL(blockIdx.y, blockDim.y) + threadIdx.y;

        __shared__ unsigned char localMem[162*3];

//prefetch

        localMem[blockDim.x*0+threadIdx.x] = input[0 * width + x];

        localMem[blockDim.x*1+threadIdx.x] = input[1 * width + x];

for(unsigned int y = 1; y < height-1; y++  )

        {

localMem[blockDim.x*((y+1)%3)+threadIdx.x] = input[(y+1)*width + x]; //next line

                _syncthreads();

if ((threadIdx.x == 0) || (threadIdx.x == 161))

                {

                        if ( (x == 0) )

                        {

                                output[y*width + x] = 0;

                        }

                }

                else

                {

                  output[y*width + x] = ((

                           localMem[blockDim.x * ((y-1)%3) + threadIdx.x-1] + localMem[blockDim.x * ((y-1)%3) + threadIdx.x] + localMem[blockDim.x * ((y-1)%3) + threadIdx.x+1] 

                         + localMem[blockDim.x * ((y  )%3) + threadIdx.x-1] + localMem[blockDim.x * ((y  )%3) + threadIdx.x] + localMem[blockDim.x * ((y  )%3) + threadIdx.x+1]

                         + localMem[blockDim.x * ((y+1)%3) + threadIdx.x-1] + localMem[blockDim.x * ((y+1)%3) + threadIdx.x] + localMem[blockDim.x * ((y+1)%3) + threadIdx.x+1])  == 2295 )? 255 : 0; 

                }

                _syncthreads();

        }

}

Apart from that, modulo is an expensive operation and is best avoided. I don’t know if the compiler is smart enough to eliminate the modulo if you insert a [font=“Courier New”]#pragma unroll 3[/font] before the loop.

32-bit integer modulo is not necessarily a very expensive operation. The compiler includes optimizations for 32-bit division and modulo with a constant divisor. Depending on the value of the constant divisor and the signedness of the operands this results in anywhere from 1 to about 8 machine instructions, of memory serves. cuobjdump can be used to check what gets generated for a particular case.

On sm_2x platforms, i.e. Fermi-class GPUs, even 32-bit integer modulo with a variable divisor is not all that expensive, about 17 instructions of inline code, as I recall (again, cuobjdump will show exactly what is being generated for a particular case). In other words, on Fermi the relative cost of 32-bit modulo compared to say, 32-bit integer add, is comparable to what one would encounter on a CPU. Therefore I would not recommend that programmers go out of their way to avoid 32-bit modulo if they don’t do this in their corresponding CPU code.

If it turns out that some code is dominated by the cost of modulo operations it’s of course worth thinking about alternatives, but that’s true for all platforms I am familiar with (including x86, PowerPC, SPARC, ARM), and not particular to GPUs.

[Later:]

I set up a small test app which I compiled with the CUDA 4.0 toolchain. I count 17 instructions for 32-bit unsigned integer modulo with variable divisor, 20 instructions for the 32-bit signed integer modulo with variable divisor.

Oh, I didn’t want to imply that Nvidia’s modulo implementation is inefficient in any way (it’s not). It’s just that with CUDA I’m always in full optimization mode (if we don’t care about speed, why run it on the GPU at all?), and when I see a handful of machine instructions that is easily avoided, I’d like to point it out:

__global__ void cudaErodeLineLocal(unsigned char* input, unsigned char* output, unsigned int width, unsigned int height)

{

    unsigned int x = IMUL(blockIdx.x, (blockDim.x - 2)) + threadIdx.x;

    //unsigned int y = IMUL(blockIdx.y, blockDim.y) + threadIdx.y;

    __shared__ unsigned char localMem[162*3];

//prefetch

    localMem[blockDim.x*0+threadIdx.x] = input[0 * width + x];

    localMem[blockDim.x*1+threadIdx.x] = input[1 * width + x];

// outer loop: keeps (y3 % 3) invariant:

    for (unsigned int y3 = 1; y3 < height-1; y3 += 3)

    {

        // inner loop: unroll to allow evaluation of (y1 % 3) at compile time:

#pragma unroll

        for (unsigned int y1 = 0; y1 < 3; y1++)

        {

            unsigned int y = y3+y1;

if (y < height-1)

            {

localMem[blockDim.x*((y1+2)%3)+threadIdx.x] = input[(y+1)*width + x]; //next line

                __syncthreads();

if ((threadIdx.x == 0) || (threadIdx.x == 161))

                {

                    if ( (x == 0) )

                    {

                        output[y*width + x] = 0;

                    }

                }

                else

                {

                    output[y*width + x] = ((

                             localMem[blockDim.x * ((y1  )%3) + threadIdx.x-1] + localMem[blockDim.x * ((y1  )%3) + threadIdx.x] + localMem[blockDim.x * ((y1  )%3) + threadIdx.x+1]

                           + localMem[blockDim.x * ((y1+1)%3) + threadIdx.x-1] + localMem[blockDim.x * ((y1+1)%3) + threadIdx.x] + localMem[blockDim.x * ((y1+1)%3) + threadIdx.x+1]

                           + localMem[blockDim.x * ((y1+2)%3) + threadIdx.x-1] + localMem[blockDim.x * ((y1+2)%3) + threadIdx.x] + localMem[blockDim.x * ((y1+2)%3) + threadIdx.x+1])  == 2295 )? 255 : 0;

                }

                __syncthreads();

            }

        }

    }

}

Of course it’s everybody’s own call whether the speedup is worth the increase in code complexity.

Thanks njuffa and tera for your reply,

__syncthreads();

solved the problem.

modulo 3 is replaced with several instructions (as njuffa wrote)
modulo 4 is recognized by the compiler and changed to a single and instruction.

i noticed the same when i did a bitwise and of the 9 inputs of the erosion filter, each bitwise-and on a unsigned char resulted in 3 operations. therefore I’ve changed it into an add and one compare.