CUDA performance: how can I improve my code?

Hello all, I’m new to this forum; it’s my first post :rolleyes:

I’m new to CUDA programming too, and I’ve become very interested in CUDA for image processing. My first project is a simple threshold filter on a 24-bit bitmap image.

My first .cu file is:

[codebox]#ifndef _THRESHOLD_KERNEL_H_
#define _THRESHOLD_KERNEL_H_

// 24-bit RGB
typedef struct
{
	unsigned char B;
	unsigned char G;
	unsigned char R;
} RGB24;

#define BLOCK_DIM 16

///////////////////////////////////////////////////////////////////////////////////////////////////
// Threshold filter with 24-bit input
///////////////////////////////////////////////////////////////////////////////////////////////////
// *idata		-> Input/Output bitmap pointer
// width		-> bitmap width
// height		-> bitmap height
// stride		-> bitmap row length in bytes (including padding)
// threshold	-> threshold value, normalized from 0 to 25500 (so we avoid using floats)
// back		-> new color for <= threshold values
// object		-> new color for > threshold values
///////////////////////////////////////////////////////////////////////////////////////////////////
extern "C" __global__ void thresholdGreyscale(unsigned char *idata, unsigned int width, unsigned int height, unsigned int stride, unsigned int threshold, RGB24 back, RGB24 object)
{
	// Calculating true X, Y
	unsigned int yIndex = (blockIdx.y * blockDim.y) + threadIdx.y;
	unsigned int xIndex = (blockIdx.x * blockDim.x) + threadIdx.x;

	// If X, Y are valid
	if ((xIndex < width) && (yIndex < height))
	{
		// Calculating byte array index
		int idx = (yIndex * stride) + (xIndex * 3);

		// Reading data (pointer)
		RGB24 *cor = (RGB24*)&idata[idx];

		// Verify threshold limit
		if ((cor->R * 30 + cor->G * 59 + cor->B * 11) > threshold)
		{
			// Write color for > threshold
			*cor = object;
		}
		else
		{
			// Write color for <= threshold
			*cor = back;
		}
	}
}

#endif // _THRESHOLD_KERNEL_H_[/codebox]

*The input is a byte array, not an RGB24 array, because in memory a 24-bit bitmap has a stride (the number of bytes per row: rows are padded to 4-byte boundaries, while pixels are 3 bytes each).

**The input buffer is also the output buffer (the filter runs in place).

This code works, but it is 2x slower than the CPU (Core2Quad, Win7 64-bit, GeForce 8400 GS, C# Visual Studio 2010 + CUDA.NET 3.0). I tested on a GTX 285 too, but it was only a few ms faster than the CPU. I tried using shared BLOCK_DIM x BLOCK_DIM blocks, but the speed stayed the same.

I know I’m doing something wrong, but don’t know what it is. Ideas?

Thanks,

Willian

AFAIK, if statements are slower than anything else, so one possibility is to replace the if statement with some maths. My suggestion is to replace the if with something like this:

[codebox]int c = (cor->R * 30 + cor->G * 59 + cor->B * 11) - (int)threshold; // cast: keep the subtraction signed

float fc = __int2float_rn(c);

// saturate() only works on floats. alpha = 0 if c <= 0 (color is at or
// below the threshold), and alpha = 1 if c >= 1 (color is above it),
// matching the kernel's strict "> threshold" test.
float alpha = saturate(fc);

int ialpha = __float2int_rn(alpha);

int one_minus_ialpha = 1 - ialpha;

cor->R = ialpha * object.R + one_minus_ialpha * back.R;

cor->G = ialpha * object.G + one_minus_ialpha * back.G;

cor->B = ialpha * object.B + one_minus_ialpha * back.B;[/codebox]

I haven’t tried this myself, but I don’t think that this will cause divergent warps or have any conditional statements to slow the gpu. I am not sure how slow the int to float and float to int conversions are going to be. I think you can also use CUDA textures to have the hardware automatically do these conversions. According to the programming guide, textures can have a pitch, which seems similar to your stride variable.

I hope this helps.

Not exactly true. If there’s only one instruction under an if statement, it should compile to just two native instructions (set predicate + predicated move, in this case), which won’t cause divergence (all threads in a warp execute both). “It should” doesn’t mean it always will, though.

Thanks wwa, I didn’t know that one. In this case though, would the assignment *cor = object be seen as one instruction or three, since RGB24 is a struct?

Most likely the image is too small for the GPU and the kernel launch overhead dominates. The code could also be optimized a bit, but that shouldn’t matter much.

Thanks, folks. I’ll try this next Monday, since I’m traveling and my notebook doesn’t have CUDA…

I’ll try using floats for the math instead of ifs, but I guess the textures will be more promising ;)

Lev is correct, this is likely dominated by transfer time (and launch overhead).

But still, you could make it faster, unfortunately by making it more complex.
In such a simple problem like this, computation is effectively free. You’re actually limited by memory access speed.
That limit might be the PCIe transfer of the image, or possibly by device memory speed.

What is the resolution of the image, and the actual run time? Divide the size of the image (in bytes) by the run time and get a GB/sec bandwidth.
If it’s on the order of 1GB/sec or more, you’re probably PCIe limited and that’s hard to optimize. (PCIe has a bandwidth of up to 6 GB/sec but you need to send both ways, and it’s unlikely you’re getting full utilization anyway.)

If you’re PCIe limited… well, your options aren’t great. There could be a factor of 2 or so if you play games like breaking the image up and using streams to compute one part while the next is transferring.

If you’re device memory speed limited (possible for low end cards like that 8400, probably unlikely for the GTX285) then you can optimize THAT a lot more.
The basic trick is to make your GPU memory accesses more efficient and coalesced. With only 3 bytes per thread (one 24-bit pixel), your reads and writes are not coalesced at all, even on the GTX285. The strategy to avoid this is to read a big block of perhaps 768 bytes from device memory into shared memory as one monolithic block, do your math and thresholding in that shared memory, then write the pixels back out as a monolithic block again.

I just tried out some “tricks”, but none of them made a difference…

[font=“Lucida Console”]

PROCESS…TIME…EXPLANATION

Nothing…109ms…Just a CopyHostToDevice and a CopyDeviceToHost

Threshold (if)…468ms…Threshold using ‘if’ and ‘normalized (x100) int’

Threshold (float)…468ms…Threshold using ‘floats’ only

Threshold (shared)…468ms…Threshold using ‘if’, ‘normalized (x100) int’ and Shared-Memory

CPU…208ms…Threshold using ‘if’ and ‘float’ at CPU, using direct memory-access

CPU…52ms…Threshold using ‘if’ and ‘normalized (x100) int’ at CPU, using direct memory-access

[/font]

All tests were run on a 3737x2450 24-bit image with BLOCK_DIM = 16, on a GeForce 8400 GS.

I don’t know how shared-memory would help here. Without shared-memory we have:

READ-GLOBAL → math → WRITE-GLOBAL

and with shared-memory we have:

READ-GLOBAL → SHARED → math → SHARED → WRITE-GLOBAL

So we still have one global READ and one global WRITE per pixel; a threshold filter passes over each pixel only once. I think shared-memory shines when there is more than one global read and/or write per pixel. I don’t know how texture-memory would improve things either, since it has a speed similar to shared-memory…

Codes:

HEAD:

[codebox]#ifndef THRESHOLD_KERNEL_H

#define THRESHOLD_KERNEL_H

// RGB 24bits structure

typedef struct

{

unsigned char B;

unsigned char G;

unsigned char R;

} RGB24;

#define BLOCK_DIM 16

///////////////////////////////////////////////////////////////////////////////////////////////////

// Threshold filter with 24bits input

///////////////////////////////////////////////////////////////////////////////////////////////////

// *idata → Input/Output bitmap pointer

// width → bitmap width

// height → bitmap height

// stride → bitmap binary width

// threshold → threshold value, normalized from 0 to 25500 (so we avoid using floats)

// back → new color for < threshold values

// object → new color for > threshold values

///////////////////////////////////////////////////////////////////////////////////////////////////

extern "C" __global__ void thresholdGreyscale(unsigned char *idata, unsigned int width, unsigned int height, unsigned int stride, unsigned int threshold, RGB24 back, RGB24 object)

{

...

}

#endif // THRESHOLD_KERNEL_H[/codebox]

IF:

[codebox]{

// Calculating true X, Y

unsigned int yIndex = (blockIdx.y * blockDim.y) + threadIdx.y;

unsigned int xIndex = (blockIdx.x * blockDim.x) + threadIdx.x;

// If X,Y are valid

if ((xIndex < width) && (yIndex < height))

{

    // Calculating byte array index

    int idx = (yIndex * stride) + (xIndex * 3);

// Reading data (pointer)

    RGB24 *cor = (RGB24*)&idata[idx];

// Verify threshold limit

    if ((cor->R * 30 + cor->G * 59 + cor->B * 11) > threshold)

    {

        // Write color for > threshold

        *cor = object;

    }

    else

    {

        // Write color for < threshold

        *cor = back;

    }

}

}[/codebox]

Float:

[codebox]{

// Calculating true X, Y

unsigned int yIndex = (blockIdx.y * blockDim.y) + threadIdx.y;

unsigned int xIndex = (blockIdx.x * blockDim.x) + threadIdx.x;

// If X,Y are valid

if ((xIndex < width) && (yIndex < height))

{

    // Calculating byte array index

    int idx = (yIndex * stride) + (xIndex * 3);

// Reading data (pointer)

    RGB24 *cor = (RGB24*)&idata[idx];



// Calculating diff from threshold

float fc = __int2float_rn((int)(cor->R * 30 + cor->G * 59 + cor->B * 11) - (int)threshold); // cast: keep the subtraction signed (unsigned would wrap below the threshold)



// To alpha, 0 is back, 1 is object

float alpha = saturate(fc);



// Non-alpha

float nalpha = 1 - alpha;



// Seting colors

cor->R = object.R * alpha + back.R * nalpha;

cor->G = object.G * alpha + back.G * nalpha;

cor->B = object.B * alpha + back.B * nalpha;

}

}[/codebox]

Shared-memory:

[codebox]{

// Shared memory

__shared__ RGB24 block[BLOCK_DIM][BLOCK_DIM];

// Calculating X, Y

unsigned int yIndex = (blockIdx.y * blockDim.y) + threadIdx.y;

unsigned int xIndex = (blockIdx.x * blockDim.x) + threadIdx.x;



	// Valid X,Y flag (kept so that __syncthreads() below is reached by ALL threads)
	bool valid = ((xIndex < width) && (yIndex < height));

	// Calculating byte array index
	int idx = (yIndex * stride) + (xIndex * 3);

	// Copying to shared-memory (valid threads only)
	if (valid)
	{
		block[threadIdx.y][threadIdx.x] = (*(RGB24*)&idata[idx]);
	}

	// __syncthreads() must not sit inside a divergent branch
	__syncthreads();

	if (valid)
	{
		// Reading from shared-memory
		RGB24 cor = block[threadIdx.y][threadIdx.x];

	

	// Verify threshold limit

	if ((cor.R * 30 + cor.G * 59 + cor.B * 11) > threshold)

	{

		// Write color for > threshold

		(*(RGB24*)&idata[idx]) = object;

	}

	else

	{

		// Write color for < threshold

		(*(RGB24*)&idata[idx]) = back;

	}

}

}[/codebox]

Ideas? :unsure:

Thanks all,

What is your kernel configuration, i.e. block size?
Because three different implementations give similar results, something is wrong.
Btw, since just the transfer to the GPU takes ~100 ms, it will always be slower than the 52 ms on the CPU.
Btw, what does ms stand for here? Microsecond?

Block size = 16
ms always means millisecond; microsecond is µs (wikipedia)

Block size 16 is a bad number, see the manual.

You don’t have one read and one write… you have multiple reads and writes due to noncoalesced memory access. Your code shows only one access, but it’s being broken up into multiple transactions since it’s unaligned.

Copying the pixels to shared memory first can be done with a coalesced read, wasting no bandwidth.

Writing would also be more efficient.

This is probably your bottleneck.

The program is flawed in many ways.

I tried 32 too, and it was only a few ms faster… With a block size of 64 I got an error.

Well, I don’t know about that, because I tested something like this (with a black image):

[codebox]cor->R = cor->R + 50;

cor->G = cor->G + 50;

cor->B = cor->B + 50;[/codebox]

and I got only 50;50;50 pixels…


You should not get any error with this block size.
And your memory transfer time is too big for only ~25 MB.
