improving algorithm performance

Hello everyone,
I am currently working on an image processing app and trying to improve the performance of my algorithms. I hope someone can give me a hint or two on how to improve my current approach.
I am processing 6 camera video streams (320x240 RGBA) concurrently, where each image is processed within its own CUDA stream.
Currently my algorithms are limited to per-pixel operations. I have tried kernels with different access patterns:

  1. Image in global memory, each pixel (uchar4) loaded into shared memory and one thread working on each pixel (avg. 1.3 ms per 6 images on an 8600 GTS).

  2. Image in a 2D texture, each pixel (uchar4) loaded into shared memory and one thread working on each pixel (avg. 1.6 ms).

When I was experimenting with textures I expected a performance gain; at least the Sobel filter example that comes with the SDK led me to expect one. But the numbers show that this is not the case.
My first question therefore relates to the correct use of textures, as there is one thing that bothers me.
When using textures (6 textures, one per stream) it looks like I have to bind/unbind each texture before/after the kernel call. Is there a possibility to bind the texture just once at initialisation time and invalidate the texture cache each time I upload a new image? And more importantly, would this be beneficial?
Furthermore, I currently have to check in a large switch statement which texture to choose within the kernel, because I have not found a way to pass a reference to the texture. Is there a way to do this, and if so, how? (A rough sketch of what I mean follows below.)
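To make the second point clearer, here is a rough sketch of what I am doing at the moment. The names (tex0, tex1, texId, ConvertFromTexture) are just placeholders for this post, not my real code:

// Sketch of the current texture setup (placeholder names, not the real code).
// Texture references have to be file-scope globals, so they cannot be passed
// to the kernel as arguments - hence the switch on an integer id.
texture<uchar4, 2, cudaReadModeElementType> tex0;
texture<uchar4, 2, cudaReadModeElementType> tex1;
// ... tex2 to tex5 declared the same way

__global__ void ConvertFromTexture(uchar4 *pDst, unsigned int uibDstPitch,
                                   unsigned int uiWidth, unsigned int uiHeight,
                                   int texId)
{
	unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
	if (x >= uiWidth || y >= uiHeight)
		return;

	uchar4 pix;
	switch (texId)   // this is the switch I would like to get rid of
	{
		case 0: pix = tex2D(tex0, x, y); break;
		case 1: pix = tex2D(tex1, x, y); break;
		// ... cases 2 to 5
		default: return;
	}

	// ... per-pixel color conversion as in the global-memory kernel ...
	pDst[y * (uibDstPitch >> 2) + x] = pix;
}

// Host side, once per frame and stream: bind, launch, unbind.
// cudaBindTextureToArray(tex0, pArray0);
// ConvertFromTexture<<<grid, block, 0, stream0>>>(pDst0, uibDstPitch, 320, 240, 0);
// cudaUnbindTexture(tex0);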

Furthermore, I was wondering whether it might be better to pack the images together into one large image and process that instead. Does anyone have advice on this?
I think I would save some time on kernel launches, but on the other hand I would lose the possibility to use multiple streams.
Currently I am using just a small test kernel for color conversion, and each stream does nothing but transfer data and launch this kernel. Between using streams and not using them there is currently no measurable performance difference in my setup. Therefore the question is: should I stay with streams, or can I expect a noticeable performance gain from packing the images together?

Finally, a question concerning my addressing pattern, as I am quite new to CUDA and not sure I got the docs right: do I run into bank conflicts or uncoalesced reads with my addressing scheme?

As already said, I have one thread per uchar4 pixel. I calculate the image position by

unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

and the store address within my shared block (1D array) with
LocalID = threadIdx.y * blockDim.x + threadIdx.x;

Finally, my launch configuration is blocksize(32,12,1) and grid(10,20,1), with 1584 bytes of shared memory and 9 registers for the kernel with textures (10 without). The occupancy calculator gives me a warp occupancy of 24.
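For reference, this is roughly how one such launch looks on the host side (the kernel and argument names are just placeholders):

// Sketch of one kernel launch with the configuration above (placeholder names).
dim3 block(32, 12, 1);              // 384 threads per block
dim3 grid(320 / 32, 240 / 12, 1);   // = (10, 20, 1), exactly covers a 320x240 image
// 32 * 12 * sizeof(uchar4) = 1536 bytes of shared memory for the pixel block;
// the reported 1584 bytes presumably also include the kernel arguments.
myKernel<<<grid, block, 0, stream>>>(pSrc, pDst, uibSrcPitch, uibDstPitch,
                                     320, 240, uiIntensityThresh);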

Alright, I hope I have not asked too much or anything too stupid. Thanks to everybody who read this far.

Cheers
Christoph

If all you’re doing is color conversion, I assume that an output pixel needs exactly one input pixel. So, since there’s no need to share data among threads, I don’t see why you would need to read data into shared memory. Just operate on unqualified variables (registers).

Kernels launched in different streams will get serialized - CUDA executes exactly one kernel at a time. Currently streams are useful for asynchronous operation with respect to the CPU, as well as for overlapping memcopies with kernel execution.
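Roughly, that kind of overlap looks like the sketch below (placeholder names; the host buffers have to be page-locked, i.e. allocated with cudaMallocHost, for the async copies to actually overlap):

// Sketch: overlap host<->device copies with kernel execution using streams.
// The kernels themselves still run one at a time, but the copy for stream i+1
// can overlap with the kernel that is running in stream i.
for (int i = 0; i < 6; ++i)
{
	cudaMemcpyAsync(d_src[i], h_src[i], imageBytes,
	                cudaMemcpyHostToDevice, stream[i]);
	myKernel<<<grid, block, 0, stream[i]>>>(d_src[i], d_dst[i] /*, ... */);
	cudaMemcpyAsync(h_dst[i], d_dst[i], imageBytes,
	                cudaMemcpyDeviceToHost, stream[i]);
}
cudaThreadSynchronize();   // wait until all streams have finished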

Are all your reads/writes coalesced? Run your program through the visual profiler. The profiler would also indicate whether you’re getting smem bank conflicts (though, as I said before, use registers unless you’re communicating between threads or saving multiple reads).
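If the visual profiler gives you trouble, the command-line profiler can report the same counters; as far as I remember no special nvcc flags or instrumentation are needed. Something along these lines (assuming your toolkit version supports these counter names):

# Enable the command-line profiler and point it to a config file.
export CUDA_PROFILE=1
export CUDA_PROFILE_CONFIG=profile_config

# profile_config - one counter per line, e.g.:
# gld_incoherent   (uncoalesced global loads)
# gld_coherent     (coalesced global loads)
# gst_incoherent   (uncoalesced global stores)
# warp_serialize   (warps serialized by shared-memory bank conflicts)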

Paulius

Hello,

it is true that I always need exactly one input pixel. The first kernel I tested did not use any shared memory at all.

However, when loading the input pixels into shared memory first, I get a performance increase of approx. 50%. I do not have an explanation for that, as I thought the same as you. I attach the code for the kernels here; maybe you have an idea what is going wrong.

// without shared mem
// (kernel signature reconstructed here from the parameters used in the body)
__global__ void ColorConvert(const uchar4 *pSrc, uchar4 *pDst,
                             unsigned int uibSrcPitch, unsigned int uibDstPitch,
                             unsigned int uiWidth, unsigned int uiHeight,
                             unsigned int uiIntensityThresh)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;

	if (x < uiWidth && y < uiHeight)
	{
		// pitches are in bytes; >> 2 converts them to a stride in uchar4 elements
		uchar4 pix = pSrc[y * (uibSrcPitch >> 2) + x];

		////////////////// color conversion /////////////////////
		unsigned uiDif = pix.x + pix.y + pix.z;

		if (uiDif > uiIntensityThresh)
		{
			pix.x = (pix.x * 255) / uiDif;
			pix.y = (pix.y * 255) / uiDif;
		}
		else
			pix.x = pix.y = 0;

		pix.z = 0;
		////////////////////////////////////////////////////////
		pDst[y * (uibDstPitch >> 2) + x] = pix;
	}
}

// with shared mem
// (kernel signature and the LocalBlock declaration reconstructed; the block
//  size 32x12 is taken from the launch configuration mentioned above)
__global__ void ColorConvertShared(const uchar4 *pSrc, uchar4 *pDst,
                                   unsigned int uibSrcPitch, unsigned int uibDstPitch,
                                   unsigned int uiWidth, unsigned int uiHeight,
                                   unsigned int uiIntensityThresh)
{
	__shared__ uchar4 LocalBlock[32 * 12];

	unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
	unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

	// note: the __syncthreads() calls inside the if() are only safe because
	// grid * block exactly covers the 320x240 image, so no thread skips the branch
	if (x < uiWidth && y < uiHeight)
	{
		unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;
		LocalBlock[idx] = pSrc[y * (uibSrcPitch >> 2) + x];
		__syncthreads();

		////////////////// color conversion /////////////////////
		unsigned uiDif = LocalBlock[idx].x + LocalBlock[idx].y + LocalBlock[idx].z;

		if (uiDif > uiIntensityThresh)
		{
			LocalBlock[idx].x = (LocalBlock[idx].x * 255) / uiDif;
			LocalBlock[idx].y = (LocalBlock[idx].y * 255) / uiDif;
		}
		else
			LocalBlock[idx].x = LocalBlock[idx].y = 0;

		LocalBlock[idx].z = 0;
		////////////////////////////////////////////////////////
		__syncthreads();
		pDst[y * (uibDstPitch >> 2) + x] = LocalBlock[idx];
	}
}

Furthermore, I have a question concerning the use of the visual profiler. Do I have to instrument the kernel code first, and if so, what is the nvcc option for that? As soon as I enable the signal list in Session Settings/Configuration, I get the following error message:

Error -94 in reading profiler output.

Minimum expected columns(method,gputime,cputime,occupancy) not found in profiler output file.

Thanks for your help

Christoph