Currently I am working on an image processing app and trying to improve the performance of my algorithms. I hope someone can give me a hint or two on how to improve my current algorithm further.
I am processing six 320x240 RGBA camera video streams concurrently, where each image is processed within its own CUDA stream.
Currently my algorithms are limited to per-pixel operations. I have tried kernels with different access patterns:
image in global memory, each pixel (uchar4) loaded into shared memory, one thread working on each pixel (avg. 1.3 ms for 6 images on an 8600 GTS)
image in a 2D texture, each pixel (uchar4) loaded into shared memory, one thread working on each pixel (avg. 1.6 ms)
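For reference, the global-memory variant might look roughly like this (a sketch with my own names, not the original code; the per-pixel operation shown is just a placeholder channel swap):

```cuda
// One thread per uchar4 pixel; each pixel is staged through shared memory.
__global__ void processPixels(const uchar4 *in, uchar4 *out, int width, int height)
{
    extern __shared__ uchar4 tile[];   // one uchar4 per thread in the block

    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    unsigned int localId = threadIdx.y * blockDim.x + threadIdx.x;
    tile[localId] = in[y * width + x];   // consecutive threads read consecutive 4-byte words
    __syncthreads();

    uchar4 p = tile[localId];
    // Placeholder per-pixel operation: swap R and B channels.
    out[y * width + x] = make_uchar4(p.z, p.y, p.x, p.w);
}
```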
When I was experimenting with textures I expected a performance gain; at least the Sobel filter example that comes with the SDK led me to assume so. But the numbers show that this is not the case.
My first question therefore concerns the correct use of textures, as there is one thing that dissatisfies me.
When using textures (six textures, one in each stream) it looks like I have to bind/unbind each texture before/after the kernel call. Is there a possibility to bind the texture just once at initialisation time and invalidate the texture cache each time I upload a new image? And, more importantly, would this be beneficial?
Furthermore, I currently have to select the right texture inside the kernel via a large switch statement, because I have not found a way to pass a reference to a texture into the kernel. Is there a way to do this, and how?
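One workaround I have seen suggested (a sketch, untested on this exact setup): since texture references must be file-scope globals and cannot be passed as kernel arguments, the kernel can be templated on the stream index, so the compiler resolves the texture at compile time and no runtime switch remains in the generated code. All names here are illustrative:

```cuda
texture<uchar4, 2, cudaReadModeElementType> tex0;
texture<uchar4, 2, cudaReadModeElementType> tex1;
// ... tex2 .. tex5 declared the same way

template<int STREAM>
__device__ uchar4 fetchPixel(float x, float y)
{
    // This branch is resolved at compile time, one instantiation per stream.
    if (STREAM == 0) return tex2D(tex0, x, y);
    else             return tex2D(tex1, x, y);   // extend for tex2 .. tex5
}

template<int STREAM>
__global__ void convertKernel(uchar4 *out, int width)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    out[y * width + x] = fetchPixel<STREAM>(x + 0.5f, y + 0.5f);
}

// Host side: one explicit instantiation per stream, e.g.
// convertKernel<0><<<grid, block, 0, stream[0]>>>(dOut0, 320);
// convertKernel<1><<<grid, block, 0, stream[1]>>>(dOut1, 320);
```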
Furthermore, I was wondering whether it could be better to pack the six images together into one large image and process that instead. Does anyone have advice on that?
I think I would save some kernel launch overhead, but on the other hand I would lose the possibility of using multiple streams.
Currently I am using just a small test kernel for color conversion, and each stream only transfers data and launches this kernel. Between using streams and not using them there is currently no measurable performance difference in my setup. So the question is: should I stay with streams, or can I expect a noticeable performance gain from packing the images together?
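For comparison, the packed variant could look roughly like this on the host side (a sketch; buffer and kernel names, including `colorConvert`, are mine): the six 320x240 frames stacked vertically into one 320x1440 image, so one transfer and one launch replace six per-stream ones.

```cuda
void processPacked(const uchar4 *hPacked, uchar4 *dPacked)
{
    const int W = 320, H = 240, N = 6;
    size_t bytes = (size_t)W * H * N * sizeof(uchar4);
    cudaMemcpy(dPacked, hPacked, bytes, cudaMemcpyHostToDevice);

    dim3 block(32, 12, 1);
    dim3 grid(W / block.x, (H * N) / block.y, 1);   // 10 x 120 blocks
    colorConvert<<<grid, block>>>(dPacked, W, H * N);  // hypothetical kernel
}
```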
Finally, a question concerning my addressing pattern, as I am quite new to CUDA and not sure I got the docs right: do I get bank conflicts or uncoalesced reads with my addressing scheme?
As already said, I have one thread per uchar4 pixel. I calculate the image position by
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
and the store address within my shared block (a 1D array) with
LocalID = threadIdx.y * blockDim.x + threadIdx.x;
Finally, my launch configuration is blocksize(32,12,1) and grid(10,20,1), with 1584 bytes of shared memory and 9 registers for the kernel with textures (10 without). The occupancy calculator gives me a warp occupancy of 24.
All right, I hope I have not asked too much or too stupid stuff. Thanks to everybody who read this far.