Hi,
I’m new to CUDA… I’m just studying the platform and writing some examples for it.
So, for the moment, I’m writing a CUDA program that loads an image and applies a median/middle filter to it.
I wrote the program without CUDA and now I’m working on making it work with CUDA.
I think it should look like this:
1 load an image and put it into an array in shared memory (inputArray).
2 make a second array (outputArray) that will contain the result.
3 execute the kernel code
4 every GPU thread computes on a region of the array (a few elements of inputArray)
5 each thread makes some calculations and puts the result in outputArray.
6 read outputArray and write it to the image.
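The six steps above could be sketched roughly as follows. This is only a minimal sketch under assumptions, not a finished program: the image is assumed to be 8-bit grayscale stored row by row, `processImage` and `runOnGpu` are hypothetical names, the kernel is a placeholder that just copies each pixel, and error checking is left out. Note that memory allocated with `cudaMalloc` and filled with `cudaMemcpy` is device global memory; what CUDA calls `__shared__` memory is a small per-block scratch space that is filled from inside the kernel.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Steps 4-5 (placeholder): each thread handles one pixel and simply copies it.
// A real filter would replace the body with the per-pixel computation.
__global__ void processImage(const unsigned char* inputArray,
                             unsigned char* outputArray,
                             int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        outputArray[y * width + x] = inputArray[y * width + x];
}

// Hypothetical host wrapper covering steps 1-3 and 6.
std::vector<unsigned char> runOnGpu(const std::vector<unsigned char>& image,
                                    int width, int height)
{
    size_t bytes = image.size();                 // width * height bytes
    unsigned char* inputArray = nullptr;
    unsigned char* outputArray = nullptr;

    // Steps 1-2: allocate both arrays in global memory and upload the image.
    cudaMalloc(&inputArray, bytes);
    cudaMalloc(&outputArray, bytes);
    cudaMemcpy(inputArray, image.data(), bytes, cudaMemcpyHostToDevice);

    // Step 3: launch enough 16x16 blocks to cover the whole image.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    processImage<<<grid, block>>>(inputArray, outputArray, width, height);

    // Step 6: copy the result back to the host (this also waits for the kernel).
    std::vector<unsigned char> result(bytes);
    cudaMemcpy(result.data(), outputArray, bytes, cudaMemcpyDeviceToHost);

    cudaFree(inputArray);
    cudaFree(outputArray);
    return result;
}
```

Using a separate output array means no thread ever reads a pixel that another thread has already overwritten.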
I don’t know if this is the right way to proceed, but any comment on it will be appreciated.
I’m thinking of making a local copy per thread in the kernel code… so there should be a new point:
4.5 put the elements of the region into the thread’s local memory
What do you think about this?
What do you think about everything?
Thanks a lot to anyone who reads and replies to this post.
I think I was not so clear… so I’ll try again. I’m sorry…
I have a 2D array structure and I want every thread of the kernel to operate on a small region of this 2D array.
So I think that I should put my array in texture or global memory and then call the kernel.
I think this is not the best way to proceed… many accesses to the same memory from every thread can slow down the process.
So I’m asking for advice about it.
One way, I think, is to make a local copy of the region inside the kernel, do some calculations and put the result back in global memory.
In pseudocode…
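[b]float *array2D;   // 2D image in GPU global memory, stored row by row
InitCudaAndStructure();
callKernelCode(array2D);
…
and inside the kernel
callKernelCode(array2D){
float *region;   // local copy of the region for this thread
…
copyRegion(idThread, region);   // global memory -> local copy
someCalculation();
copyBack(idThread, region);     // local copy -> global memory
}[/b]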
where
copyRegion(…) copies elements from the global array to the kernel’s local array.
and
copyBack(…) copies the region back to its position in global memory.
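One possible way to turn the copyRegion / someCalculation / copyBack idea into a real kernel is sketched below. It is a sketch under assumptions rather than the definitive way to do it: the local copy is made once per thread block in `__shared__` memory (instead of once per thread), the window is fixed at 3x3, borders are handled by clamping, and the names `medianTile`, `medianFilterShared`, `TILE` and `RADIUS` are hypothetical. Each thread writes only its own output pixel, into a separate output array.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;                 // block is TILE x TILE threads (assumed)
constexpr int RADIUS = 1;                // 3x3 window (assumed)
constexpr int WIN = 2 * RADIUS + 1;
constexpr int SIDE = TILE + 2 * RADIUS;  // tile plus a border of RADIUS pixels

__global__ void medianTile(const float* in, float* out, int width, int height)
{
    __shared__ float tile[SIDE][SIDE];
    int x0 = blockIdx.x * TILE - RADIUS;   // image coordinates of tile[0][0]
    int y0 = blockIdx.y * TILE - RADIUS;

    // "copyRegion": the block cooperatively loads its tile and border,
    // clamping coordinates that fall outside the image.
    for (int ty = threadIdx.y; ty < SIDE; ty += TILE)
        for (int tx = threadIdx.x; tx < SIDE; tx += TILE) {
            int gx = min(max(x0 + tx, 0), width - 1);
            int gy = min(max(y0 + ty, 0), height - 1);
            tile[ty][tx] = in[gy * width + gx];
        }
    __syncthreads();

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= width || y >= height) return;

    // "someCalculation": gather the window and insertion-sort it.
    float v[WIN * WIN];
    int n = 0;
    for (int dy = 0; dy < WIN; ++dy)
        for (int dx = 0; dx < WIN; ++dx)
            v[n++] = tile[threadIdx.y + dy][threadIdx.x + dx];
    for (int i = 1; i < WIN * WIN; ++i) {
        float key = v[i];
        int j = i - 1;
        while (j >= 0 && v[j] > key) { v[j + 1] = v[j]; --j; }
        v[j + 1] = key;
    }

    // "copyBack": one median value per thread goes to global memory.
    out[y * width + x] = v[WIN * WIN / 2];
}

// Hypothetical host helper; error checking omitted for brevity.
std::vector<float> medianFilterShared(const std::vector<float>& img, int width, int height)
{
    size_t bytes = img.size() * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, img.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    medianTile<<<grid, block>>>(dIn, dOut, width, height);
    std::vector<float> result(img.size());
    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

With this layout each input pixel is read from global memory about once per block, and the repeated neighbourhood reads hit the on-chip tile instead.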
Thanks to all and sorry for my bad English!
I’m talking to myself like a crazy man… :(
What’s the problem?
Hi!
I think your idea is basically right; I would just add some suggestions.
First of all, you should consider which method you will use to compute the median filter. Basically, there are two kinds of methods: region (or sort) based, which are straightforward, and histogram based. The former find the middle intensity in the neighbourhood of each pixel using some kind of sorting method; the latter find it by (re-)constructing the histogram of the region. The former can be used on any kind of data at any precision the computer can hold; the latter are memory-consuming and are usually used only on 8-bit integer data (see e.g. [url=http://nomis80.org/ctmf.html]Median Filtering in Constant Time[/url], one of the fastest implementations, including C source).
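To make the difference concrete, here is a minimal sketch of both kinds of method applied to a single window. The function names `medianBySort` and `medianByHistogram` are hypothetical, and the insertion sort is just one simple choice of sorting method. Both functions are marked `__host__ __device__` so they can be tried on the CPU first and then called from a kernel.

```cuda
#include <cuda_runtime.h>

// Sort-based: works for any comparable type (float, int, ...).
// Sorts the caller's copy of the window in place and returns the middle element.
template <typename T>
__host__ __device__ T medianBySort(T* window, int n)
{
    for (int i = 1; i < n; ++i) {          // insertion sort, fine for small n
        T key = window[i];
        int j = i - 1;
        while (j >= 0 && window[j] > key) { window[j + 1] = window[j]; --j; }
        window[j + 1] = key;
    }
    return window[n / 2];
}

// Histogram-based: 8-bit data only, needs one counter per possible intensity.
__host__ __device__ unsigned char medianByHistogram(const unsigned char* window, int n)
{
    int hist[256] = {0};
    for (int i = 0; i < n; ++i)
        ++hist[window[i]];

    // Walk the histogram until more than half of the samples have been seen.
    int seen = 0;
    for (int v = 0; v < 256; ++v) {
        seen += hist[v];
        if (seen > n / 2)
            return (unsigned char)v;
    }
    return 255;                            // only reached if n <= 0
}
```

The 256 counters per window are the memory cost mentioned above; for 16-bit or floating-point data the histogram would be far larger, which is why the sort-based variant is the general one.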
In my opinion, whichever kind of algorithm you choose, the key to a fast implementation is how you make the local copy of the data. The straightforward approach is to make a local copy of the neighbouring pixels, find the middle intensity, discard the local copy, move one pixel and make a new local copy. This is, of course, very inefficient, because two neighbouring pixels generally share most of their neighbourhood (with a 3×3 window, for example, 6 of the 9 pixels are shared with the next position). When you have processed the neighbourhood of one pixel, you should discard only those pixels that won’t be needed for further processing, keep the rest and add the new pixels to them.
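One way to reuse the shared neighbours, sketched here for the histogram-based variant on 8-bit data, is to keep a running histogram while moving along a row: for each step only the column that leaves the window is removed and the column that enters is added. This is a sketch under assumptions: the names `histMedian`, `medianRowSliding` and `medianRows` are hypothetical, borders are clamped, and giving each thread a whole row with a private 256-entry histogram is chosen for clarity, not for speed.

```cuda
#include <cuda_runtime.h>

// Median of a 256-bin histogram that currently holds `count` samples.
__host__ __device__ unsigned char histMedian(const int* hist, int count)
{
    int seen = 0;
    for (int v = 0; v < 256; ++v) {
        seen += hist[v];
        if (seen > count / 2) return (unsigned char)v;
    }
    return 255;
}

__host__ __device__ int clampIdx(int i, int n)
{
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

// Filters row y with a (2r+1)x(2r+1) window, updating the histogram incrementally.
__host__ __device__ void medianRowSliding(const unsigned char* in, unsigned char* out,
                                          int width, int height, int y, int r)
{
    int hist[256] = {0};
    int count = (2 * r + 1) * (2 * r + 1);

    // Build the full histogram only once, for the first pixel of the row.
    for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx)
            ++hist[in[clampIdx(y + dy, height) * width + clampIdx(dx, width)]];
    out[y * width] = histMedian(hist, count);

    // For every further pixel, drop the leaving column and add the entering one.
    for (int x = 1; x < width; ++x) {
        for (int dy = -r; dy <= r; ++dy) {
            int row = clampIdx(y + dy, height) * width;
            --hist[in[row + clampIdx(x - r - 1, width)]];
            ++hist[in[row + clampIdx(x + r, width)]];
        }
        out[y * width + x] = histMedian(hist, count);
    }
}

// One thread per image row (assumed mapping).
__global__ void medianRows(const unsigned char* in, unsigned char* out,
                           int width, int height, int r)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    if (y < height)
        medianRowSliding(in, out, width, height, y, r);
}
```

Each step now touches 2(2r+1) pixels instead of (2r+1)², so the saving grows with the window size.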
Furthermore, the efficiency of reading and writing data is crucial. I think storing the input image in texture memory could come in handy. The output image would probably be stored in global memory, so memory accesses should be coalesced when writing the output.
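A rough sketch of that combination is shown below: the input is read through a texture and the output goes to a plain global-memory array. It is only a sketch under assumptions: it uses the texture object API of the CUDA runtime (which needs a reasonably recent GPU and toolkit), an 8-bit grayscale image, a fixed 3x3 window, and the hypothetical names `median3x3Tex` and `medianFilterTex`; error checking is omitted. Because threads with consecutive `threadIdx.x` write consecutive addresses of `out`, the writes of a warp can be coalesced.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void median3x3Tex(cudaTextureObject_t tex, unsigned char* out,
                             int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Read the 3x3 neighbourhood; clamp addressing handles the borders.
    unsigned char v[9];
    int n = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            v[n++] = tex2D<unsigned char>(tex, x + dx + 0.5f, y + dy + 0.5f);

    for (int i = 1; i < 9; ++i) {            // insertion sort of 9 values
        unsigned char key = v[i];
        int j = i - 1;
        while (j >= 0 && v[j] > key) { v[j + 1] = v[j]; --j; }
        v[j + 1] = key;
    }
    out[y * width + x] = v[4];               // neighbouring threads, neighbouring addresses
}

// Hypothetical host helper.
std::vector<unsigned char> medianFilterTex(const std::vector<unsigned char>& img,
                                           int width, int height)
{
    // Copy the image into a CUDA array, the storage that backs the texture.
    cudaChannelFormatDesc fmt = cudaCreateChannelDesc<unsigned char>();
    cudaArray_t arr = nullptr;
    cudaMallocArray(&arr, &fmt, width, height);
    cudaMemcpy2DToArray(arr, 0, 0, img.data(), width, width, height,
                        cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;     // no interpolation between pixels
    td.readMode = cudaReadModeElementType;   // return raw 8-bit values
    td.normalizedCoords = 0;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    unsigned char* dOut = nullptr;
    cudaMalloc(&dOut, img.size());
    dim3 block(32, 8);                       // 32 threads along x per row of the block
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    median3x3Tex<<<grid, block>>>(tex, dOut, width, height);

    std::vector<unsigned char> result(img.size());
    cudaMemcpy(result.data(), dOut, img.size(), cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(dOut);
    return result;
}
```

The texture cache is built for 2D locality, so the overlapping neighbourhood reads of nearby threads tend to be served from cache rather than from global memory.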