I re-read this thread twice and failed to understand: why not program this problem so that each thread calculates one or several pixels in the result image A, rather than the contribution to that pixel from the source image B? In other words, why not anchor the threads to pixels of result image A rather than the source image B? This seems to eliminate the issue of thread synchronization.
Isn’t this essentially how GPU-based matrix multiplication works?
In fact, I just solved the problem in the way you suggest: I found another approach, using each thread to compute the result in A from the data in B.
This does seem to be the right way to do this kind of thing…
Atomic functions might work too; I'm looking into that right now.
As far as I remember, there was an example implementation of convolution in the SDK or floating around the Web. Have you taken a look? It might illustrate some relevant techniques, such as texture fetches.
It’s a bit old by now, and some of it can be done better even on a GT200, but it’s a start.
Generally, on any parallel platform, you want to avoid atomic operations whenever possible. If each thread can write one pixel (or, preferably, several) without collaboration, that’s generally better (i.e., spread the work over output pixels rather than input pixels).
Note also that you want to avoid random accesses to global memory as much as possible.
A simple example of where threads have to write to the same memory location is histograms; it’s worth looking into that example as well to get some more insight. It’s in the SDK with slides.