Hello,
I’m starting to work with CUDA to see if I can speed up one of my algorithms. I started with the Dr. Dobb’s tutorials, which show how easy, fantastic and fast CUDA is. I played with some of the examples and then tried coding something of my own (useless, but just to see if I could manage it).
As it turns out it was easy, fantastic and… well… it runs 20% slower than the CPU version, which clearly indicates that I’m doing something wrong. After playing with cudaprof and seeing huge numbers in the “uncoalesced” columns, it’s quite clear that my small program spends essentially all its time waiting for memory accesses. So I went back to the programming guide to get a better idea of how things work, but before I jump in and code the real problem I’d like to make sure I’m understanding things right.
What I want to write is basically an image filter (B/W, 1 byte per pixel, 1600x1200) that generates a value for each pixel by operating on the pixel itself and its neighbors (a 5x5 area centered on the pixel). The processing needs a lot of external, pixel-dependent data stored in 1600x1200 matrices (much of it is precalculated stuff to speed up the calculations on the CPU; right now I have 11 floats per pixel).
Without killing you with all the details (which I can always do later), what I’d like to know is:
- When writing this kind of filter, is it better to work pixel-by-pixel (1 pixel per thread) or line-by-line (each thread processes one full line, or a fraction of one, the idea being to reuse loaded data as much as possible)? Am I right that texture memory would be better than global memory for the read-only image/external data I need? Or should I use global memory, cache as much as possible in shared memory, and organize my blocks so that they match what I can store in shared?
- The documentation says memory reads incur a huge penalty (400-600 cycles). Does this mean that if, instead of loading values from memory, I can recalculate them in fewer than those 400-600 cycles, I’ll be better off? For example, right now I have two float tables encoding the real/imaginary parts of a unit vector; I could halve the memory accesses by loading just the phase and recalculating sin/cos every time.
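To make the shared-memory option from my first question concrete, here is roughly what I have in mind: one output pixel per thread, with each block staging its tile of the image (plus a 2-pixel apron for the 5x5 neighborhood) in shared memory. This is only a sketch with a dummy 5x5 mean as the filter; the names (`filter5x5`, `TILE`, `APRON`) and the tile size are placeholders I made up, and I haven’t benchmarked it.

```cuda
#include <cuda_runtime.h>

#define TILE  16   // 16x16 output pixels per block (placeholder size)
#define APRON 2    // radius of the 5x5 neighborhood

// Each block loads a (TILE+4)x(TILE+4) tile of the input into shared
// memory; the inner TILE x TILE threads then each compute one output pixel.
__global__ void filter5x5(const unsigned char *in, unsigned char *out,
                          int width, int height)
{
    __shared__ unsigned char tile[TILE + 2 * APRON][TILE + 2 * APRON];

    // Global coordinates of this thread's load, including the apron.
    int gx = blockIdx.x * TILE + threadIdx.x - APRON;
    int gy = blockIdx.y * TILE + threadIdx.y - APRON;

    // Clamp to the image borders while loading.
    int cx = min(max(gx, 0), width - 1);
    int cy = min(max(gy, 0), height - 1);
    tile[threadIdx.y][threadIdx.x] = in[cy * width + cx];
    __syncthreads();

    // Only the inner TILE x TILE threads produce output.
    if (threadIdx.x >= APRON && threadIdx.x < TILE + APRON &&
        threadIdx.y >= APRON && threadIdx.y < TILE + APRON &&
        gx < width && gy < height)
    {
        int sum = 0;
        for (int dy = -APRON; dy <= APRON; ++dy)
            for (int dx = -APRON; dx <= APRON; ++dx)
                sum += tile[threadIdx.y + dy][threadIdx.x + dx];
        out[gy * width + gx] = (unsigned char)(sum / 25);  // dummy 5x5 mean
    }
}
```

I’d launch it with `dim3 block(TILE + 2 * APRON, TILE + 2 * APRON)` (20x20 = 400 threads) and `dim3 grid((1600 + TILE - 1) / TILE, (1200 + TILE - 1) / TILE)`. The 11 floats of external data per pixel could presumably be fetched the same way, or through textures, which is exactly what I’m unsure about.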
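And this is the kind of trade I mean for the unit-vector tables: read one float (the phase) instead of two (real/imaginary) and rebuild the vector with sin/cos. Again just a sketch with names I made up (`unit_vector`, `phase`); whether it actually wins should depend on whether the extra arithmetic hides the saved memory latency.

```cuda
#include <math.h>

// Rebuild the real/imaginary parts of a unit vector from its phase.
// On the device, sincosf() computes both values in one call (and there
// is a faster, lower-precision __sincosf() intrinsic as well).
__host__ __device__ inline void unit_vector(float phase, float *re, float *im)
{
#ifdef __CUDA_ARCH__
    sincosf(phase, im, re);   // *im = sin(phase), *re = cos(phase)
#else
    *re = cosf(phase);        // host fallback, e.g. for testing
    *im = sinf(phase);
#endif
}
```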
Thanks in advance.