Thank you for your reply, especially for the Parallel Reduction whitepaper… I’ll read it, and I’m sure it will help me solve my problem. :) My real problem with CUDA programming is still the difficulty of parallelizing serial ideas.
The images I’m dealing with are 752x480 pixels. I need to extract features for each pixel by analysing its neighborhood, so around each pixel of the image I’ll read a 16x16 window of neighboring pixels. As I said, the features I’ll extract are the mean values of the pixels in that window (for the three color components, R, G and B) and their variances (which give me simple texture information about the neighborhood).
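To make the computation concrete, here is a rough CPU-side sketch of the features for a single pixel. It is only an illustration, not my actual code: the names (`WindowFeatures`, `windowFeatures`), the interleaved 8-bit RGB layout, the way the 16x16 window is placed around the pixel, and the clamping at the image borders are all assumptions.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical result type: per-channel mean and variance of one window.
struct WindowFeatures {
    float mean[3];
    float variance[3];
};

// Assumes `rgb` holds width*height pixels, row-major, interleaved R,G,B (8 bits each).
// Assumes the window covers [px - win/2, px + win/2) x [py - win/2, py + win/2),
// clamped to the image, so border pixels use a smaller window.
WindowFeatures windowFeatures(const std::vector<std::uint8_t>& rgb,
                              int width, int height,
                              int px, int py, int win = 16)
{
    const int x0 = std::max(px - win / 2, 0);
    const int x1 = std::min(px + win / 2, width);
    const int y0 = std::max(py - win / 2, 0);
    const int y1 = std::min(py + win / 2, height);

    double sum[3]   = {0.0, 0.0, 0.0};  // running sum of values per channel
    double sumSq[3] = {0.0, 0.0, 0.0};  // running sum of squared values per channel

    for (int y = y0; y < y1; ++y) {
        for (int x = x0; x < x1; ++x) {
            for (int c = 0; c < 3; ++c) {
                const double v = rgb[(static_cast<std::size_t>(y) * width + x) * 3 + c];
                sum[c]   += v;
                sumSq[c] += v * v;
            }
        }
    }

    const double n = static_cast<double>(x1 - x0) * (y1 - y0);
    WindowFeatures f{};
    for (int c = 0; c < 3; ++c) {
        const double m = sum[c] / n;
        // Population variance via E[v^2] - E[v]^2, clamped against rounding error.
        const double var = std::max(0.0, sumSq[c] / n - m * m);
        f.mean[c]     = static_cast<float>(m);
        f.variance[c] = static_cast<float>(var);
    }
    return f;
}
```

Each pixel's six values depend only on the input image, so one GPU thread per output pixel looks like a natural first mapping, though I haven't tried it yet.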
After this feature extraction procedure I’ll end up with a feature vector for each pixel. These vectors will feed a neural network for pixel classification.
The neural network is already implemented on the GPU, but I got stuck on the feature extraction step, so I implemented that part on the CPU. In the end I only got 1 fps. :( So now I’m trying to do everything on the GPU to speed up the whole process.
I’ll write a kernel for this feature extraction step and post it here so you can tell me if I’m on the right track… :)