How would I best implement the standard discrete cosine transform (DCT), or a fast variant of it, using CUDA? I am aware of the DCT8X8 example included in the SDK. However, I would like to perform a ‘full-field’ version, that is, a single transform applied once to an entire 640x480 matrix, rather than one that works on local 8x8 blocks.
Has this been done before? What would be the best approach: a direct implementation, or some sort of adaptation of the CUFFT routines?
Thanks in advance,