I’m doing some performance and algorithm optimization.
In addition to having better conditions, on paper my algorithm would run faster with a tuned implementation of the 2D DCT (Discrete Cosine Transform). This is mostly motivated by needing to do a frequency shift on real-valued input data.
I’m wondering if anybody knows of a tuned, supported version of the 2D DCT that I could drop into my code in the same way people use cuFFTW. I understand how to roll my own, but I worry that it won’t be optimized.
I took a look at NPP, and its DCTs seem to be aimed at JPEG and have fixed sizes, while I want to do a cosine transform on an image of arbitrary (or large) size, in the same way I would use MATLAB’s DCT command.
For example, I see in NPP:
7.50.2.2 NppStatus nppiDCTInitAlloc (NppiDCTState** ppState)
Input is expected in 8x8 macro blocks and output is expected to be in 64x1 macro blocks.
What I want is:
http://www.mathworks.com/help/images/ref/dct2.html
B = dct2(A) returns the two-dimensional discrete cosine transform of A. The matrix B is the same size as A and contains the discrete cosine transform coefficients B(k1,k2).
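To pin down exactly which transform is meant, here is a minimal CPU reference sketch in plain Python. It assumes the orthonormal DCT-II scaling that MATLAB documents for dct2 (1/sqrt(N) for the zero-frequency term, sqrt(2/N) otherwise) and applies the 1D transform along rows, then along columns. The function names `dct_1d` and `dct_2d` are hypothetical, and the direct summation is slow on purpose: it is only meant as something to validate a tuned GPU version against, not as the optimized routine being asked for.

```python
import math


def dct_1d(x):
    """Orthonormal 1D DCT-II of a sequence, by direct O(N^2) summation."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(
            x[j] * math.cos(math.pi * (2 * j + 1) * k / (2 * n))
            for j in range(n)
        )
        # Assumed scaling: matches the orthonormal convention used by dct2.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_2d(a):
    """2D DCT-II of a rectangular list of lists, same shape as the input."""
    # Transform every row, then every column of the row-transformed result.
    rows = [dct_1d(list(r)) for r in a]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    # cols is indexed [column][row]; transpose back to [row][column].
    return [list(r) for r in zip(*cols)]


if __name__ == "__main__":
    image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    for row in dct_2d(image):
        print(["%.4f" % v for v in row])
```

Because the 2D transform is separable, the same row-then-column structure carries over to any size of input, which is the property the fixed 8x8 block routines do not offer.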