Sorry, this is not the answer you are expecting; instead I am asking a question of my own.
How do I use the NPP library with CUDA?
Where should the NPP functions be called? Can they be called inside a kernel definition?
My requirement is to perform an affine transformation on an image.
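NPP routines are host-side wrappers that launch their own kernels internally, so they are called from ordinary host code, not from inside a `__global__` function. As a rough illustration only (image size, step values, and the transform matrix below are placeholders, not anything from your setup), a single-channel affine warp might look like this:

```cpp
// Minimal host-side sketch of an NPP affine warp on an 8-bit single-channel
// image already resident in device memory. Sizes and the 2x3 matrix are
// example values; error checking is omitted for brevity.
#include <npp.h>

void warpExample(const Npp8u *d_src, int srcStep, Npp8u *d_dst, int dstStep)
{
    NppiSize srcSize = {1280, 960};
    NppiRect srcROI  = {0, 0, 1280, 960};
    NppiRect dstROI  = {0, 0, 1280, 960};

    // Example 2x3 affine matrix: identity plus a small translation.
    const double coeffs[2][3] = { {1.0, 0.0, 10.0},
                                  {0.0, 1.0,  5.0} };

    // This is an ordinary host call; NPP launches the kernels for you.
    nppiWarpAffine_8u_C1R(d_src, srcSize, srcStep, srcROI,
                          d_dst, dstStep, dstROI,
                          coeffs, NPPI_INTER_LINEAR);
}
```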
Did you find out how to perform multiple MCU rows at a time?
I set up the ROI in the same way that you have, except that I was only trying to perform the DCT for 4 MCU rows at a time, i.e. the ROI width was set to 1280 and the height to 4 * 8 (assuming 4:4:4 sampling).
To verify that it didn't work, I performed a cudaMemset of the memory to 0 before calling the DCT function. The result is that only every 4th MCU line down the image appears to be correct when I view the output JPEG file.
I am using CUDA 4.1 RC2 on Ubuntu 10.04.2 with a GTX 550 Ti graphics card and a Q9650 CPU.
If I set up the ROI to encode just one MCU row at a time then I get a correct image out; however, this does not give me enough performance (nvvp shows the graphics card as mostly idle, with a lot of time spent in the API).
Is this a bug in the NPP JPEG forward DCT, or is there something different I need to specify in the src step etc.?
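For reference, the multi-row call being described is roughly the following (a sketch only: the forward-DCT entry point and the coefficient-buffer layout are my assumptions about the NPP JPEG DCT family and should be checked against the NPP header for your CUDA version; the quant table is assumed to have been built elsewhere):

```cpp
// Sketch of the 4-MCU-row forward DCT call described above: 1280-wide
// grayscale source, 4:4:4 sampling, ROI of 1280 x 32. The coefficient buffer
// is assumed to hold one full row of 8x8 blocks per "line", 64 Npp16s per
// block; verify the function name and layout against your NPP version.
#include <npp.h>

void fwdDctFourRows(const Npp8u *d_pixels, int srcStep,
                    Npp16s *d_coeffs, const Npp16u *d_quantFwdTable)
{
    const int width        = 1280;
    const int mcuRows      = 4;                       // 4 rows of 8x8 blocks
    const int blocksPerRow = width / 8;               // 160 blocks across
    const int dctStep      = blocksPerRow * 64 * (int)sizeof(Npp16s);

    NppiSize roi = {width, mcuRows * 8};              // 1280 x 32

    nppiDCTQuantFwd8x8LS_JPEG_8u16s_C1R(d_pixels, srcStep,
                                        d_coeffs, dctStep,
                                        d_quantFwdTable, roi);
}
```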
I’m not familiar with your problem, but I have been able to use the forward and inverse JPEG DCT functions successfully using CUDA 4.0. I posted some proof-of-concept code in another post, but it was for CUDA 3.2: http://forums.nvidia.com/index.php?showtopic=191896&st=0&p=1185259&
If you’re still having trouble, I can post my functional CUDA 4.0 code, but if I recall correctly, it’s the same.
The issue I was having was just around the ROI for the forward DCT. I can process an ROI with a size of 8 x 8 or 1280 x 8 fine; it is only when I specify 1280 (width) x 32 (height) that things do not work correctly.
When the ROI is set to 1280 x 32, I only get valid DCTs output for the first quarter (1280 x 8) of the ROI; the rest are not calculated, but there is no error. The source image I am using is 1280 x 960, so I can only process one MCU row at a time. This does not give me enough performance.
This problem may be specific to the 4.1 RC or to Ubuntu 10.04 64-bit.
Unless your functional CUDA 4.0 code uses an ROI with a height > 8, I don’t think it will help me. I am surprised that you get enough performance from the card when only processing each 8 x 8 region separately.
P.S. The quant table setups seem fixed in 4.1; I checked the output values. Your proof-of-concept sample code was very useful as a starting point when I first started looking at this.
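The per-MCU-row fallback that does produce correct output would look roughly like the loop below (same caveats as the sketch above: function name, buffer stride, and the 1280 x 960 size are placeholders to check against your NPP version; as noted, the per-call overhead of this approach leaves the GPU mostly idle):

```cpp
// Per-MCU-row fallback: one 1280 x 8 ROI per call. Produces correct output,
// but the per-call API overhead leaves the GPU largely idle (as seen in nvvp).
// Pointer strides follow the assumed block-row coefficient layout above.
#include <npp.h>

void fwdDctRowByRow(const Npp8u *d_pixels, int srcStep,
                    Npp16s *d_coeffs, const Npp16u *d_quantFwdTable)
{
    const int width        = 1280;
    const int height       = 960;
    const int blocksPerRow = width / 8;               // 160 blocks across
    const int coeffsPerRow = blocksPerRow * 64;       // Npp16s per block row
    const int dctStep      = coeffsPerRow * (int)sizeof(Npp16s);

    NppiSize roi = {width, 8};                        // a single MCU row

    for (int mcuRow = 0; mcuRow < height / 8; ++mcuRow)
    {
        const Npp8u *src = d_pixels + (size_t)mcuRow * 8 * srcStep;
        Npp16s      *dst = d_coeffs + (size_t)mcuRow * coeffsPerRow;

        nppiDCTQuantFwd8x8LS_JPEG_8u16s_C1R(src, srcStep,
                                            dst, dctStep,
                                            d_quantFwdTable, roi);
    }
}
```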
For my application, I’m converting a C++ program to use NPP and some custom CUDA kernels that operate on grayscale images. I load a JPEG as coefficients, inverse-DCT it into pixels, then forward-DCT the pixels cropped by 4 on each side (which is the reason for the offset in the d_Pixels_row array) back into coefficients:
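A minimal sketch of that inverse-then-forward sequence follows; the function names, the 1280 x 960 size, and the quant-table setup are assumptions rather than the original code, and the 4-pixel crop is expressed as a source-pointer offset plus a smaller ROI (presumably what the d_Pixels_row offset accomplishes):

```cpp
// Sketch of the coefficient -> pixel -> cropped-coefficient round trip
// described above. Function names and sizes are assumptions; quant tables
// are presumed to have been initialized elsewhere.
#include <npp.h>

void recompressCropped(const Npp16s *d_srcCoeffs, int srcDctStep,
                       Npp8u *d_pixels, int pixelStep,
                       Npp16s *d_dstCoeffs, int dstDctStep,
                       const Npp16u *d_quantInvTable,
                       const Npp16u *d_quantFwdTable)
{
    NppiSize fullRoi = {1280, 960};

    // Inverse DCT + dequantization: coefficients back to 8-bit pixels.
    nppiDCTQuantInv8x8LS_JPEG_16s8u_C1R(d_srcCoeffs, srcDctStep,
                                        d_pixels, pixelStep,
                                        d_quantInvTable, fullRoi);

    // Forward DCT of the pixels cropped by 4 on every side: offset the
    // source pointer by (4, 4) and shrink the ROI by 8 in each dimension
    // (still a multiple of 8, so the blocks tile the cropped region).
    const Npp8u *d_croppedPixels = d_pixels + 4 * pixelStep + 4;
    NppiSize croppedRoi = {1280 - 8, 960 - 8};

    nppiDCTQuantFwd8x8LS_JPEG_8u16s_C1R(d_croppedPixels, pixelStep,
                                        d_dstCoeffs, dstDctStep,
                                        d_quantFwdTable, croppedRoi);
}
```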