Video processing advice

Good afternoon. I’m working on a project that runs in sequential mode (CPU only) and on a grid (still CPU only :P).

My task is to use CUDA to parallelize one part, which consists of several subparts:

  • decoding MJPEG, currently done with the ffmpeg Linux libraries (libavcodec, libavformat, etc.). This part works packet by packet, retrieving each frame from the packets of the main video stream
  • converting the acquired frame from YUV (I believe) to RGB
  • building the frame histogram, normalizing it and then quantizing it
  • segmenting the resulting frame into 4 categories (1, 2, 3 or 4), depending on color properties

and this is repeated for each and every frame of the video.
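For the YUV-to-RGB step, a plain CPU reference using the common BT.601 coefficients can later serve to validate the GPU version. This is only a sketch: it assumes full-range 8-bit data, while what libavcodec actually delivers may be video-range and would need offset/scaling first.

```cpp
#include <algorithm>
#include <cstdint>

// BT.601 YUV -> RGB for one pixel. Full-range 8-bit input is assumed;
// video-range data from libavcodec would need rescaling first.
inline void yuv_to_rgb(uint8_t y, uint8_t u, uint8_t v,
                       uint8_t &r, uint8_t &g, uint8_t &b)
{
    const float fy = static_cast<float>(y);
    const float fu = static_cast<float>(u) - 128.0f;
    const float fv = static_cast<float>(v) - 128.0f;

    auto clamp255 = [](float x) {
        return static_cast<uint8_t>(std::min(std::max(x, 0.0f), 255.0f));
    };

    r = clamp255(fy + 1.402f * fv);
    g = clamp255(fy - 0.344136f * fu - 0.714136f * fv);
    b = clamp255(fy + 1.772f * fu);
}
```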

I’m posting this to ask for some advice: how can I decode MJPEG in CUDA (each thread decodes a frame and then post-processes it)? I realized that NVCUVID only decodes MPEG-1/2 and H.264… so it seems there isn’t any easy way to do this?

I would appreciate any lead… because to make this worth the work, I need to run it all on the GPU.

I think you can ignore the first step at this point and proceed with steps 2, 3 and 4, which should give you a significant speedup with CUDA.

Hi there.

I’ve started with the histogram part (or at least, trying to)… and I’ve failed in two different ways:

1st approach:

I reused the 256-bin histogram example from the SDK and tried to adapt it to my problem; it compiles and runs. The problem is that in the example the histogram is computed from an “image” (random values, actually) that behaves like a grayscale image (one value per pixel), whereas I have an RGB color image and my adaptation fails: comparing the results from GPU and CPU, all I get are (very!) different values. Here is how it’s being done:

inline __device__ void addByte(volatile uint *s_WarpHist, uint dataR, uint dataG, uint dataB, uint threadTag)
{
	uint count, H, S, V, quantiz;

	// Normalization of bins (unused in this snippet at the moment)
	int factor = 0, ibinwert;
	float binwert;
	factor = 0x7ff; // NoBitsProBin = 11, factor = 2047 (decimal)

	RGB_To_HSV(dataR, dataG, dataB, &H, &S, &V); // convert the RGB values to HSV, as done on the CPU
	quantiz = QuantScalableUniform1(H, S, V);    // quantize the HSV values to get the intended bin index

	cuPrintf("[H S V]: [%d  %d  %d]\n", H, S, V); // actually, this is not printing anything at the moment :S

	// tagged update loop from the SDK histogram sample
	do {
		count = s_WarpHist[quantiz] & TAG_MASK;
		count = threadTag | (count + 1);
		s_WarpHist[quantiz] = count;
	} while (s_WarpHist[quantiz] != count);
}
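For reference, RGB_To_HSV and QuantScalableUniform1 above are my own device helpers ported from the CPU code. A standard CPU-side RGB-to-HSV conversion for cross-checking might look like the sketch below; the integer output ranges (H in 0-359, S and V in 0-255) are an assumption and must match whatever the CPU version uses before comparing histograms.

```cpp
#include <algorithm>
#include <cstdint>

// RGB (0-255 each) -> HSV with H in [0,360), S and V in [0,255].
// The output ranges are an assumption; match the CPU-side RGB_To_HSV.
inline void rgb_to_hsv(uint8_t r, uint8_t g, uint8_t b,
                       unsigned &h, unsigned &s, unsigned &v)
{
    const uint8_t maxc = std::max(r, std::max(g, b));
    const uint8_t minc = std::min(r, std::min(g, b));
    const int delta = maxc - minc;

    v = maxc;
    s = (maxc == 0) ? 0 : (255 * delta) / maxc;

    if (delta == 0) { h = 0; return; } // gray: hue undefined, use 0

    int hue;
    if (maxc == r)      hue = (60 * (g - b)) / delta;
    else if (maxc == g) hue = 120 + (60 * (b - r)) / delta;
    else                hue = 240 + (60 * (r - g)) / delta;

    if (hue < 0) hue += 360;
    h = static_cast<unsigned>(hue);
}
```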


and in

__global__ void histogram256Kernel(uint *d_PartialHistograms, uint *d_Data, uint dataCount)

I’m trying to do:



for (uint pos = UMAD(blockIdx.x, blockDim.x, threadIdx.x); pos < dataCount; pos += UMUL(blockDim.x, gridDim.x))
{
        uint dataR = d_Data[pos];
        uint dataG = d_Data[pos + 1];
        uint dataB = d_Data[pos + 2];

        addWord(s_WarpHist, dataR, dataG, dataB, tag);
}


The rest of the code is exactly the same as the SDK histogram example. How can I transform it to compute a correct RGB histogram?

2nd approach:

OK. Since I failed to get the 256-bin histogram from the RGB-converted video frame, I tried to use the NPP histogram… once again, without luck. Why? Because the given example is, once again, for grayscale images!

In the example, the image is read from a file:

        // declare a host image object for an 8-bit grayscale image

        npp::ImageCPU_8u_C1 oHostSrc;

        // load gray-scale image from disk

        npp::loadImage(fileName, oHostSrc);

        // declare a device image and copy-construct from the host image,

        // i.e. upload host to device

        npp::ImageNPP_8u_C1 oDeviceSrc(oHostSrc);

As my image is already in memory, I can’t use npp::loadImage(…). Another problem is that I’ll probably have to use the ImageNPP_8u_C3 type (since it’s RGB), but NPP only offers nppiHistogramEven_8u_C1R and nppiHistogramEven_8u_C4R… and nothing in between!

The steps I’m taking are:

npp::ImageCPU_8u_C1 oDeviceSrc((unsigned int)*pFrameRGB->data[0],2u);

        // pFrameRGB->data[0] is my RGB image source (coming from ffmpeg linux usage)

NppiSize oSizeROI = {oDeviceSrc.width(), oDeviceSrc.height()};

int nDeviceBufferSize;

nppiHistogramEvenGetBufferSize_8u_C1R(oSizeROI, levelCount ,&nDeviceBufferSize);

Npp8u * pDeviceBuffer;

NPP_CHECK_CUDA(cudaMalloc((void **)&pDeviceBuffer, nDeviceBufferSize));

// compute levels values on host

Npp32s levelsHost[levelCount];

NPP_CHECK_NPP(nppiEvenLevelsHost_32s(levelsHost, levelCount, 0, binCount));

// compute the histogram

NPP_CHECK_NPP(nppiHistogramEven_8u_C1R(oDeviceSrc.data(), oDeviceSrc.pitch(), oSizeROI,

                                       histDevice, levelCount, 0, binCount,

                                       pDeviceBuffer));


// copy histogram and levels to host memory

Npp32s histHost[binCount];

NPP_CHECK_CUDA(cudaMemcpy(histHost, histDevice, binCount * sizeof(Npp32s), cudaMemcpyDeviceToHost));

After this, “histHost” differs from the host version. :S Any tips/help, please?


I’ve already solved the first attempt to compute the histogram using the SDK example code. The trick was to convert the image (a 2D structure) into a 1D vector. After that (and after paying attention to the data types), I got it working.
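In case it helps others, the flattening is just a pitched-to-contiguous copy. The sketch below assumes interleaved RGB and a possibly padded row pitch, as with AVFrame::linesize in ffmpeg:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Flatten a pitched 2D image (rows padded to 'pitch' bytes, as with
// AVFrame::linesize) into a contiguous 1D vector of width*3 bytes per
// row, which is what the 1D histogram kernel expects.
std::vector<uint8_t> flatten(const uint8_t *src, int width, int height, int pitch)
{
    const int rowBytes = width * 3; // interleaved RGB
    std::vector<uint8_t> flat(static_cast<size_t>(rowBytes) * height);
    for (int y = 0; y < height; ++y)
        std::memcpy(&flat[static_cast<size_t>(y) * rowBytes],
                    src + static_cast<size_t>(y) * pitch,
                    rowBytes);
    return flat;
}
```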

But i would seriously like to accomplish the same with nvidia performance primitives. Can ANYONE help, please? I cant find proper examples, because there are just a few of them and… it seems that not everyone is capable of explain npp properly.