DirectShow + CUDA real-time video streaming

I have developed a real-time video streaming system with H.264/SVC using serial (non-parallel) code. The system supports 640x480 at 30 fps as the highest enhancement layer. Now I want to upgrade it to support 1920x1080 at 30 fps as the highest enhancement layer. I am currently working with CUDA to achieve this, but I have run into some issues. Let me first describe the steps involved in my work so that the problem is clear.


1 - Capture video using DirectShow at any desired resolution.
2 - Grab samples through the SampleGrabber callback function BufferCB( double time, unsigned char* pBuffer, long BufferLen ). The callback is invoked for every frame captured by the device, with pBuffer pointing to the frame data and BufferLen giving the size of the buffer.
3 - Convert the RGB data in pBuffer to YUV.
4 - Downsample the YUV frames to a number of layers for spatial scalability.
5 - Encode the downsampled YUV frames and stream them using RTSP.
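
Steps 3 and 4 can be sketched on the CPU roughly as follows. This is a minimal illustration rather than the actual implementation: it assumes pBuffer holds packed 24-bit pixels in B, G, R byte order with no row padding, uses the common BT.601 limited-range integer approximation, writes planar YUV 4:4:4, and halves the resolution with a plain 2x2 average. The names Plane, YuvFrame, bgrToYuv444 and downsample2x are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One 8-bit plane (Y, U or V), stored row by row without padding.
struct Plane {
    int width = 0;
    int height = 0;
    std::vector<std::uint8_t> data;
};

// Planar YUV 4:4:4 frame: three planes of identical size.
struct YuvFrame {
    Plane y, u, v;
};

inline std::uint8_t clampToByte(int value) {
    return static_cast<std::uint8_t>(value < 0 ? 0 : (value > 255 ? 255 : value));
}

// Step 3: packed B,G,R bytes (assumed layout) -> planar YUV 4:4:4.
YuvFrame bgrToYuv444(const unsigned char* pBuffer, int width, int height) {
    YuvFrame frame;
    for (Plane* p : {&frame.y, &frame.u, &frame.v}) {
        p->width = width;
        p->height = height;
        p->data.resize(static_cast<std::size_t>(width) * height);
    }
    const std::size_t pixelCount = static_cast<std::size_t>(width) * height;
    for (std::size_t i = 0; i < pixelCount; ++i) {
        const int b = pBuffer[3 * i + 0];
        const int g = pBuffer[3 * i + 1];
        const int r = pBuffer[3 * i + 2];
        // BT.601 limited-range integer approximation.
        frame.y.data[i] = clampToByte(((66 * r + 129 * g + 25 * b + 128) >> 8) + 16);
        frame.u.data[i] = clampToByte(((-38 * r - 74 * g + 112 * b + 128) >> 8) + 128);
        frame.v.data[i] = clampToByte(((112 * r - 94 * g - 18 * b + 128) >> 8) + 128);
    }
    return frame;
}

// Step 4: halve a plane in both dimensions by averaging each 2x2 block.
// A trailing odd row or column is dropped.
Plane downsample2x(const Plane& src) {
    Plane dst;
    dst.width = src.width / 2;
    dst.height = src.height / 2;
    dst.data.resize(static_cast<std::size_t>(dst.width) * dst.height);
    for (int y = 0; y < dst.height; ++y) {
        for (int x = 0; x < dst.width; ++x) {
            const std::size_t top = static_cast<std::size_t>(2 * y) * src.width + 2 * x;
            const std::size_t bottom = top + src.width;
            const int sum = src.data[top] + src.data[top + 1] +
                            src.data[bottom] + src.data[bottom + 1];
            dst.data[static_cast<std::size_t>(y) * dst.width + x] =
                static_cast<std::uint8_t>((sum + 2) >> 2);  // rounded average
        }
    }
    return dst;
}
```

Calling downsample2x repeatedly on each plane would give the lower spatial layers. Every output pixel depends only on a small fixed neighbourhood of input pixels, which is why both steps map naturally onto one GPU thread per output pixel.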

So far I have moved the colorspace conversion to CUDA, which reduces the processing time from 15 ms to 7 ms, including the copy of the host data in pBuffer to device memory. I want to reduce the processing time further, either by splitting the host data across multiple streams and processing it concurrently, or by using pinned (page-locked) host memory, possibly mapped for zero-copy access. I don't know how to do this, because the host buffer that holds the data is allocated internally by the Sample Grabber before it calls my ISampleGrabberCB callback, and I don't think I can change that. Is there any way to make this system-allocated memory behave as if it had been allocated with cudaHostAlloc(), so that I can process the data concurrently in multiple streams or use it as pinned (zero-copy) memory?
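
To make the multi-stream idea concrete, here is a minimal host-side sketch of how one frame might be split into row bands, one per stream. It contains no CUDA calls and is only an illustration under assumptions: the frame is packed with no row padding, and Band and splitIntoBands are hypothetical names. Bands are kept to an even number of rows so that a 2x2 block never straddles two bands.

```cpp
#include <cstddef>
#include <vector>

// A horizontal slice of the frame intended for one stream.
struct Band {
    int firstRow;            // index of the first row in the band
    int rowCount;            // number of rows in the band
    std::size_t byteOffset;  // offset of the band from the start of pBuffer
    std::size_t byteCount;   // size of the band in bytes
};

// Split a packed frame into at most numBands bands with even row counts.
// If the height is odd, the last band takes the leftover row.
std::vector<Band> splitIntoBands(int width, int height, int bytesPerPixel, int numBands) {
    std::vector<Band> bands;
    if (width <= 0 || height <= 0 || bytesPerPixel <= 0 || numBands <= 0) {
        return bands;
    }
    const std::size_t rowBytes = static_cast<std::size_t>(width) * bytesPerPixel;
    const int rowPairs = height / 2;
    int row = 0;
    for (int i = 0; i < numBands; ++i) {
        // Distribute row pairs as evenly as possible across the bands.
        const int pairs = rowPairs / numBands + (i < rowPairs % numBands ? 1 : 0);
        int rows = pairs * 2;
        if (i == numBands - 1) {
            rows = height - row;  // whatever remains, including an odd last row
        }
        if (rows <= 0) {
            continue;  // more bands requested than there are row pairs
        }
        bands.push_back(Band{row, rows, row * rowBytes, rows * rowBytes});
        row += rows;
    }
    return bands;
}
```

For a 1920x1080 frame at 3 bytes per pixel and 4 bands, this gives four bands of 270 rows and 1,555,200 bytes each. Each band's byteOffset and byteCount would be the arguments for an asynchronous copy such as cudaMemcpyAsync() on that band's stream. Those copies only overlap with kernel execution when the host buffer is page-locked, which is exactly the part I am stuck on.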