Concurrent CPU and GPU processing

Hi Folks,

I have a CV/image processing pipeline where work is split between the GPU and CPU; the GPU handles about 50% of the workload.

I am looking for ways/examples to operate the GPU and CPU concurrently to optimize performance. Currently, when I launch a CUDA kernel, we wait with cudaDeviceSynchronize(), which blocks the CPU. I would like the CPU to not block and continue processing another frame, and I would like the GPU to call a callback when it is done with its previously submitted stream/workload.

Could someone please recommend an example I can follow to implement an image processing pipeline where the GPU and CPU concurrently operate on different frames?


Check out the cudaStreamAddCallback() function; it calls a function of your choosing on the host (CPU) when your kernel is done executing on the GPU. You can also use the cudaEvent APIs to query progress asynchronously.
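A minimal sketch of that approach (the names FrameJob, onFrameDone, and the placeholder kernel are illustrative, not from the thread). Note that CUDA API calls are not allowed inside a stream callback, so the callback should only notify a worker thread:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-frame work item.
struct FrameJob { int frameId; };

// Runs on a CUDA runtime thread once all prior work in the stream finishes.
static void CUDART_CB onFrameDone(cudaStream_t stream, cudaError_t status,
                                  void *userData)
{
    FrameJob *job = static_cast<FrameJob *>(userData);
    // In a real pipeline this would push job onto the CPU thread's queue.
    printf("frame %d done, status=%d\n", job->frameId, (int)status);
}

__global__ void process(unsigned char *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 255 - buf[i];   // trivial placeholder workload
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    const int n = 1 << 20;
    unsigned char *buf;
    cudaMalloc(&buf, n);

    static FrameJob job = { 42 };
    process<<<(n + 255) / 256, 256, 0, stream>>>(buf, n);
    cudaStreamAddCallback(stream, onFrameDone, &job, 0);

    // The CPU is free to work on another frame here instead of blocking.

    cudaStreamSynchronize(stream);   // only to keep the demo from exiting early
    cudaFree(buf);
    cudaStreamDestroy(stream);
    return 0;
}
```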

Hi dusty_nv,

Thanks for your help and API suggestion.

I followed it, and while I do get the callback to work correctly, I feel I still have not gotten something in my implementation right.

When I successively launch multiple kernels one after another (each with a unique stream), I get a callback from all of them. I have a CPU thread that waits on its input queue (which is updated by the callback signaling completion of a kernel). This thread processes the output of the GPU while the GPU is working on the next image. I am getting a lot of memory corruption (segfault, bus error) in this thread when the GPU and CPU operate concurrently. If I do not operate them concurrently, then I do not get any memory corruption.

With this background, my question is: is there an example you can point me to which operates the GPU and CPU concurrently for the purpose of processing images?


There are examples in the CUDA toolkit samples that show the use of CUDA streams and concurrent kernels.

Are you sure that CPU and GPU are not operating on the same image or overlapping memory simultaneously?

There is overlapping memory access between the GPU and CPU as far as the input buffers are concerned. Those input buffers are read-only for both the GPU and the CPU. The output buffers are not accessed concurrently.

This is how my kernel looks:

__global__ void test_kernel(char *y1, char *y2, char *y3, char *y4, char *outputy)

y1, ..., y4 are input buffers of luma, and outputy is the output buffer into which the kernel writes its output.

After launching the kernel, I do not access outputy from the CPU, but I do access y2 and y3 on the CPU. I also access outputy_of_oldframe on the CPU.

I was reading the programming guide (section K.2.2).

I am wondering whether I need to call cudaStreamAttachMemAsync() for y1, ..., y4 (i.e., my input frame buffers)?



Please remember that managed memory is globally visible if it is not attached to any stream.
Once a kernel is launched on a unified buffer, all globally visible managed memory becomes inaccessible on the CPU.
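A sketch of what attaching managed buffers to streams looks like (buffer names and sizes are made up for illustration):

```cuda
#include <cstring>
#include <cuda_runtime.h>

__global__ void gpuWork(char *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 0;   // placeholder workload
}

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    char *a, *b;
    cudaMallocManaged(&a, 4096);   // globally visible by default
    cudaMallocManaged(&b, 4096);

    // Scope each buffer to one stream; afterwards, a kernel running in s1
    // no longer makes b inaccessible to the CPU, and vice versa.
    cudaStreamAttachMemAsync(s1, a, 0, cudaMemAttachSingle);
    cudaStreamAttachMemAsync(s2, b, 0, cudaMemAttachSingle);
    cudaStreamSynchronize(s1);     // the attachment takes effect at the sync
    cudaStreamSynchronize(s2);

    gpuWork<<<1, 256, 0, s1>>>(a, 4096);
    memset(b, 0, 4096);            // safe: b is attached to s2, not s1

    cudaStreamSynchronize(s1);     // required before the CPU touches a again
    a[0] = 1;

    cudaFree(a); cudaFree(b);
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    return 0;
}
```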


Hi AastaLLL and dusty_nv

Thanks for help.

I am able to have concurrent access to the buffers from the CPU and GPU by attaching the input and output buffers, i.e., I call cudaStreamAttachMemAsync() for all of the buffers (y1, ..., y4, and outputy).

However, I am running into a problem (bus error) when I try to retire (delete) one of the input buffers (y1, ..., y4). What is the best way to delete or "unattach" a frame buffer which was previously attached using cudaStreamAttachMemAsync()?

Please note that I am reading each input buffer (i.e., each of y1, ..., y4) this way:

        UniqueObj<Frame> frame(iFrameConsumer->acquireFrame());
        IFrame *iFrame = interface_cast<IFrame>(frame);
        if (!iFrame)
            TEST_ERROR_RETURN(!iFrame, "Failed to get IFrame interface");

        // Get the Frame's Image.
        Image *image = iFrame->getImage();
        EGLStream::NV::IImageNativeBuffer *iImageNativeBuffer
            = interface_cast<EGLStream::NV::IImageNativeBuffer>(image);
        TEST_ERROR_RETURN(!iImageNativeBuffer, "Failed to create an IImageNativeBuffer");

        if (!m_pCamInfo->bypassCamCapture) {
            aaFrameBuffer *framedata = new aaFrameBuffer;
            framedata->framefd = iImageNativeBuffer->createNvBuffer(
                ARGUSSIZE {m_pCamInfo->liveParams.inputVideoInfo.width,
                           m_pCamInfo->liveParams.inputVideoInfo.height},
                NvBufferColorFormat_YUV420, NvBufferLayout_Pitch, &status);
            AACAM_CAPTURE_PRINT("3 Starting frame capture %d\n", m_currentFrame);

            if (status != STATUS_OK)
                TEST_ERROR_RETURN(status != STATUS_OK, "Failed to create a native buffer");

            NvBufferGetParams(framedata->framefd, &(framedata->nvBufParams));

            framedata->dataY = (char *)mmap(NULL, framedata->fsizeY, PROT_READ | PROT_WRITE,
                                            MAP_SHARED, framedata->framefd, framedata->nvBufParams.offset[0]);
            framedata->dataU = (char *)mmap(NULL, framedata->fsizeU, PROT_READ | PROT_WRITE,
                                            MAP_SHARED, framedata->framefd, framedata->nvBufParams.offset[1]);
            framedata->dataV = (char *)mmap(NULL, framedata->fsizeV, PROT_READ | PROT_WRITE,
                                            MAP_SHARED, framedata->framefd, framedata->nvBufParams.offset[2]);

            if (framedata->dataY == MAP_FAILED) {
                printf("mmap failed : %s\n", strerror(errno));
And I am getting a bus error while retiring/deallocating the buffer:

        munmap(framedataDelElement.dataY, framedata->fsizeY);
        munmap(framedataDelElement.dataU, framedata->fsizeU);
        munmap(framedataDelElement.dataV, framedata->fsizeV);


In general, a bus error is caused by concurrent access.
You can allocate a buffer that is accessible from both the CPU and GPU, but you cannot access it concurrently.

We have a good example that demonstrates cudaStreamAttachMemAsync:

Calling cudaStreamSynchronize(stream[0]) is essential after attaching the memory.
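Following that advice, one way to "unattach" a buffer before retiring it might look like this (a sketch only; stream1 and the framedata fields are taken from the earlier posts, and this assumes the buffer was previously attached to stream1):

```cuda
// Hand the buffer back to the host and drain any pending GPU work in the
// stream before the CPU unmaps or frees it; skipping the synchronize here
// is a typical cause of the bus error described above.
cudaStreamAttachMemAsync(stream1, framedata->dataY, 0, cudaMemAttachHost);
cudaStreamSynchronize(stream1);   // attachment done, all prior kernels complete

munmap(framedata->dataY, framedata->fsizeY);  // now safe from the CPU
```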

Hi AastaLLL,

In the 0_Simple example, I see that the CPU and GPU do NOT access a buffer concurrently. This matches your suggestion of avoiding concurrent access.

I wish we could utilize the CPU's idle time, that is, do something useful with the input image while the GPU is processing it.

I tried a simple piece of code to see whether that is possible. Please see the code attached.

__global__ void
test_kernel(unsigned char *In, unsigned char *Out,
            int width, int height, int pitch)
{
    /* Compute indexes */
    const int cornerCol = blockDim.x * blockIdx.x;
    const int cornerRow = blockDim.y * blockIdx.y;
    const int globalCol1 = cornerCol + threadIdx.x;
    const int globalRow1 = cornerRow + threadIdx.y;
    int idx = (globalRow1 * width + globalCol1) * 3;

    if ((globalRow1 >= height) || (globalCol1 >= width)) return;
    if ((idx + 3) > (width * height * 3)) return;

    Out[idx]   = In[idx];
    Out[idx+1] = In[idx+1];
    Out[idx+2] = In[idx+2];
}

int runTest5()
{
    cudaStream_t stream1;
    cudaError_t cudaErr;

    cv::Mat im1 = cv::imread("Images/frame-36460.jpg", cv::IMREAD_COLOR);
    if (im1.empty()) {
        fprintf(stderr, "Cannot read input image from file\n");
        return -1;
    }

    // Specify a reasonable block size.
    const dim3 block(32, 32);
    const dim3 grid((im1.cols + block.x - 1) / block.x, (im1.rows + block.y - 1) / block.y);

    cudaErr = cudaStreamCreate(&stream1);

    unsigned char *inimg, *outimg;
    cudaMallocManaged(&inimg, im1.rows * im1.cols * 3);
    cudaMallocManaged(&outimg, im1.rows * im1.cols * 3);

    memcpy(inimg, im1.data, im1.rows * im1.cols * 3);

    cudaStreamAttachMemAsync(stream1, inimg, 0, cudaMemAttachHost);
    cudaStreamAttachMemAsync(stream1, outimg, 0, cudaMemAttachHost);

    test_kernel<<<grid, block>>>((unsigned char *)inimg, (unsigned char *)outimg,
                                 im1.cols, im1.rows, im1.step);

    //cudaDeviceSynchronize();
    cv::Mat im2 = cv::Mat(im1.rows, im1.cols, CV_8UC3, (void *)inimg, im1.step);

    cv::Mat gray, edge, draw;
    cvtColor(im2, gray, CV_BGR2GRAY);
    Canny(gray, edge, 50, 150, 3);
    edge.convertTo(draw, CV_8U);

    cudaFree(inimg); cudaFree(outimg);
    return 0;
}

// Program main
int main(int argc, char **argv)

This code displays the intended edge image. Here I am able to access unsigned char *inimg from both the GPU and CPU concurrently. This makes me wonder whether I have correctly understood what you said about concurrent access.


Hi Nvidia folks

Can you please advise whether the code above makes sense, and whether we can access the same input buffer concurrently from the GPU and CPU? In the code above I am able to do that.



Concurrent access is not supported on Jetson.
That means you need to follow the rule below, or a bus error/segmentation fault will occur:
process_1 > sync > process_2 > sync > process_1 > ...
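That rule still leaves room for pipelining, as long as the CPU and GPU work on different buffers between syncs. A double-buffering sketch under that assumption (invert and cpuPostProcess are placeholders, not from the thread):

```cuda
#include <cuda_runtime.h>

__global__ void invert(unsigned char *in, unsigned char *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 255 - in[i];   // placeholder GPU half of the pipeline
}

// Placeholder for the CPU half of the pipeline.
static void cpuPostProcess(unsigned char *out, int n) { (void)out; (void)n; }

int main()
{
    const int n = 1 << 20;
    cudaStream_t stream[2];
    unsigned char *in[2], *out[2];
    for (int b = 0; b < 2; ++b) {
        cudaStreamCreate(&stream[b]);
        cudaMallocManaged(&in[b], n);
        cudaMallocManaged(&out[b], n);
        cudaStreamAttachMemAsync(stream[b], in[b], 0, cudaMemAttachSingle);
        cudaStreamAttachMemAsync(stream[b], out[b], 0, cudaMemAttachSingle);
        cudaStreamSynchronize(stream[b]);
    }

    const int numFrames = 8;
    for (int i = 0; i < numFrames; ++i) {
        int cur = i % 2, prev = 1 - cur;
        // GPU starts on the current frame's buffers...
        invert<<<(n + 255) / 256, 256, 0, stream[cur]>>>(in[cur], out[cur], n);
        // ...while the CPU finishes the previous frame (a different buffer,
        // and only after its stream has been synchronized).
        if (i > 0) {
            cudaStreamSynchronize(stream[prev]);
            cpuPostProcess(out[prev], n);
        }
    }
    cudaStreamSynchronize(stream[(numFrames - 1) % 2]);
    cpuPostProcess(out[(numFrames - 1) % 2], n);
    return 0;
}
```

Each buffer is touched by exactly one side at a time, so the process > sync > process rule is respected while the CPU and GPU still overlap across frames.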

You can find more information in our CUDA documentation:


The Jetson TX2 has 2 multiprocessors (MPs) of 128 cores each. Is it possible to use these two MPs concurrently from 2 different processes?
I mean:
process A uses MP 1
process B uses MP 2