Unified memory with CUDA on Jetson Nano needs memcpy?

After reading this guide I thought I no longer needed to use memcpy with unified memory on Tegra, but apparently I am wrong:

My main problem is that in this code, if I don't do the memcpy but instead pass the array directly to the CUDA kernel via unified memory, it doesn't work. It seems absurd to have to use memcpy on Tegra when there is unified memory, because on the SoC the GPU and CPU really do share the same memory:

inputImage = PPM_import("img/inputimage.ppm");

imageWidth = Image_getWidth(inputImage);
imageHeight = Image_getHeight(inputImage);
imageChannels = Image_getChannels(inputImage);

outputImage = Image_new(imageWidth, imageHeight, imageChannels);

hostInputImageData = Image_getData(inputImage);
hostOutputImageData = Image_getData(outputImage);
cudaMallocManaged((void **) &deviceInputImageData, imageWidth * imageHeight *
imageChannels * sizeof(float));
cudaMallocManaged((void **) &deviceOutputImageData, imageWidth * imageHeight *
imageChannels * sizeof(float));
cudaMallocManaged((void **) &deviceMaskData, maskRows * maskCols * sizeof(float));
//memcpy(deviceInputImageData, hostInputImageData, imageWidth * imageHeight * imageChannels * sizeof(float)); // <- WORKS
deviceInputImageData = Image_getData(inputImage); // <- DOES NOT WORK

cudaMemcpy(deviceMaskData, hostMaskData, maskRows * maskCols * sizeof(float), cudaMemcpyHostToDevice);

dim3 dimGrid(ceil((float) imageWidth/TILE_WIDTH),
ceil((float) imageHeight/TILE_WIDTH));
dim3 dimBlock(TILE_WIDTH,TILE_WIDTH,1);

myKernelProcessing<<<dimGrid,dimBlock>>>(deviceInputImageData, deviceMaskData, deviceOutputImageData,imageChannels, imageWidth, imageHeight);

I don’t understand where I’m wrong.

Thanks in advance

You may try these examples for a way of using unified memory:

In short, first allocate unified memory. You can then use the same address for both CPU and GPU processing: for example, read data into it from the CPU, then transform it from the GPU.

Thanks, but I did allocate the unified memory with cudaMallocManaged. In that case, shouldn't memcpy no longer be needed? Could you please modify my code so I can understand where I am going wrong in allocating the memory I pass to the CUDA kernel?

Thanks again

I took a few minutes to give you those links because they would help you understand this better.
You replied before trying any of them.
Sorry, but I'm not working for you.

I had already read those links before making the post; I apologize if it came across that way… Unfortunately, in the posts you linked (one is a topic that I created and that you answered brilliantly), the code is written with OpenCV, and I don't understand where I am wrong in my CUDA code. Sorry, but I searched the forum thoroughly before posting. If you want to be paid, I am still willing to pay you for the consultation.

1 Like

Sorry, I didn't remember your username. OK, I'll try to have a look at your code and help if I can, but I can't promise a timeframe, for personal reasons.

No problem,
thanks again

I fail to understand what your code does in some functions, but you could try to:

  1. First, allocate unified memory buffers for the kernel (mask), input, and output. These will be available from the CPU and GPU at the same address.
  2. Fill your unified-memory kernel buffer from the CPU. It will be available from the GPU as well.
  3. Get your input data from the CPU and copy it into the unified-memory input buffer.
  4. Process it from CUDA as a GPU buffer at the same address, using the kernel's unified-memory address. That should produce the output buffer.
  5. You may have to call cudaDeviceSynchronize() here in some cases.
  6. Read the GPU-processed output buffer from the CPU at the same unified output-buffer address.
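The steps above can be sketched as a minimal standalone program. This is not the poster's code: the buffer size, kernel body, and names are placeholders chosen only to illustrate the workflow.

```cpp
// Sketch only: illustrates the unified-memory workflow described above.
// The kernel body and sizes are stand-ins, not the poster's convolution code.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scaleKernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;  // placeholder for real processing
}

int main() {
    const int n = 1024;
    float *inBuf, *outBuf;

    // 1. Allocate unified memory: the same pointers are valid on CPU and GPU.
    cudaMallocManaged(&inBuf, n * sizeof(float));
    cudaMallocManaged(&outBuf, n * sizeof(float));

    // 2./3. Fill the input from the CPU. Write INTO the managed buffer;
    //       do not overwrite the managed pointer with another pointer.
    for (int i = 0; i < n; ++i) inBuf[i] = (float)i;

    // 4. Process on the GPU using the same addresses.
    scaleKernel<<<(n + 255) / 256, 256>>>(inBuf, outBuf, n);

    // 5. Synchronize before the CPU touches GPU-written data.
    cudaDeviceSynchronize();

    // 6. Read the result from the CPU through the same pointer.
    printf("outBuf[10] = %f\n", outBuf[10]);

    cudaFree(inBuf);
    cudaFree(outBuf);
    return 0;
}
```

Build with `nvcc` on the Jetson itself; on Tegra the managed allocations live in the shared physical memory, so no explicit host-device copies are needed here.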

Dear Honey, thank you. The problem was that in the posted code, since I was not using OpenCV, I was messing up the pointers. I followed your suggestions step by step and realized that I was not using managed memory when passing the pointer to the kernel. I rewrote everything using OpenCV and saw where I was wrong, thanks to your suggestions.

Thanks again
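For readers landing here later, the pointer mistake described in this thread can be sketched as follows (names are taken from the original post; `Image_getData` is assumed to return an ordinary host pointer):

```cpp
// Managed buffer allocated for use by both CPU and GPU:
cudaMallocManaged((void **)&deviceInputImageData, numBytes);

// WORKS: copies the pixel data into the managed buffer,
// which the kernel can then read.
memcpy(deviceInputImageData, Image_getData(inputImage), numBytes);

// DOES NOT WORK: overwrites the managed pointer with a plain host
// pointer (and leaks the managed allocation); the kernel then
// dereferences memory that was never allocated with cudaMallocManaged.
deviceInputImageData = Image_getData(inputImage);
```

So memcpy is not required by unified memory itself; it is simply one way to get the data into the managed buffer while keeping the managed pointer intact.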