Unified memory with CUDA on Jetson Nano needs memcpy?

After reading this guide I thought I no longer needed to use memcpy with unified memory on Tegra, but apparently I am wrong:

My main problem is that in this code, if I don't do the memcpy but instead pass the array directly to the CUDA kernel via unified memory, it doesn't work. It seems absurd to have to use memcpy on Tegra when there is unified memory, because on the SoC the GPU and CPU really do share the same memory:

inputImage = PPM_import("img/inputimage.ppm");

imageWidth = Image_getWidth(inputImage);
imageHeight = Image_getHeight(inputImage);
imageChannels = Image_getChannels(inputImage);

outputImage = Image_new(imageWidth, imageHeight, imageChannels);

hostInputImageData = Image_getData(inputImage);
hostOutputImageData = Image_getData(outputImage);
cudaMallocManaged((void **) &deviceInputImageData, imageWidth * imageHeight *
imageChannels * sizeof(float));
cudaMallocManaged((void **) &deviceOutputImageData, imageWidth * imageHeight *
imageChannels * sizeof(float));
cudaMallocManaged((void **) &deviceMaskData, maskRows * maskCols * sizeof(float));
//memcpy(deviceInputImageData, hostInputImageData, imageWidth * imageHeight * imageChannels * sizeof(float)); // <- WORKS
deviceInputImageData = Image_getData(inputImage); // <- DOES NOT WORK

cudaMemcpy(deviceMaskData, hostMaskData, maskRows * maskCols * sizeof(float), cudaMemcpyHostToDevice);

dim3 dimGrid(ceil((float) imageWidth/TILE_WIDTH),
ceil((float) imageHeight/TILE_WIDTH));
dim3 dimBlock(TILE_WIDTH,TILE_WIDTH,1);

myKernelProcessing<<<dimGrid,dimBlock>>>(deviceInputImageData, deviceMaskData, deviceOutputImageData,imageChannels, imageWidth, imageHeight);

I don’t understand where I’m wrong.

Thanks in advance

You may try these examples for a way of using unified memory:

In short, first allocate unified memory. You can then use the same address for both CPU and GPU processing: for example, read data into it from the CPU, then transform it from the GPU.

Thanks, but I did allocate the unified memory with cudaMallocManaged. In that case, shouldn't memcpy no longer be needed? Could you please modify my code so I can understand where I am going wrong in allocating the memory I pass to the CUDA kernel?

Thanks again

I took a few minutes to give you those links because they would help you understand this better.
You replied before trying any of them.
Sorry, but I'm not working for you.

I had already read those links before making the post; I apologize if it came across that way… Unfortunately, in the posts you linked (one is a topic that I created and that you answered brilliantly), the code is written with OpenCV, and I don't understand where I am wrong in my CUDA code. Sorry, but I searched the forum thoroughly before posting. If you want to be paid, I am still willing to pay you for the consultation.

1 Like

Sorry, I didn't remember your username. OK, I'll try to have a look at your code and help if I can, but I can't promise a timeframe, for personal reasons.

No problem,
thanks again

I fail to understand what your code does in some functions, but you could try to:

  1. First, allocate unified memory buffers for the kernel (mask), input, and output. These will be available from the CPU and GPU at the same address.
  2. Fill your unified-memory kernel buffer from the CPU. It will be available from the GPU as well.
  3. Get your input data from the CPU and copy it into the unified-memory input buffer.
  4. Process it from CUDA as a GPU buffer at the same address, using the kernel's unified-memory address. That should produce the output buffer.
  5. You may have to call cudaDeviceSynchronize() here in some cases.
  6. Read the GPU-processed output buffer from the CPU at the same unified output-buffer address.
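The steps above can be sketched as a minimal standalone program. This is not the poster's code: the buffer size, kernel body, and names are placeholders chosen only to illustrate the workflow.

```cpp
// Sketch only: illustrates the unified-memory workflow described above.
// The kernel body and sizes are stand-ins, not the poster's convolution code.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scaleKernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;  // placeholder for real processing
}

int main() {
    const int n = 1024;
    float *inBuf, *outBuf;

    // 1. Allocate unified memory: the same pointers are valid on CPU and GPU.
    cudaMallocManaged(&inBuf, n * sizeof(float));
    cudaMallocManaged(&outBuf, n * sizeof(float));

    // 2./3. Fill the input from the CPU. Write INTO the managed buffer;
    //       do not overwrite the managed pointer with another pointer.
    for (int i = 0; i < n; ++i) inBuf[i] = (float)i;

    // 4. Process on the GPU using the same addresses.
    scaleKernel<<<(n + 255) / 256, 256>>>(inBuf, outBuf, n);

    // 5. Synchronize before the CPU touches GPU-written data.
    cudaDeviceSynchronize();

    // 6. Read the result from the CPU through the same pointer.
    printf("outBuf[10] = %f\n", outBuf[10]);

    cudaFree(inBuf);
    cudaFree(outBuf);
    return 0;
}
```

Build with `nvcc` on the Jetson itself; on Tegra the managed allocations live in the shared physical memory, so no explicit host-device copies are needed here.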

Dear Honey, thank you. The problem was that in the posted code, since I was not using OpenCV, I was messing up the pointers. I followed your suggestions step by step and realized that I was not using managed memory when passing the pointer to the kernel. I rewrote everything using OpenCV and saw where I was wrong, thanks to your suggestions.

Thanks again
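For readers landing here later, the pointer mistake described in this thread can be sketched as follows (names are taken from the original post; `Image_getData` is assumed to return an ordinary host pointer):

```cpp
// Managed buffer allocated for use by both CPU and GPU:
cudaMallocManaged((void **)&deviceInputImageData, numBytes);

// WORKS: copies the pixel data into the managed buffer,
// which the kernel can then read.
memcpy(deviceInputImageData, Image_getData(inputImage), numBytes);

// DOES NOT WORK: overwrites the managed pointer with a plain host
// pointer (and leaks the managed allocation); the kernel then
// dereferences memory that was never allocated with cudaMallocManaged.
deviceInputImageData = Image_getData(inputImage);
```

So memcpy is not required by unified memory itself; it is simply one way to get the data into the managed buffer while keeping the managed pointer intact.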