Why does creating memory with cudaMallocManaged() slow down my process?

Hi,

I have found a strange problem while testing my program on a TX2 with JetPack 3.2. When I just create a small buffer with

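float *grad = nullptr;
cudaError_t err = cudaMallocManaged((void**)&grad, 256 * sizeof(float));
if (err != cudaSuccess) {
    printf("cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
    // A bare `throw;` outside a catch block calls std::terminate, so throw a real exception (needs <stdexcept>).
    throw std::runtime_error("cudaMallocManaged failed");
}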

and never use it, the process is slower than when I do not create any unified memory. I tried moving the allocation to different places in the program, such as the main() function or the thread function, but the result was the same.

The demo includes face detection using a deep network, plus other algorithms. When I don't create the buffer, the timing is (ms):

<../FaceCaptureDemo/main.cpp, 250>: all time = 92.279999, Thread = 4, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.400002, Thread = 5, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.396667, Thread = 1, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.500000, Thread = 0, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.536667, Thread = 2, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.643333, Thread = 3, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.014999, Thread = 4, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.110001, Thread = 5, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.129997, Thread = 1, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.175003, Thread = 0, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.187500, Thread = 2, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.282501, Thread = 3, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.267998, Thread = 4, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.356003, Thread = 5, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.388000, Thread = 1, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.419998, Thread = 0, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.426003, Thread = 2, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.494003, Thread = 3, face = 0.

When I create the buffer, the timing is (ms):

<../FaceCaptureDemo/main.cpp, 250>: all time = 100.510002, Thread = 4, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.849998, Thread = 5, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.860001, Thread = 3, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.949997, Thread = 1, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 101.330002, Thread = 2, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 101.570000, Thread = 0, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.019997, Thread = 4, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.180000, Thread = 5, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.165001, Thread = 3, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.245003, Thread = 1, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.364998, Thread = 2, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.555000, Thread = 0, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 99.973335, Thread = 4, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.063332, Thread = 5, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.096664, Thread = 3, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.209999, Thread = 1, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.349998, Thread = 2, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.440002, Thread = 0, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.462502, Thread = 4, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.547501, Thread = 5, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.535004, Thread = 3, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.584999, Thread = 1, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.652496, Thread = 2, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.727501, Thread = 0, face = 0.

Hi,

We recommend checking our documentation for guidance:
https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-management

Unified memory adds some overhead because it requires additional coherency and cache maintenance operations during kernel launches, synchronization, and prefetching hint calls. However, it can pay off considerably when the access pattern is repetitive.

Try to find a suitable memory type from the guide above first.
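
To check whether the launch and synchronization overhead described above is what you are seeing, a small micro-benchmark along these lines may help. It is only a sketch under assumptions: the names emptyKernel and timeLaunches are made up for illustration, the iteration count is arbitrary, and pinned memory from cudaHostAlloc() is used as just one possible alternative, not necessarily the right one for your application. It times an empty kernel launch plus cudaDeviceSynchronize() with no extra allocation, with a live managed buffer, and with a pinned buffer of the same size.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel: only the launch/sync cost is of interest here.
__global__ void emptyKernel() {}

// Average wall-clock time (microseconds) of one launch + synchronize.
static double timeLaunches(int iterations) {
    emptyKernel<<<1, 1>>>();          // warm-up, not timed
    cudaDeviceSynchronize();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        emptyKernel<<<1, 1>>>();
        cudaDeviceSynchronize();
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(end - start).count() / iterations;
}

int main() {
    const int iterations = 10000;     // arbitrary; raise it for steadier numbers
    const size_t bytes = 256 * sizeof(float);

    // 1) Baseline: no managed or pinned buffer is alive.
    double baseline = timeLaunches(iterations);

    // 2) A managed buffer exists but is never touched, as in the question.
    float *grad = nullptr;
    cudaError_t err = cudaMallocManaged((void**)&grad, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    double withManaged = timeLaunches(iterations);
    cudaFree(grad);

    // 3) Same size as pinned (page-locked) host memory, as one alternative to try.
    float *pinned = nullptr;
    err = cudaHostAlloc((void**)&pinned, bytes, cudaHostAllocMapped);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    double withPinned = timeLaunches(iterations);
    cudaFreeHost(pinned);

    printf("launch+sync, no buffer      : %.2f us\n", baseline);
    printf("launch+sync, managed buffer : %.2f us\n", withManaged);
    printf("launch+sync, pinned buffer  : %.2f us\n", withPinned);
    return 0;
}
```

If the managed case is clearly slower per launch, multiplying that difference by the number of launches and synchronizations in one frame of the demo should give a rough idea of whether it explains the gap in the logs. The numbers will depend on the board, clocks, and JetPack version, so treat them as relative only.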

Thanks.