Hi,
I found a strange problem while testing my program on a TX2 with JetPack 3.2. It appears when I just create a small unified memory buffer like this:
float *grad;
if (cudaMallocManaged((void**)&grad, 256 * sizeof(float)) != cudaSuccess) {
    printf("Malloc ret's cuda buffer failed.\n");
    throw;
}
I never use the buffer; I only create it. Even so, the process runs slower than when I don't allocate any unified memory. I also tried moving the allocation to different places in the program, for example the main() function or the thread function, but the result was the same.
The demo includes face detection using a deep network, plus some other algorithms. When I don't create the buffer, the timing is (ms):
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.279999, Thread = 4, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.400002, Thread = 5, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.396667, Thread = 1, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.500000, Thread = 0, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.536667, Thread = 2, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.643333, Thread = 3, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.014999, Thread = 4, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.110001, Thread = 5, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.129997, Thread = 1, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.175003, Thread = 0, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.187500, Thread = 2, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.282501, Thread = 3, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.267998, Thread = 4, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.356003, Thread = 5, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.388000, Thread = 1, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.419998, Thread = 0, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.426003, Thread = 2, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 92.494003, Thread = 3, face = 0.
When I create the buffer, the timing is (ms):
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.510002, Thread = 4, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.849998, Thread = 5, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.860001, Thread = 3, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.949997, Thread = 1, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 101.330002, Thread = 2, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 101.570000, Thread = 0, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.019997, Thread = 4, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.180000, Thread = 5, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.165001, Thread = 3, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.245003, Thread = 1, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.364998, Thread = 2, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.555000, Thread = 0, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 99.973335, Thread = 4, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.063332, Thread = 5, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.096664, Thread = 3, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.209999, Thread = 1, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.349998, Thread = 2, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.440002, Thread = 0, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.462502, Thread = 4, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.547501, Thread = 5, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.535004, Thread = 3, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.584999, Thread = 1, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.652496, Thread = 2, face = 0.
<../FaceCaptureDemo/main.cpp, 250>: all time = 100.727501, Thread = 0, face = 0.
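To put a number on the gap, the "all time" values in the two logs above can be averaged and compared. Below is a minimal sketch, assuming the log lines keep the exact format shown above. The helpers parse_all_time and mean_all_time are hypothetical names used only for illustration; they are not part of the demo.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: extracts the number after "all time = " from one log line.
// Returns false if the line does not contain that field.
bool parse_all_time(const std::string& line, double& ms) {
    const std::string key = "all time = ";
    const std::size_t pos = line.find(key);
    if (pos == std::string::npos) {
        return false;
    }
    return std::sscanf(line.c_str() + pos + key.size(), "%lf", &ms) == 1;
}

// Hypothetical helper: averages the "all time" field over every line that has it.
// Lines without the field are skipped; returns 0.0 if no line matches.
double mean_all_time(const std::vector<std::string>& lines) {
    double sum = 0.0;
    int count = 0;
    for (const std::string& line : lines) {
        double ms = 0.0;
        if (parse_all_time(line, ms)) {
            sum += ms;
            ++count;
        }
    }
    return count > 0 ? sum / count : 0.0;
}
```

Calling mean_all_time once on the lines from the run without the buffer and once on the lines from the run with it gives two averages, a and b; the relative slowdown is then 100.0 * (b - a) / a percent.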