cudaMalloc: out of memory, although the GPU memory is enough

I have run into a strange problem where cudaMalloc reports an "out of memory" error even though there is enough free GPU memory.
The computer I used has two GTX 1070 GPUs and runs Windows 10.
The situation is this: I have two programs that solve the same problem.
The first program uses two CPU threads; each thread drives one of the two GPUs, and the two GPUs solve the problem cooperatively. This program works correctly.
The second program uses the same two GPUs to solve the same problem; the only difference is that each GPU is driven by a separate MPI process.
Strangely, this second program fails because cudaMalloc reports "out of memory", even though the program needs only about half of the total GPU memory.
I call cudaSetDevice before cudaMalloc to make sure the two MPI processes operate on different GPUs.
I am using CUDA 10.0, and the MPI package is Microsoft MPI (MS-MPI). SLI is disabled in the driver control panel.
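For reference, here is a minimal sketch (not my actual program) of the per-rank setup I described: each MPI rank selects one GPU, prints the free/total memory that rank actually sees via cudaMemGetInfo, and then allocates. The 2 GiB allocation size is just an example figure. Printing cudaMemGetInfo right before the failing cudaMalloc should show whether each rank really landed on a different device and how much memory was actually free at that point.

```cpp
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

// Abort the whole MPI job with a readable message on any CUDA error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            MPI_Abort(MPI_COMM_WORLD, 1);                             \
        }                                                             \
    } while (0)

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int deviceCount = 0;
    CUDA_CHECK(cudaGetDeviceCount(&deviceCount));
    // Map rank -> device; with two ranks and two GPUs this is 0 and 1.
    int device = rank % deviceCount;
    CUDA_CHECK(cudaSetDevice(device));

    // Report what this rank actually sees on its device.
    size_t freeMem = 0, totalMem = 0;
    CUDA_CHECK(cudaMemGetInfo(&freeMem, &totalMem));
    printf("rank %d: device %d, free %zu MiB / total %zu MiB\n",
           rank, device,
           freeMem / (1024 * 1024), totalMem / (1024 * 1024));

    // Example allocation (2 GiB); this is where "out of memory" appears.
    void *dptr = nullptr;
    CUDA_CHECK(cudaMalloc(&dptr, (size_t)2 << 30));
    CUDA_CHECK(cudaFree(dptr));

    MPI_Finalize();
    return 0;
}
```

Built with nvcc plus the MS-MPI headers/libraries and launched with `mpiexec -n 2`.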

Any idea what the problem could be? Thanks very much!

Are you running out of CPU memory by any chance?
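Since this is Windows, one quick way to rule that out is to print host physical and virtual memory right before the failing cudaMalloc. A small sketch using the Win32 GlobalMemoryStatusEx call (the placement before cudaMalloc is the assumption here):

```cpp
#include <cstdio>
#include <windows.h>

int main()
{
    // Query host physical and virtual memory; on Windows (WDDM),
    // exhausted host memory/address space can surface as a CUDA
    // allocation failure.
    MEMORYSTATUSEX status;
    status.dwLength = sizeof(status);
    if (GlobalMemoryStatusEx(&status)) {
        printf("phys: %llu / %llu MiB free, virt: %llu / %llu MiB free\n",
               status.ullAvailPhys / (1024 * 1024),
               status.ullTotalPhys / (1024 * 1024),
               status.ullAvailVirtual / (1024 * 1024),
               status.ullTotalVirtual / (1024 * 1024));
    }
    return 0;
}
```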

I got the same problem calling dlib with GPU support through the Python API in face_recognition.

Not sure if it is a dlib problem?

The CUDA test passed, though.

cuda_data_ptr.cpp, line 58: out of memory.