Memory allocation problem with multi-GPU (Tesla K80), possible CUDA driver bug

My system is a Tesla K80 with driver 340.87 and CUDA toolkit 6.5 on Ubuntu.
I have a problem with memory allocation on the Tesla.
I want two concurrent processes to use two different devices, each with its own separate CUDA memory.
The problem is:
if I use cudaMalloc, all the memory is allocated on the first GPU (GPU ID 0), even if I explicitly call cudaSetDevice(1) for the second GPU.
It seems that unified memory access causes problems when memory is allocated on multiple GPUs from two different processes that use different devices (the processes are launched by MPI).
If I use cudaDeviceDisablePeerAccess(other_GPU_Id), the memory is just duplicated: cudaMalloc allocates memory on both devices simultaneously.
Attempting to use cudaMallocManaged(,,cudaMemAttachHost) instead of cudaMalloc crashes the process.
Is there a way around the problem?

cudaMalloc does not allocate memory on 2 devices simultaneously - under any circumstances.

In a multi-gpu setup, cudaMallocManaged will allocate memory on multiple devices simultaneously.

If you don’t need this capability and you are launching separate processes, then prior to the start of each process that uses the GPU, you can set an environment variable (CUDA_VISIBLE_DEVICES) that will restrict GPU usage by that process (including managed memory):

Process 1:

Process 2:

Note that in the above scenario, each process/app will see the selected GPU enumerated as device 0. That means each process/app should use cudaSetDevice(0);

More information about the environment variable:

Thanks a lot!
It works properly now.

Hi, txbob, thank you for your answer.
I ran a CUDA program as you suggested on a compute server with 4 K80s installed, but the app only recognizes the FB GPU memory.

The following is the test code:

#include <stdio.h>
#include <stdlib.h>

#include <cuda_runtime.h>
#include <driver_types.h>

int main() {
	int device_count = 0;
	// Query how many GPUs are visible to this process
	cudaGetDeviceCount(&device_count);
	printf("device_count=%d\n", device_count);

	int device_id;
	size_t free_mem;
	size_t total_mem;
	for(int i=0; i<device_count; i++) {
		cudaDeviceProp device_prop;
		// cudaMemGetInfo reports on the current device, so select it first
		cudaSetDevice(i);
		cudaGetDevice(&device_id);
		cudaGetDeviceProperties(&device_prop, i);
		cudaMemGetInfo(&free_mem, &total_mem);
		printf("device[%d]: free=[%.1f]M, total=[%.1f]M\n", device_id, (double)free_mem/1024/1024, (double)total_mem/1024/1024);
	}
	return 0;
}

The compile command is as follows:

nvcc -o test

If the execute command is as follows:


the output is as follows:
device[0]: free=[212.2]M, total=[11519.6]M
device[1]: free=[4143.2]M, total=[11519.6]M
device[2]: free=[3184.2]M, total=[11519.6]M
device[3]: free=[5502.4]M, total=[11519.6]M

If the execute command is as follows:


the output is as follows:
device[0]: free=[3184.7]M, total=[11519.6]M
device[0]: free=[5503.2]M, total=[11519.6]M

In fact, the apps always use the FB GPU memory, while the BAR1 GPU memory stays free, as shown below:

FB Memory Usage
    Total                       : 11519 MiB
    Used                        : 959 MiB
    Free                        : 10560 MiB
BAR1 Memory Usage
    Total                       : 16384 MiB
    Used                        : 4 MiB
    Free                        : 16380 MiB
Compute Mode                    : Default

How could I use the BAR1 memory? Thank you!

You’re confused about what BAR1 is. This was explained to you here:

BAR1 is not an “extra” memory region that you can use. It is a virtual mapping. From a programmer's perspective, you can ignore it. The memory reported by cudaMemGetInfo is an accurate reflection of the memory available to you as a programmer.

Yes! Thank you very much for the comments! That question was also posted by me.