Memory allocation problem with multi-GPU (Tesla K80), possible CUDA driver bug

Hi
My system is a Tesla K80, driver 340.87, CUDA toolkit 6.5, Ubuntu.
I have a problem with memory allocation on the Tesla.
I want two concurrent processes to use two different devices, and each should have its own separate CUDA memory.
The problem is:
If I use cudaMalloc, all the memory is allocated on the first GPU (device 0), even if I explicitly call cudaSetDevice(1) for the second GPU.
It seems unified memory access creates a problem when memory is allocated on multiple GPUs from two different processes that use different devices (the processes are launched by MPI).
If I use cudaDeviceDisablePeerAccess(other_GPU_Id), the memory is simply duplicated: cudaMalloc allocates memory on both devices simultaneously.
An attempt to use cudaMallocManaged(..., cudaMemAttachHost) instead of cudaMalloc crashes the process.
Is there a way around the problem?
Thanks
Sergey

cudaMalloc does not allocate memory on two devices simultaneously, under any circumstances.

In a multi-gpu setup, cudaMallocManaged will allocate memory on multiple devices simultaneously.
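To illustrate why managed memory is different from cudaMalloc here, a minimal sketch (assuming a CUDA 6.5-era system with unified memory support; the loop body is only a placeholder for real kernel launches):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);

    // A single managed allocation: the runtime makes this one pointer
    // valid on every visible device, which on a multi-GPU system can
    // mean backing storage/mappings on multiple GPUs at once.
    float *data = NULL;
    cudaMallocManaged(&data, 1 << 20);

    for (int i = 0; i < n; i++) {
        cudaSetDevice(i);
        // a kernel launched on any of these devices could dereference `data`
    }

    cudaFree(data);
    return 0;
}
```

This is the behavior that can surprise you when separate MPI processes each touch managed memory: every process's managed allocations involve all devices visible to that process.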

If you don’t need this capability and you are launching separate processes, then prior to the start of each process that uses the GPU, you can set an environment variable that will restrict GPU usage by that process (including managed memory):

Process 1:
CUDA_VISIBLE_DEVICES="0" ./my_app

Process 2:
CUDA_VISIBLE_DEVICES="1" ./my_app

Note that in the above scenario, each process/app will see the selected GPU enumerated as device 0. That means each process/app should use cudaSetDevice(0);
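To make that concrete, a rough sketch of what each process's code would look like (the error check is illustrative; both processes run this identical code, and which physical K80 each one gets is decided entirely by CUDA_VISIBLE_DEVICES at launch time):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // The only visible device is enumerated as 0, regardless of which
    // physical GPU CUDA_VISIBLE_DEVICES selected for this process.
    cudaSetDevice(0);

    float *buf = NULL;
    cudaError_t err = cudaMalloc(&buf, 100 * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // ... launch kernels on device 0 ...

    cudaFree(buf);
    return 0;
}
```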

More information about the environment variable:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars

Thanks a lot!
It works properly now.

Hi, txbob, thank you for your answer.
I've run a CUDA program as you suggested on a computing server with four K80s installed, but the app can only recognize the FB GPU memory.

The following is the test code:

#include <stdio.h>
#include <stdlib.h>

#include <cuda_runtime.h>
#include <driver_types.h>

int main() {
    int device_count = 0;
    cudaGetDeviceCount(&device_count);
    printf("device_count=%d\n", device_count);

    int device_id;
    size_t free_mem;
    size_t total_mem;
    for (int i = 0; i < device_count; i++) {
        cudaSetDevice(i);
        cudaDeviceProp device_prop;
        cudaGetDeviceProperties(&device_prop, i);
        cudaGetDevice(&device_id);
        cudaMemGetInfo(&free_mem, &total_mem);
        printf("device[%d]: free=[%.1f]M, total=[%.1f]M\n", device_id,
               (double)free_mem / 1024 / 1024, (double)total_mem / 1024 / 1024);
    }
    cudaDeviceReset();
    return 0;
}

The compile command is as follows:

nvcc main.cu -o test

If the command executed is as follows:

./test

the output is as follows:
device_count=4
device[0]: free=[212.2]M, total=[11519.6]M
device[1]: free=[4143.2]M, total=[11519.6]M
device[2]: free=[3184.2]M, total=[11519.6]M
device[3]: free=[5502.4]M, total=[11519.6]M

If the command executed is as follows:

CUDA_VISIBLE_DEVICES="2" ./test
CUDA_VISIBLE_DEVICES="3" ./test

the output is as follows:
device_count=1
device[0]: free=[3184.7]M, total=[11519.6]M
device_count=1
device[0]: free=[5503.2]M, total=[11519.6]M

In fact, the apps always use the FB GPU memory, while the BAR1 GPU memory stays free, as shown here:

FB Memory Usage
    Total                       : 11519 MiB
    Used                        : 959 MiB
    Free                        : 10560 MiB
BAR1 Memory Usage
    Total                       : 16384 MiB
    Used                        : 4 MiB
    Free                        : 16380 MiB
Compute Mode                    : Default

How could I use the BAR1 memory? Thank you!

You’re confused about what BAR1 is. This was explained to you here:

http://stackoverflow.com/questions/34989617/how-to-use-nvidia-k80

BAR1 is not an "extra" memory region that you can use. It is a virtual mapping. From a programmer's perspective, you can ignore it. The memory reported by cudaMemGetInfo is an accurate reflection of the memory available to you as a programmer.

Yes! Thank you very much for the comments! That question was also posted by me.