cudaMalloc extremely slow on gtx980 and titan

jskang · July 30, 2015, 12:59pm

I have two GTX 980s and two Titans in my server, and cudaMalloc is extremely slow. The two GTX 980s and one Titan run in x8 slots, and the other Titan runs in an x16 slot. A CUDA program runs faster on my MacBook Pro than it does on this server. When I run my code and check nvidia-smi, no matter which device I select with cudaSetDevice, memory gets allocated on both the device I chose and the Titan in the x16 slot. My code isn’t doing anything fancy; it’s CUDA with some OpenCV. Has anyone run into a bug like this? I know that nvidia-smi doesn’t really work on GTX 980s, but I see the same problem even when I run on the other Titan.

Clochette · July 30, 2015, 2:06pm

Call cudaSetDevice FIRST, and then do the malloc.

jskang · July 30, 2015, 3:12pm

That’s what I’m doing in my code:

cudaSetDevice(2);
cudaMalloc((void**)&dev_result, result_size);
cudaMalloc((void**)&dev_dest_lines, line_size);
cudaMalloc((void**)&dev_src_lines, line_size);

cudaMemcpy(dev_dest_lines, g_dest_lines, line_size, cudaMemcpyHostToDevice);
cudaMemcpy(dev_src_lines, g_src_lines, line_size, cudaMemcpyHostToDevice);

Clochette · July 30, 2015, 3:14pm

Are you also doing cudaMallocHost? It has to be called after cudaSetDevice as well.

jskang · July 30, 2015, 4:17pm

I was using plain malloc. I tried cudaMallocHost instead, but it does the same thing.

njuffa · July 30, 2015, 4:27pm

It is not clear how you are measuring the performance of cudaMalloc(). If cudaMalloc() is the first CUDA API call [other than cudaSetDevice()] in your code it will trigger the creation of the CUDA context. It is my understanding that the mapping activities needed for UVM support can lengthen CUDA context creation time considerably on machines with large amounts of system memory, but this is a one-time startup cost.

Try calling cudaFree(0) to trigger the CUDA context creation, then measure the duration of the cudaMalloc() calls.
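A minimal sketch of that measurement follows. It assumes device index 0 and a 64 MiB test allocation, both picked only for illustration, and it uses host wall-clock timing. The `time_ms` helper is hypothetical, not part of CUDA. cudaSetDevice() and cudaFree(0) are timed together so the one-time initialization cost is captured regardless of which call actually triggers it.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock duration of a callable, in milliseconds.
template <typename F>
static double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const int device = 0;           // assumption: pick the index you care about
    const size_t bytes = 64 << 20;  // assumption: 64 MiB test allocation
    void* buf = nullptr;
    cudaError_t err = cudaSuccess;

    // One-time startup cost: select the device and force context creation.
    double init_ms = time_ms([&] {
        err = cudaSetDevice(device);
        if (err == cudaSuccess) err = cudaFree(0);
    });
    if (err != cudaSuccess) {
        fprintf(stderr, "init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // With the context already in place, this times only the allocation.
    double malloc_ms = time_ms([&] { err = cudaMalloc(&buf, bytes); });
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("context creation: %.2f ms\n", init_ms);
    printf("cudaMalloc:       %.2f ms\n", malloc_ms);

    cudaFree(buf);
    return 0;
}
```

If the first number dominates, the slowness is startup cost rather than the allocation itself.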

jskang · July 30, 2015, 4:48pm

njuffa, you were right. cudaMalloc was taking a long time because of the context creation; now cudaFree is the one that takes the longest. Is there a way to reduce the context creation cost? Also, there is still the issue of my program allocating memory on a device other than the one I set with cudaSetDevice. Would you happen to know anything about that as well?

Robert_Crovella · July 30, 2015, 4:52pm

If your program only intends to use a single device, you can limit the CUDA runtime for that session/run to only use a single device with the CUDA_VISIBLE_DEVICES environment variable, documented here:

Programming Guide :: CUDA Toolkit Documentation
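The variable is normally set in the shell that launches the program, for example `CUDA_VISIBLE_DEVICES=2 ./app`, where `./app` is a placeholder name. As a rough sketch, it can also be set from inside the program, on the assumption that this happens before the first CUDA runtime call in the process. The value "2" mirrors the cudaSetDevice(2) call earlier in the thread, and setenv is the POSIX function, so this version is Linux/macOS only.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // Assumption: no CUDA call has been made yet in this process.
    // "2" is the physical index used earlier in the thread; adjust as needed.
    if (setenv("CUDA_VISIBLE_DEVICES", "2", 1) != 0) {
        perror("setenv");
        return 1;
    }

    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("visible devices: %d\n", count);

    // Visible devices are renumbered from 0, so the one remaining GPU
    // is addressed as device 0 from here on.
    err = cudaSetDevice(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Setting the variable in the launching shell is the safer choice, since it does not depend on call ordering inside the program.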

jskang · July 30, 2015, 6:31pm

txbob, CUDA_VISIBLE_DEVICES worked! Thanks! Do you know why the CUDA runtime has this weird behaviour?

njuffa · July 30, 2015, 6:40pm

What specifically do you consider “weird behavior”?

jskang · July 30, 2015, 6:51pm

Without CUDA_VISIBLE_DEVICES set, when I run my program and call cudaSetDevice to use a GTX 980, the runtime allocates memory on the Titan as well. Memory is in use both on the GTX 980 that is running my code and on the Titan: the 980 uses about 300 MB and the Titan about 100 MB. That is what I see in nvidia-smi. After I set CUDA_VISIBLE_DEVICES, this problem doesn’t happen.

Robert_Crovella · July 30, 2015, 6:57pm

Which device indices (as enumerated by the CUDA runtime) are the GPUs in question?

The process of creating a CUDA context on a particular device consumes memory. It’s likely that the runtime is creating a context of some sort on the “unused” Titan, which would account for the 100 MB. My guess is that the Titan in question is enumerated as device 0, in which case the behavior doesn’t surprise me, although I can’t give you chapter and verse of the documentation which describes exactly why this should be the case.

Nevertheless, the CUDA runtime has all “exposed” devices in its view. This has widespread implications for UVA, UM, SLI, P2P, and many other mechanisms under the CUDA umbrella. If you want to limit this “view”, use CUDA_VISIBLE_DEVICES.

njuffa · July 30, 2015, 7:10pm

As txbob says, the combined footprint of the CUDA driver and CUDA runtime context on each device is in the 90 MB to 100 MB range. So even if you do not run anything on a device, this much memory is going to be occupied by the CUDA software stack itself.

Unified memory requires CUDA to map the memory from each GPU in the system and all of host memory into a single unified virtual address space. My understanding is that the vast majority of the time required for this is spent in OS calls, and it increases with the total amount of GPU + host memory that needs to be mapped.

As far as the enumeration of devices by the CUDA runtime goes, my understanding is that the CUDA runtime contains a heuristic that tries to assign the “most capable” device in a system as device 0. If the GPUs in question are GTX 980 and GTX Titan it stands to reason that the Titan would wind up as device 0 since it is the “more capable” device.

CUDA_VISIBLE_DEVICES can be used to exclude specific GPUs from both the enumeration and the memory mapping process performed by the CUDA runtime.
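To check which index the runtime assigned to each GPU, something along these lines should do. It is only a sketch: it prints each device as the runtime enumerates it, using the standard cudaDeviceProp fields, and the output format is arbitrary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Print each device in the order the CUDA runtime enumerates it.
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, i);
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d: %s\n", i, cudaGetErrorString(err));
            continue;
        }
        printf("device %d: %s, PCI bus %d, %.0f MiB, compute %d.%d\n",
               i, prop.name, prop.pciBusID,
               prop.totalGlobalMem / (1024.0 * 1024.0),
               prop.major, prop.minor);
    }
    return 0;
}
```

Running it once normally and once with CUDA_VISIBLE_DEVICES set shows how the list shrinks and how the remaining devices are renumbered from 0.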

jskang · July 30, 2015, 8:36pm

Oh man! Thanks guys! I’ve learned so much!
