cudaMalloc slow on Kepler K10

Dear GPU developers,

I’m using a Kepler K10 with CUDA 5.5 for my C/Fortran GPU code, and I have noticed a performance problem with cudaMalloc. In a for loop I make a lot of cudaMalloc and cudaFree calls. The cudaMalloc calls take about 25 seconds in total to allocate the memory. Allocating the exact same amount of memory in the same test on a Tesla M2090 takes about half the time, roughly 13 seconds. These are not the first cudaMalloc calls invoked from my code, so it does not depend on driver initialization.
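To give an idea of the pattern, the timed section looks roughly like this (the iteration count and buffer size below are just placeholders, not my real values):

#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

/* wall-clock timer */
static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    const int    iterations = 10000;    /* placeholder count */
    const size_t bytes      = 1 << 20;  /* placeholder: 1 MiB per buffer */
    double start = wall_seconds();
    for (int i = 0; i < iterations; ++i) {
        void *d_buf = NULL;
        cudaMalloc(&d_buf, bytes);
        /* ... kernels working on d_buf ... */
        cudaFree(d_buf);
    }
    printf("time spent in the cudaMalloc/cudaFree loop: %.2f s\n",
           wall_seconds() - start);
    return 0;
}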

Could someone explain why cudaMalloc is slower on the Kepler K10 than on the Tesla M2090? Could it depend on a particular GPU configuration?

Thanks.

Is this data from a controlled experiment, i.e. all hardware and software are identical and only the GPU is switched? What operating system is this? It sounds like the code makes a call to cudaFree(0) or similar prior to the loop, so issues with driver unloading and driver initialization overhead can be excluded? To be on the safe side, do the GPUs run in persistent mode?

The K10 consists of two physical GPUs, each attached to its own memory. When comparing execution time to the M2090, is the CUDA program running on just one of the K10 GPUs, or both of them? I am wondering whether the fact that the execution time on the K10 is about twice that of the M2090 is related to the fact that the K10 is comprised of two GPUs.

I take it that the 25 seconds represent the combined time for all calls to cudaMalloc() and cudaFree(). That would seem to include a great number of calls to those functions. Roughly how many calls are we talking about?

Regardless of platform (CPU or GPU), making frequent calls to malloc and free inside a loop is rarely good practice. Could the code be modified to re-use existing allocations? If that is not possible, you could consider a custom sub-allocator to reduce the calls to cudaMalloc() and cudaFree().
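To illustrate the re-use idea, here is one possible shape of it: a grow-only scratch buffer that only calls cudaMalloc() when a request exceeds the current capacity. This is just a sketch; the function names are made up and there is no error checking:

#include <cuda_runtime.h>

static void  *d_scratch        = NULL;
static size_t scratch_capacity = 0;

/* return a device buffer of at least 'bytes' bytes, re-using the previous
   allocation whenever it is already large enough */
void *get_scratch(size_t bytes)
{
    if (bytes > scratch_capacity) {
        cudaFree(d_scratch);              /* cudaFree(NULL) is a no-op */
        cudaMalloc(&d_scratch, bytes);
        scratch_capacity = bytes;
    }
    return d_scratch;
}

/* call once, after the loop, to give the memory back */
void release_scratch(void)
{
    cudaFree(d_scratch);
    d_scratch        = NULL;
    scratch_capacity = 0;
}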

What I would do in a situation like this is switch to the latest CUDA version and driver to exclude any historical issues that may have already been addressed in the newer software. If the observed problem persists, filing a bug report with a small repro code that reproduces the performance difference reliably could be in order.

Hi njuffa, thanks for the reply. I’m running Red Hat Linux. The GPUs are in persistent mode. The hardware is not the same: the first system has an Intel Xeon E5645 and a Tesla M2090; the second has an Intel Xeon E5645 and 2 Kepler K10 cards (4 GPUs total).

I’m using just one GPU of the Kepler K10, so the comparison is GPU vs. GPU. The total execution time is lower on the Kepler K10, but the “cudaMalloc zone” is slower.

No, the 25 seconds are just for cudaMalloc. The total number of cudaMalloc calls is very large: thousands and thousands.

The problem is that I don’t know the maximum buffer size I need, and I don’t want to hold on to unused GPU memory when it is not necessary.

Re: The problem is that I don’t know the maximum buffer size I need, and I don’t want to hold on to unused GPU memory when it is not necessary.

What is the smallest buffer size you anticipate occurring, and what is the largest? Why is it considered important to use as little GPU memory as possible in this section of code?

As for the platform differences, I have a vague recollection that increasing the number of GPUs in a system could have a negative impact on cudaMalloc() performance. Not sure whether that was really so and if my memory is correct, whether this has been addressed or is a fundamental limitation. txbob might have better information on this.

Hi njuffa. As you suggested, I moved the cudaMalloc calls outside the loop and preallocate the buffers. Fortunately, I’m not using too much memory, even in the worst case. Now the code is 60% faster and scales up when using 2, 3, or 4 GPUs. Before that modification, 2 GPUs had the same performance as one.
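Roughly, the change looks like this (sizes simplified): before, the cudaMalloc/cudaFree pair was inside the loop; now there is a single worst-case allocation outside it:

#include <cuda_runtime.h>

void process_all(int iterations, size_t worst_case_bytes)
{
    void *d_buf = NULL;
    cudaMalloc(&d_buf, worst_case_bytes);   /* once, before the loop */

    for (int i = 0; i < iterations; ++i) {
        /* ... kernels that use d_buf ... */
    }

    cudaFree(d_buf);                        /* once, after the loop */
}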

So your suspicion about cudaMalloc seems to be true. For some reason, increasing the number of GPUs on the node makes cudaMalloc performance decrease.

EDIT:
So your suspicion about cudaMalloc seems to be true. For some reason, increasing the number of GPUs on the node makes cudaMalloc performance decrease (even though I’m using just one GPU!!).

When UVA (unified virtual addressing) is enabled, which will be true for 64-bit Linux systems for example, a call to cudaMalloc for one GPU will create an allocation on that GPU, but it must also update the virtual address space mappings of all other GPUs in the system. This is one of the reasons that cudaMalloc is often a synchronous activity, and it also gives a clue as to why increasing the number of GPUs in the system may increase the time spent in a given cudaMalloc call, even though the call appears to pertain to only a single device.
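If you want to confirm that UVA is in effect on your system, a quick check is to print the unifiedAddressing property of each device (small standalone sketch):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("device %d (%s): unifiedAddressing = %d\n",
               d, prop.name, prop.unifiedAddressing);
    }
    return 0;
}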

I suspect that if you are only using one GPU, and you launch your application with appropriate usage of the environment variable:

CUDA_VISIBLE_DEVICES="0" ./myapp

then this effect should be mitigated. But I have not tested this assertion. I may be wrong.

Hi txbob, I tried your suggestion, but the behaviour is the same. Any other ideas?

Not really. Probably the best suggestion is the one already given by njuffa. Move the malloc operations outside of time-critical loops, to the extent possible.