cudaMalloc slow on Kepler K10

Dear GPU developers,

I’m using a Kepler K10 with CUDA 5.5 for my C/Fortran GPU code, and I have noticed a performance problem with cudaMalloc. In a for loop I make a lot of cudaMalloc and cudaFree calls. The cudaMalloc calls take about 25 seconds in total to allocate the memory. Allocating the exact same amount of memory in the same test on a Tesla M2090 takes about half the time, roughly 13 seconds. These are not the first cudaMalloc calls invoked from my code, so it does not depend on driver initialization.
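To give an idea of the pattern, the timed section looks roughly like this (the iteration count and buffer size below are just placeholders, not my real values):

#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

/* wall-clock timer */
static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    const int    iterations = 10000;    /* placeholder count */
    const size_t bytes      = 1 << 20;  /* placeholder: 1 MiB per buffer */
    double start = wall_seconds();
    for (int i = 0; i < iterations; ++i) {
        void *d_buf = NULL;
        cudaMalloc(&d_buf, bytes);
        /* ... kernels working on d_buf ... */
        cudaFree(d_buf);
    }
    printf("time spent in the cudaMalloc/cudaFree loop: %.2f s\n",
           wall_seconds() - start);
    return 0;
}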

Could someone explain why cudaMalloc is slower on the Kepler K10 than on the Tesla M2090? Could it depend on a particular GPU configuration?

Thanks.

Is this data from a controlled experiment, i.e. all hardware and software are identical and only the GPU is switched? What operating system is this? It sounds like the code makes a call to cudaFree(0) or similar prior to the loop, so issues with driver unloading and driver initialization overhead can be excluded? To be on the safe side, do the GPUs run in persistent mode?

The K10 consists of two physical GPUs, each attached to its own memory. When comparing execution time to the M2090, is the CUDA program running on just one of the K10 GPUs, or both of them? I am wondering whether the fact that the execution time on the K10 is about twice that of the M2090 is related to the fact that the K10 is comprised of two GPUs.

I take it that the 25 seconds represent the combined time for all calls to cudaMalloc() and cudaFree(). That would seem to include a great number of calls to those functions. Roughly how many calls are we talking about?

Regardless of platform (CPU or GPU), making frequent calls to malloc and free inside a loop is rarely good practice. Could the code be modified to re-use existing allocations? If that is not possible, you could consider a custom sub-allocator to reduce the calls to cudaMalloc() and cudaFree().
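To illustrate the re-use idea, here is one possible shape of it: a grow-only scratch buffer that only calls cudaMalloc() when a request exceeds the current capacity. This is just a sketch; the function names are made up and there is no error checking:

#include <cuda_runtime.h>

static void  *d_scratch        = NULL;
static size_t scratch_capacity = 0;

/* return a device buffer of at least 'bytes' bytes, re-using the previous
   allocation whenever it is already large enough */
void *get_scratch(size_t bytes)
{
    if (bytes > scratch_capacity) {
        cudaFree(d_scratch);              /* cudaFree(NULL) is a no-op */
        cudaMalloc(&d_scratch, bytes);
        scratch_capacity = bytes;
    }
    return d_scratch;
}

/* call once, after the loop, to give the memory back */
void release_scratch(void)
{
    cudaFree(d_scratch);
    d_scratch        = NULL;
    scratch_capacity = 0;
}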

What I would do in a situation like this is switch to the latest CUDA version and driver to exclude any historical issues that may have already been addressed in the newer software. If the observed problem persists, filing a bug report with a small repro code that reproduces the performance difference reliably could be in order.

Hi njuffa, thanks for the reply. I’m running Red Hat Linux. The GPUs are in persistent mode. The hardware is not the same: the first system has an Intel Xeon E5645 and a Tesla M2090; the second has an Intel Xeon E5645 and 2 Kepler K10 cards (4 GPUs total).

I’m using just one GPU of the Kepler K10, so the comparison is GPU vs. GPU. The total execution time is lower on the Kepler K10, but the “cudaMalloc zone” is slower.

No, the 25 seconds are just for cudaMalloc. The total number of cudaMalloc calls is very large: thousands and thousands.

The problem is that I don’t know the maximum buffer size I need, and I don’t want to hold on to unused GPU memory when it is not necessary.

Re: The problem is that I don’t know the maximum buffer size I need, and I don’t want to hold on to unused GPU memory when it is not necessary.

What is the smallest buffer size you anticipate occurring, and what is the largest? Why is it considered important to use as little GPU memory as possible in this section of code?

As for the platform differences, I have a vague recollection that increasing the number of GPUs in a system could have a negative impact on cudaMalloc() performance. Not sure whether that was really so and if my memory is correct, whether this has been addressed or is a fundamental limitation. txbob might have better information on this.

Hi njuffa. As you suggested, I moved the cudaMalloc calls outside the loop and preallocate the buffers. Fortunately, I’m not using too much memory, even in the worst case. Now the code is 60% faster and scales up when using 2, 3, or 4 GPUs. Before that modification, 2 GPUs had the same performance as one.
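Roughly, the change looks like this (sizes simplified): before, the cudaMalloc/cudaFree pair was inside the loop; now there is a single worst-case allocation outside it:

#include <cuda_runtime.h>

void process_all(int iterations, size_t worst_case_bytes)
{
    void *d_buf = NULL;
    cudaMalloc(&d_buf, worst_case_bytes);   /* once, before the loop */

    for (int i = 0; i < iterations; ++i) {
        /* ... kernels that use d_buf ... */
    }

    cudaFree(d_buf);                        /* once, after the loop */
}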

So your suspicion about cudaMalloc seems to be true. For some reason, increasing the number of GPUs on the node makes cudaMalloc performance decrease.

EDIT:
So your suspicion about cudaMalloc seems to be true. For some reason, increasing the number of GPUs on the node makes cudaMalloc performance decrease (even though I’m using just one GPU!!).

When UVA (unified virtual addressing) is enabled, which will be true for 64-bit Linux systems for example, a call to cudaMalloc for one GPU will create an allocation on that GPU, but it must also update the virtual address space mappings of all other GPUs in the system. This is one of the reasons that cudaMalloc is often a synchronous activity, and it also gives a clue as to why increasing the number of GPUs in the system may increase the time spent in a given cudaMalloc call, even though the call appears to pertain to only a single device.
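If you want to confirm that UVA is in effect on your system, a quick check is to print the unifiedAddressing property of each device (small standalone sketch):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("device %d (%s): unifiedAddressing = %d\n",
               d, prop.name, prop.unifiedAddressing);
    }
    return 0;
}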

I suspect that if you are only using one GPU, and you launch your application with appropriate usage of the environment variable:

CUDA_VISIBLE_DEVICES="0" ./myapp

then this effect should be mitigated. But I have not tested this assertion. I may be wrong.

Hi txbob, I tried your suggestion, but the behaviour is the same. Any other ideas?

Not really. Probably the best suggestion is the one already given by njuffa. Move the malloc operations outside of time-critical loops, to the extent possible.