cudaMalloc's speed is too slow

I have tested cudaMalloc’s speed.

==========================================================
cudaMalloc((void **) &xc, n * sizeof(float));
cudaFree(xc);

When I allocated 10 * sizeof(float) of memory, the code above took about 6e-3 s, while Linux's malloc/free took only about 4e-8 s. cudaMallocHost/cudaFreeHost took even longer than cudaMalloc/cudaFree.

The test was performed on Fedora 13 with gcc 4.4, CUDA 3.2, and a Tesla C2050.

Is there any way to make these functions faster? Thanks.

Not really, although it might be possible to make the calls go slightly faster. You don't show enough code for me to be certain, but I suspect that your timings for malloc/free on the CPU are fictions, the result of a carefully woven web of deceit spun by the OS. You see, when you call malloc on a machine with a virtual memory system, you don't really allocate any memory. The OS just marks off pages as being available for use by your application. It's only when you start accessing those pages that the VM system starts mapping them to something 'real.' As a result, the first access to that memory will be slow.

In contrast, cudaMalloc on the GPU has to make sure that the block of memory really is available right now, which takes a little extra time. And cudaMallocHost has it even worse, since it has to allocate real, contiguous, page-locked physical memory, almost in defiance of the usual VM system. If those physical pages aren't free, the OS has to swap out their contents first, before the call can return.
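Since the cost is per-call rather than per-byte, the usual workaround is to hoist cudaMalloc/cudaFree out of any hot loop and reuse one buffer. A minimal sketch, where the kernel, sizes, and iteration count are all placeholders for illustration:

```cuda
// Sketch: amortize allocation cost by allocating once and reusing the
// buffer across iterations, instead of cudaMalloc/cudaFree per step.
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20;
    float *xc = NULL;
    cudaMalloc((void **)&xc, n * sizeof(float));   // pay the ~ms cost once

    for (int step = 0; step < 1000; ++step) {
        // ... copy data in, launch, copy results out; no allocation here ...
        scale<<<(n + 255) / 256, 256>>>(xc, n);
    }

    cudaDeviceSynchronize();
    cudaFree(xc);                                  // free once at the end
    return 0;
}
```

The same reasoning applies to cudaMallocHost: allocate pinned staging buffers once at startup and keep them for the life of the program.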

The code for testing Linux’s malloc/free speed is:

==============================================

xc = (float *)malloc(n * sizeof(float));

free(xc);

==============================================

I think you are right. The OS just marks off pages. Thanks for your explanation.
