==========================================================
cudaMalloc((void **) &xc, n * sizeof(float));
cudaFree(xc);
When I allocate 10 * sizeof(float) bytes, the code above takes about 6e-3 s, whereas malloc/free on Linux take only about 4e-8 s. cudaMallocHost/cudaFreeHost take even longer than cudaMalloc/cudaFree.
The test was run on Fedora 13 with gcc-4.4, CUDA 3.2, and a Tesla C2050.
Is there any way to speed these functions up? Thanks.
==========================================================
Not really, although it might be possible to make the calls go slightly faster. You don’t show enough code for me to be certain, but I suspect that your timings for malloc/free on the CPU are fictions, the result of a carefully woven web of deceit spun by the OS. You see, when you call malloc on a machine with a virtual memory system, you don’t really allocate any physical memory. The OS just marks off pages of address space as being available for use by your application. It’s only when you start accessing those pages that the VM system starts mapping them to something ‘real.’ As a result, the first access to that memory will be slow.
In contrast, cudaMalloc on the GPU has to make sure that there’s really that block of memory available right now, which takes a little extra time. And cudaMallocHost has it even worse, since it has to allocate real, page-locked physical memory almost in defiance of the usual VM system. If those physical pages aren’t free, then the OS will have to swap out their contents first, before the call can return.