Why does cudaMallocHost take so much time compared to malloc?

Can anybody explain to me why cudaMallocHost takes so much time compared to malloc?
Is it because the memory locations have to be contiguous?
What is the series of operations in a cudaMallocHost call?

Thank you for your time

Apostolis Glenis

Just to check something: can you run two cudaMallocHost() calls in a row and time only the second one? I wonder if you are also seeing the CUDA context creation overhead in the first call.
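
Something like this minimal sketch would do it (my own illustration, not the code from this thread; the 64 MiB size is arbitrary). No CUDA call is made before the first cudaMallocHost, so that call absorbs context creation, while the second one times only the pinned allocation itself:

#include <sys/time.h>
#include <stdio.h>
#include <cuda_runtime.h>

/* Sketch: time two consecutive cudaMallocHost calls with a host timer.
 * The first call is also the first CUDA call, so it includes context
 * creation; the second measures only the pinned allocation. */
static double systimer(void)
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + (double)tp.tv_usec * 1.0e-6;
}

int main(void)
{
    const size_t bytes = 64 << 20; /* 64 MiB, arbitrary */
    float *p1 = 0, *p2 = 0;

    double t0 = systimer();
    cudaMallocHost((void **)&p1, bytes);  /* first CUDA call: context + pinning */
    double t1 = systimer();
    cudaMallocHost((void **)&p2, bytes);  /* second call: pinning only */
    double t2 = systimer();

    printf("first  cudaMallocHost: %f ms\n", (t1 - t0) * 1000.0);
    printf("second cudaMallocHost: %f ms\n", (t2 - t1) * 1000.0);

    cudaFreeHost(p1);
    cudaFreeHost(p2);
    return 0;
}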

I hadn’t thought of that.
Thank you.

But I would also like an answer to my other question, because according to my tests you need approximately a 1 GB dataset to overcome the overhead of cudaHostAlloc.

Apostolis

Something is seriously wrong with either your tests or your CUDA system if that is the case. My testing on 64-bit Linux with the CUDA 4.0 release version shows that a single cudaMallocHost or cudaHostAlloc call takes less than a millisecond to complete. You are not seriously suggesting that it requires 1 GB of transfers to amortize 700 microseconds of host API latency, are you?

For reference, this is the code I used to benchmark the calls:

#include <sys/time.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>

double systimer(void)
{
    struct timeval tp;
    register double result = 0.0;

    if ( gettimeofday(&tp, NULL) != -1 )
    {
        result  = (double)(tp.tv_sec);
        result += ((double)(tp.tv_usec)) * 1.0e-6;
    }
    return result;
}

int main(void)
{
    const size_t bsize = 2<<20; // 2 MiB buffer
    const int nbuffers = 10;
    float *buffers[nbuffers];

    (void)cudaFree(0); // Establish context

    // cudaMallocHost timing
    {
        double stime = systimer();
        for(int i=0; i<nbuffers; i++) {
            assert( cudaMallocHost(&buffers[i], bsize) == cudaSuccess );
        }
        double etime = systimer();
        fprintf(stdout, "%d cudaMallocHost calls of %ld bytes each cost %f\n",
                nbuffers, long(bsize), etime-stime);

        for(int i=0; i<nbuffers; i++) {
            assert( cudaFreeHost(buffers[i]) == cudaSuccess );
        }
    }

    // cudaHostAlloc timing
    {
        double stime = systimer();
        for(int i=0; i<nbuffers; i++) {
            assert( cudaHostAlloc(&buffers[i], bsize, cudaHostAllocDefault)
                    == cudaSuccess );
        }
        double etime = systimer();
        fprintf(stdout, "%d cudaHostAlloc calls of %ld bytes each cost %f\n",
                nbuffers, long(bsize), etime-stime);

        for(int i=0; i<nbuffers; i++) {
            assert( cudaFreeHost(buffers[i]) == cudaSuccess );
        }
    }

    cudaThreadExit();
}
cuda:~$ ./cudamalloc
10 cudaMallocHost calls of 2097152 bytes each cost 0.007455
10 cudaHostAlloc calls of 2097152 bytes each cost 0.007547
cuda:~$ ./cudamalloc
10 cudaMallocHost calls of 2097152 bytes each cost 0.007537
10 cudaHostAlloc calls of 2097152 bytes each cost 0.007319
cuda:~$ ./cudamalloc
10 cudaMallocHost calls of 2097152 bytes each cost 0.007265
10 cudaHostAlloc calls of 2097152 bytes each cost 0.007397

What I was getting at is that the first CUDA function call (whatever it is) in your code has to incur the overhead of context creation. If cudaMallocHost is the first CUDA call in your code, then it takes all the overhead, but all later calls will be quicker. That is different than cudaMallocHost having some intrinsic overhead whenever you use it.

I am not sure you understood where I was going with it.

Using the two files I am attaching, I got the following results:

time elapsed to allocate 128 MB using cudaMallocHost was 39.915905 milliseconds
time elapsed to clean 128 MB using cudaMallocHost was 48.702785 milliseconds
time elapsed to allocate 128 MB using Malloc was 0.011872 milliseconds
time elapsed to clean 128 MB using free was 0.008256 milliseconds
time elapsed to allocate 256 MB using cudaMallocHost was 79.764030 milliseconds
time elapsed to clean 256 MB using cudaMallocHost was 97.940384 milliseconds
time elapsed to allocate 256 MB using Malloc was 0.028352 milliseconds
time elapsed to clean 256 MB using free was 0.015264 milliseconds
time elapsed to allocate 512 MB using cudaMallocHost was 158.175903 milliseconds
time elapsed to clean 512 MB using cudaMallocHost was 195.915619 milliseconds
time elapsed to allocate 512 MB using Malloc was 0.027616 milliseconds
time elapsed to clean 512 MB using free was 0.014560 milliseconds
time elapsed to allocate 768 MB using cudaMallocHost was 241.382751 milliseconds
time elapsed to clean 768 MB using cudaMallocHost was 292.583832 milliseconds
time elapsed to allocate 768 MB using Malloc was 0.035904 milliseconds
time elapsed to clean 768 MB using free was 0.017696 milliseconds
time elapsed to allocate 1024 MB using cudaMallocHost was 316.563416 milliseconds
time elapsed to clean 1024 MB using cudaMallocHost was 391.105652 milliseconds
time elapsed to allocate 1024 MB using Malloc was 0.026240 milliseconds
time elapsed to clean 1024 MB using free was 0.015936 milliseconds

time elapsed to allocate 128 MB using Malloc and then pin was 41.733601 milliseconds
time elapsed to clean 128 MB using unpin and then free was 49.517345 milliseconds
time elapsed to allocate 256 MB using Malloc and then pin was 84.328445 milliseconds
time elapsed to clean 256 MB using unpin and then free was 99.042114 milliseconds
time elapsed to allocate 512 MB using Malloc and then pin was 168.566620 milliseconds
time elapsed to clean 512 MB using unpin and then free was 198.179459 milliseconds
time elapsed to allocate 768 MB using Malloc and then pin was 253.400162 milliseconds
time elapsed to clean 768 MB using unpin and then free was 296.367126 milliseconds
time elapsed to allocate 1024 MB using Malloc and then pin was 339.775055 milliseconds
time elapsed to clean 1024 MB using unpin and then free was 397.423126 milliseconds
time elapsed to allocate 128 MB using Malloc and then pin was 41.725792 milliseconds
time elapsed to clean 128 MB using unpin and then free was 49.381985 milliseconds
time elapsed to allocate 256 MB using Malloc and then pin was 84.429916 milliseconds
time elapsed to clean 256 MB using unpin and then free was 98.788963 milliseconds
time elapsed to allocate 512 MB using Malloc and then pin was 175.949432 milliseconds
time elapsed to clean 512 MB using unpin and then free was 197.737122 milliseconds
time elapsed to allocate 1024 MB using Malloc and then pin was 336.332123 milliseconds
time elapsed to clean 1024 MB using unpin and then free was 396.497070 milliseconds

I meant that cudaMallocHost takes so much more time than a simple malloc that, in order to compensate for that overhead through the faster transfers (which usually happen twice, H2D and then D2H), the dataset should be bigger than about 700 MB (assuming one two-way transfer, as I said).
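
For what it is worth, here is a rough way to put numbers on that break-even (a sketch only: the PCIe bandwidth figures are assumptions, not measurements from this thread, while the roughly 0.7 ms/MB allocate-plus-free cost is taken from the 1024 MB numbers above). Because the pinning cost in these measurements also scales with size, the break-even comes out as a number of reuses of the buffer rather than a fixed dataset size:

#include <stdio.h>

/* Back-of-envelope break-even sketch. The pinning cost per MB comes from
 * the 1024 MB allocate + free times above; the bandwidth figures are
 * assumed round numbers, not measured here. */
int main(void)
{
    const double pin_cost_ms_per_mb   = (316.6 + 391.1) / 1024.0; /* ~0.69 ms/MB */
    const double pageable_bw_mb_per_s = 3000.0;  /* assumption: pageable transfers */
    const double pinned_bw_mb_per_s   = 6000.0;  /* assumption: pinned transfers   */

    /* Time saved per MB by the faster transfer, counting one H2D + one D2H. */
    double saved_ms_per_mb = 2.0 * 1000.0 *
        (1.0 / pageable_bw_mb_per_s - 1.0 / pinned_bw_mb_per_s);

    printf("pinning overhead : %.3f ms per MB\n", pin_cost_ms_per_mb);
    printf("transfer savings : %.3f ms per MB (one H2D + one D2H)\n",
           saved_ms_per_mb);
    printf("pinning pays off after %.1f round trips over the same buffer\n",
           pin_cost_ms_per_mb / saved_ms_per_mb);
    return 0;
}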

Could anybody explain to me why the CUDA runtime takes so much time to pin a region of memory?

Thank you in advance

Apostolis

cudaHost.cu (1.14 KB)

cudaHost2.cu (869 Bytes)

Found the measurements I was looking for:
http://www.cs.virginia.edu/~mwb7w/cuda_support/memory_management_overhead.html

Can anybody explain to me why this is happening?

Apostolis

  1. malloc doesn’t necessarily provide a physical backing; that is usually deferred until you touch the pages for the first time. cudaMallocHost has to provide a physical backing at allocation time because it returns pinned memory (see the sketch below).
  2. cudaMallocHost maps the pages into the GPU’s address space, which involves updating a lot of state on the other side of the PCIe bus.
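
To make point 1 concrete, here is a minimal sketch (my own, not one of the attached files; the 256 MiB size is arbitrary) that times a bare malloc, the first touch of those pages, and a cudaMallocHost of the same size:

#include <sys/time.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Sketch: malloc returns almost instantly because the pages have no physical
 * backing yet; touching them with memset forces the page faults; cudaMallocHost
 * pays for backing (plus pinning and mapping) up front. */
static double systimer(void)
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + (double)tp.tv_usec * 1.0e-6;
}

int main(void)
{
    const size_t bytes = 256u << 20; /* 256 MiB, arbitrary */

    (void)cudaFree(0); /* establish the context so it is not timed below */

    double t0 = systimer();
    char *pageable = (char *)malloc(bytes);   /* virtual reservation only */
    double t1 = systimer();
    if (!pageable) return 1;
    memset(pageable, 0, bytes);               /* now the pages get backed */
    double t2 = systimer();

    char *pinned = NULL;
    double t3 = systimer();
    cudaMallocHost((void **)&pinned, bytes);  /* backed, pinned and mapped now */
    double t4 = systimer();

    printf("malloc         : %8.3f ms\n", (t1 - t0) * 1000.0);
    printf("first touch    : %8.3f ms\n", (t2 - t1) * 1000.0);
    printf("cudaMallocHost : %8.3f ms\n", (t4 - t3) * 1000.0);

    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}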

Thanks Tim, that was very helpful and informative.
BTW, I was under the impression that the mapping into the GPU address space was done when the runtime is asked to provide a device pointer for the pinned memory (to do zero-copy), not when the memory is allocated on the host.

It’s mapped either way so the GPU can DMA to/from the buffer. Zero-copy implies some additional requirements, though.
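
For reference, a minimal sketch of that zero-copy path (illustrative only; the kernel and buffer size are my own). Beyond plain pinned memory, the additional requirements are the cudaHostAllocMapped flag, cudaSetDeviceFlags(cudaDeviceMapHost) before the context is created, and cudaHostGetDevicePointer to obtain the GPU-side alias of the host buffer:

#include <stdio.h>
#include <cuda_runtime.h>

/* Illustrative kernel: increments a mapped host buffer in place over PCIe. */
__global__ void increment(float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main(void)
{
    const int n = 1 << 20;                    /* 1M floats, arbitrary size */

    cudaSetDeviceFlags(cudaDeviceMapHost);    /* must precede context creation */

    float *h_ptr = NULL, *d_alias = NULL;
    cudaHostAlloc((void **)&h_ptr, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_alias, h_ptr, 0);

    for (int i = 0; i < n; i++) h_ptr[i] = (float)i;

    /* The kernel reads and writes host memory directly; no cudaMemcpy needed. */
    increment<<<(n + 255) / 256, 256>>>(d_alias, n);
    cudaDeviceSynchronize();

    printf("h_ptr[0] = %f (expected 1.0)\n", h_ptr[0]);

    cudaFreeHost(h_ptr);
    return 0;
}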