Why does cudaMallocHost take so much time compared to malloc?

Can anybody explain to me why cudaMallocHost takes so much time compared to malloc?
Is it because the memory locations have to be contiguous?
What is the series of operations in a cudaMallocHost call?

Thank you for your time

Apostolis Glenis

Just to check something: can you run two cudaMallocHost() calls in a row and time only the second one? I wonder if you are also seeing the CUDA context creation overhead in the first call.
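
Something like this minimal sketch would do it (my own illustration, not the code from this thread; the 64 MiB size is arbitrary). No CUDA call is made before the first cudaMallocHost, so that call absorbs context creation, while the second one times only the pinned allocation itself:

#include <sys/time.h>
#include <stdio.h>
#include <cuda_runtime.h>

/* Sketch: time two consecutive cudaMallocHost calls with a host timer.
 * The first call is also the first CUDA call, so it includes context
 * creation; the second measures only the pinned allocation. */
static double systimer(void)
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + (double)tp.tv_usec * 1.0e-6;
}

int main(void)
{
    const size_t bytes = 64 << 20; /* 64 MiB, arbitrary */
    float *p1 = 0, *p2 = 0;

    double t0 = systimer();
    cudaMallocHost((void **)&p1, bytes);  /* first CUDA call: context + pinning */
    double t1 = systimer();
    cudaMallocHost((void **)&p2, bytes);  /* second call: pinning only */
    double t2 = systimer();

    printf("first  cudaMallocHost: %f ms\n", (t1 - t0) * 1000.0);
    printf("second cudaMallocHost: %f ms\n", (t2 - t1) * 1000.0);

    cudaFreeHost(p1);
    cudaFreeHost(p2);
    return 0;
}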

I hadn’t thought of that.
Thank you.

But I would also like an answer to my other question, because according to my tests you need approximately a 1 GB dataset to overcome the overhead of cudaHostAlloc.

Apostolis

Something is seriously wrong with either your tests or your CUDA system if that is the case. My testing on 64-bit Linux with the CUDA 4.0 release version shows that a single cudaMallocHost or cudaHostAlloc call takes less than a millisecond to complete. You are not seriously suggesting that it requires 1 GB of transfers to amortize 700 microseconds of host API latency, are you?

For reference, this is the code I used to benchmark the calls:

#include <sys/time.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>

double systimer(void)
{
    struct timeval tp;
    register double result = 0.0;

    if ( gettimeofday(&tp, NULL) != -1 )
    {
        result  = (double)(tp.tv_sec);
        result += ((double)(tp.tv_usec)) * 1.0e-6;
    }
    return result;
}

int main(void)
{
    const size_t bsize = 2<<20; // 2 MiB buffer
    const int nbuffers = 10;
    float *buffers[nbuffers];

    (void)cudaFree(0); // Establish context

    // cudaMallocHost timing
    {
        double stime = systimer();
        for(int i=0; i<nbuffers; i++) {
            assert( cudaMallocHost(&buffers[i], bsize) == cudaSuccess );
        }
        double etime = systimer();
        fprintf(stdout, "%d cudaMallocHost calls of %ld bytes each cost %f\n",
                nbuffers, long(bsize), etime-stime);

        for(int i=0; i<nbuffers; i++) {
            assert( cudaFreeHost(buffers[i]) == cudaSuccess );
        }
    }

    // cudaHostAlloc timing
    {
        double stime = systimer();
        for(int i=0; i<nbuffers; i++) {
            assert( cudaHostAlloc(&buffers[i], bsize, cudaHostAllocDefault)
                    == cudaSuccess );
        }
        double etime = systimer();
        fprintf(stdout, "%d cudaHostAlloc calls of %ld bytes each cost %f\n",
                nbuffers, long(bsize), etime-stime);

        for(int i=0; i<nbuffers; i++) {
            assert( cudaFreeHost(buffers[i]) == cudaSuccess );
        }
    }

    cudaThreadExit();
}
cuda:~$ ./cudamalloc
10 cudaMallocHost calls of 2097152 bytes each cost 0.007455
10 cudaHostAlloc calls of 2097152 bytes each cost 0.007547
cuda:~$ ./cudamalloc
10 cudaMallocHost calls of 2097152 bytes each cost 0.007537
10 cudaHostAlloc calls of 2097152 bytes each cost 0.007319
cuda:~$ ./cudamalloc
10 cudaMallocHost calls of 2097152 bytes each cost 0.007265
10 cudaHostAlloc calls of 2097152 bytes each cost 0.007397

What I was getting at is that the first CUDA function call (whatever it is) in your code has to incur the overhead of context creation. If cudaMallocHost is the first CUDA call in your code, then it takes all the overhead, but all later calls will be quicker. That is different than cudaMallocHost having some intrinsic overhead whenever you use it.

I am not sure you understood where I was going with it.

Using the two files I am attaching, I got the following results:

time elapsed to allocate 128 MB using cudaMallocHost was 39.915905 milliseconds
time elapsed to clean 128 MB using cudaMallocHost was 48.702785 milliseconds
time elapsed to allocate 128 MB using Malloc was 0.011872 milliseconds
time elapsed to clean 128 MB using free was 0.008256 milliseconds
time elapsed to allocate 256 MB using cudaMallocHost was 79.764030 milliseconds
time elapsed to clean 256 MB using cudaMallocHost was 97.940384 milliseconds
time elapsed to allocate 256 MB using Malloc was 0.028352 milliseconds
time elapsed to clean 256 MB using free was 0.015264 milliseconds
time elapsed to allocate 512 MB using cudaMallocHost was 158.175903 milliseconds
time elapsed to clean 512 MB using cudaMallocHost was 195.915619 milliseconds
time elapsed to allocate 512 MB using Malloc was 0.027616 milliseconds
time elapsed to clean 512 MB using free was 0.014560 milliseconds
time elapsed to allocate 768 MB using cudaMallocHost was 241.382751 milliseconds
time elapsed to clean 768 MB using cudaMallocHost was 292.583832 milliseconds
time elapsed to allocate 768 MB using Malloc was 0.035904 milliseconds
time elapsed to clean 768 MB using free was 0.017696 milliseconds
time elapsed to allocate 1024 MB using cudaMallocHost was 316.563416 milliseconds
time elapsed to clean 1024 MB using cudaMallocHost was 391.105652 milliseconds
time elapsed to allocate 1024 MB using Malloc was 0.026240 milliseconds
time elapsed to clean 1024 MB using free was 0.015936 milliseconds

time elapsed to allocate 128 MB using Malloc and then pin was 41.733601 milliseconds
time elapsed to clean 128 MB using unpin and then free was 49.517345 milliseconds
time elapsed to allocate 256 MB using Malloc and then pin was 84.328445 milliseconds
time elapsed to clean 256 MB using unpin and then free was 99.042114 milliseconds
time elapsed to allocate 512 MB using Malloc and then pin was 168.566620 milliseconds
time elapsed to clean 512 MB using unpin and then free was 198.179459 milliseconds
time elapsed to allocate 768 MB using Malloc and then pin was 253.400162 milliseconds
time elapsed to clean 768 MB using unpin and then free was 296.367126 milliseconds
time elapsed to allocate 1024 MB using Malloc and then pin was 339.775055 milliseconds
time elapsed to clean 1024 MB using unpin and then free was 397.423126 milliseconds
time elapsed to allocate 128 MB using Malloc and then pin was 41.725792 milliseconds
time elapsed to clean 128 MB using unpin and then free was 49.381985 milliseconds
time elapsed to allocate 256 MB using Malloc and then pin was 84.429916 milliseconds
time elapsed to clean 256 MB using unpin and then free was 98.788963 milliseconds
time elapsed to allocate 512 MB using Malloc and then pin was 175.949432 milliseconds
time elapsed to clean 512 MB using unpin and then free was 197.737122 milliseconds
time elapsed to allocate 1024 MB using Malloc and then pin was 336.332123 milliseconds
time elapsed to clean 1024 MB using unpin and then free was 396.497070 milliseconds

I meant that cudaMallocHost takes so much more time than a simple malloc that, in order to compensate for that overhead through the faster transfers (which usually happen twice, H2D and then D2H), the dataset should be bigger than about 700 MB (assuming one two-way transfer, as I said).
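
For what it is worth, here is a rough way to put numbers on that break-even (a sketch only: the PCIe bandwidth figures are assumptions, not measurements from this thread, while the roughly 0.7 ms/MB allocate-plus-free cost is taken from the 1024 MB numbers above). Because the pinning cost in these measurements also scales with size, the break-even comes out as a number of reuses of the buffer rather than a fixed dataset size:

#include <stdio.h>

/* Back-of-envelope break-even sketch. The pinning cost per MB comes from
 * the 1024 MB allocate + free times above; the bandwidth figures are
 * assumed round numbers, not measured here. */
int main(void)
{
    const double pin_cost_ms_per_mb   = (316.6 + 391.1) / 1024.0; /* ~0.69 ms/MB */
    const double pageable_bw_mb_per_s = 3000.0;  /* assumption: pageable transfers */
    const double pinned_bw_mb_per_s   = 6000.0;  /* assumption: pinned transfers   */

    /* Time saved per MB by the faster transfer, counting one H2D + one D2H. */
    double saved_ms_per_mb = 2.0 * 1000.0 *
        (1.0 / pageable_bw_mb_per_s - 1.0 / pinned_bw_mb_per_s);

    printf("pinning overhead : %.3f ms per MB\n", pin_cost_ms_per_mb);
    printf("transfer savings : %.3f ms per MB (one H2D + one D2H)\n",
           saved_ms_per_mb);
    printf("pinning pays off after %.1f round trips over the same buffer\n",
           pin_cost_ms_per_mb / saved_ms_per_mb);
    return 0;
}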

Could anybody explain to me why the CUDA runtime takes so much time to pin a region of memory?

Thank you in advance

Apostolis

cudaHost.cu (1.14 KB)

cudaHost2.cu (869 Bytes)

Found the measurements I was looking for:
http://www.cs.virginia.edu/~mwb7w/cuda_support/memory_management_overhead.html

Can anybody explain to me why this is happening?

Apostolis

  1. malloc doesn’t necessarily provide a physical backing; that is usually deferred until you touch the pages for the first time. cudaMallocHost has to provide a physical backing at allocation time because it returns pinned memory (see the sketch below).
  2. cudaMallocHost maps the pages into the GPU’s address space, which involves updating a lot of state on the other side of the PCIe bus.
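
To make point 1 concrete, here is a minimal sketch (my own, not one of the attached files; the 256 MiB size is arbitrary) that times a bare malloc, the first touch of those pages, and a cudaMallocHost of the same size:

#include <sys/time.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Sketch: malloc returns almost instantly because the pages have no physical
 * backing yet; touching them with memset forces the page faults; cudaMallocHost
 * pays for backing (plus pinning and mapping) up front. */
static double systimer(void)
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + (double)tp.tv_usec * 1.0e-6;
}

int main(void)
{
    const size_t bytes = 256u << 20; /* 256 MiB, arbitrary */

    (void)cudaFree(0); /* establish the context so it is not timed below */

    double t0 = systimer();
    char *pageable = (char *)malloc(bytes);   /* virtual reservation only */
    double t1 = systimer();
    if (!pageable) return 1;
    memset(pageable, 0, bytes);               /* now the pages get backed */
    double t2 = systimer();

    char *pinned = NULL;
    double t3 = systimer();
    cudaMallocHost((void **)&pinned, bytes);  /* backed, pinned and mapped now */
    double t4 = systimer();

    printf("malloc         : %8.3f ms\n", (t1 - t0) * 1000.0);
    printf("first touch    : %8.3f ms\n", (t2 - t1) * 1000.0);
    printf("cudaMallocHost : %8.3f ms\n", (t4 - t3) * 1000.0);

    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}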

Thanks Tim, that was very helpful and informative.
BTW, I was under the impression that the mapping into the GPU address space was done when the runtime is asked to provide a device pointer for the pinned memory (to do zero-copy), not when the memory is allocated on the host.

It’s mapped either way so the GPU can DMA to/from the buffer. Zero-copy implies some additional requirements, though.
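
For reference, a minimal sketch of that zero-copy path (illustrative only; the kernel and buffer size are my own). Beyond plain pinned memory, the additional requirements are the cudaHostAllocMapped flag, cudaSetDeviceFlags(cudaDeviceMapHost) before the context is created, and cudaHostGetDevicePointer to obtain the GPU-side alias of the host buffer:

#include <stdio.h>
#include <cuda_runtime.h>

/* Illustrative kernel: increments a mapped host buffer in place over PCIe. */
__global__ void increment(float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main(void)
{
    const int n = 1 << 20;                    /* 1M floats, arbitrary size */

    cudaSetDeviceFlags(cudaDeviceMapHost);    /* must precede context creation */

    float *h_ptr = NULL, *d_alias = NULL;
    cudaHostAlloc((void **)&h_ptr, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_alias, h_ptr, 0);

    for (int i = 0; i < n; i++) h_ptr[i] = (float)i;

    /* The kernel reads and writes host memory directly; no cudaMemcpy needed. */
    increment<<<(n + 255) / 256, 256>>>(d_alias, n);
    cudaDeviceSynchronize();

    printf("h_ptr[0] = %f (expected 1.0)\n", h_ptr[0]);

    cudaFreeHost(h_ptr);
    return 0;
}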