cudaHostAlloc: Pinned memory creation very slow!


I am allocating pinned memory (216 MB) using cudaHostAlloc (CUDA 4.0). It takes about 3 seconds! Is that normal? malloc takes about 300 ms. I expected it to be slower, but is it normal for it to be 10 times slower?


Is cudaHostAlloc() the first CUDA function you call? On some platforms, CUDA can take a long time to initialize, so the problem you are seeing might not be cudaHostAlloc, but rather the implied CUDA context creation.

Is this Linux or Windows?


No, it is not the first CUDA function; I allocate memory and create streams before this call. It is Linux (Fedora Core 13). The workstation has 48 cores (8x 6-core AMD Istanbul CPUs) and 128 GB of memory, with a Tesla C1060.

malloc takes 300 ms? Something seems very wrong with your system. (malloc only updates PTEs and does not allocate pages to your process until you take page faults on those pages, so it will always be faster than something that allocates pinned memory, like cudaHostAlloc.)

I don’t think there is anything wrong with it. It is an old system; bus speed, RAM speed, etc. are limited. On a Sandy Bridge Intel system, it takes about 100 ms to allocate 216 MB.

Unless you are timing something other than the allocation, this is pretty surprising. On my Core 2 Duo laptop (Mac OS X), I can allocate 216 MB in Python in 37 microseconds:

import numpy as np
import time

start = time.time()
a = np.empty(shape=216*1024*1024, dtype=np.int8)
end = time.time()
print 'alloc: %fus' % ( (end-start)*1e6 ), a.nbytes

Which prints:

drl036:~ stan$ python
alloc 36.954880us 226492416

Compared to that, 100 milliseconds is a really long time. I should go try this in Linux…

OK, so I tried this on a Linux system (Ubuntu 11.04 64-bit, CUDA 4.0):

import pycuda.autoinit
import pycuda.driver
import numpy as np
import time

size = 216*1024*1024

start = time.time()
a = np.empty(shape=size, dtype=np.int8)
end = time.time()
print 'Normal malloc %fus' % ( (end-start)*1e6 ), a.nbytes

start = time.time()
a_pagelocked = pycuda.driver.pagelocked_empty(shape=size, dtype=np.int8)
end = time.time()
print 'Pinned malloc %fus' % ( (end-start)*1e6 ), a_pagelocked.nbytes

And I get this:

(env)stan@gonzales:~$ python
Normal malloc 15.020370us 226492416
Pinned malloc 81710.100174us 226492416

So the pinned malloc is quite a bit slower than the normal malloc on my system, but the overall speed is much faster than what you are seeing.

I forgot that I had replaced “malloc” with the “new” operator. Sorry. So those timings were for the “new” operator; malloc is much faster (similar to your results).

This is slower:

complex *tmp;
long int sz = 1024 * 1024 * 216 / sizeof(complex);
// start
tmp = new complex[sz];
// stop

This is faster:

char *tmp;
long int sz = 1024 * 1024 * 216;
// start
tmp = new char[sz];
// stop

Even though they allocate the same amount of memory, the new operator on complex is far slower because it runs a constructor on every element.

But this doesn’t explain why cudaHostAlloc takes that long; 3 seconds is insane. I’ll check the system log file to see if there are any hardware errors.

Do you think using the driver API would change anything? Is it possible that cudaHostAlloc behaves like the new operator on a complex type?