cudaHostAlloc: Pinned memory creation very slow!


I am allocating pinned memory (216 MB) using cudaHostAlloc (CUDA 4.0). It takes about 3 seconds! Is that normal? malloc takes about 300 ms. I expected it to be slower, but is it normal for it to be 10 times slower?


Is cudaHostAlloc() the first CUDA function you call? On some platforms, CUDA can take a long time to initialize, so the problem you are seeing might not be cudaHostAlloc, but rather the implied CUDA context creation.

Is this Linux or Windows?


No, it is not the first CUDA function; I allocate memory and create streams before this call. It is Linux (Fedora Core 13). The workstation has 48 cores (8x 6-core AMD Istanbul CPUs) and 128 GB of memory, with a Tesla C1060.

malloc takes 300 ms? Something seems very wrong with your system. (malloc only updates PTEs and does not allocate pages to your process until you take page faults on those pages, so it will always be faster than something that allocates pinned memory, like cudaHostAlloc.)

I don’t think there is anything wrong with it. It is an old system; bus speed, RAM speed, etc. are limited. On a Sandy Bridge Intel system, it takes about 100 ms to allocate 216 MB.

Unless you are timing something other than the allocation, this is pretty surprising. On my Core 2 Duo laptop (Mac OS X), I can allocate 216 MB in Python in 37 microseconds:

import numpy as np
import time

start = time.time()
a = np.empty(shape=216*1024*1024, dtype=np.int8)
end = time.time()
print 'alloc: %fus' % ( (end-start)*1e6 ), a.nbytes

Which prints:

drl036:~ stan$ python
alloc 36.954880us 226492416

Compared to that, 100 milliseconds is a really long time. I should go try this in Linux…

OK, so I tried this on a Linux system (Ubuntu 11.04 64-bit, CUDA 4.0):

import pycuda.autoinit
import pycuda.driver
import numpy as np
import time

size = 216*1024*1024

start = time.time()
a = np.empty(shape=size, dtype=np.int8)
end = time.time()
print 'Normal malloc %fus' % ( (end-start)*1e6 ), a.nbytes

start = time.time()
a_pagelocked = pycuda.driver.pagelocked_empty(shape=size, dtype=np.int8)
end = time.time()
print 'Pinned malloc %fus' % ( (end-start)*1e6 ), a_pagelocked.nbytes

And I get this:

(env)stan@gonzales:~$ python
Normal malloc 15.020370us 226492416
Pinned malloc 81710.100174us 226492416

So the pinned malloc is quite a bit slower than the normal malloc on my system, but the overall speed is much faster than what you are seeing.

I forgot that I had replaced “malloc” with the “new” operator. Sorry. So those timings were for the “new” operator; malloc is much faster (similar to your results).

This is slower:

complex *tmp;
long int sz = 1024 * 1024 * 216 / sizeof(complex);
// start
tmp = new complex[sz];
// stop

This is faster:

char *tmp;
long int sz = 1024 * 1024 * 216;
// start
tmp = new char[sz];
// stop

Even though they allocate the same amount of memory, the new operator on complex is far slower because it runs a constructor on every element.

But this doesn’t explain why cudaHostAlloc takes that long; 3 seconds is insane. I’ll check the system log file to see if there are any hardware errors.

Do you think using the driver API would change anything? Is it possible that cudaHostAlloc behaves like the new operator on a complex type?